Studies in Computational Intelligence 1120
Boris Kryzhanovsky · Witali Dunin-Barkowski · Vladimir Redko · Yury Tiumentsev · Valentin Klimov Editors
Advances in Neural Computation, Machine Learning, and Cognitive Research VII Selected Papers from the XXV International Conference on Neuroinformatics, October 23–27, 2023, Moscow, Russia
Studies in Computational Intelligence Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
1120
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
Editors
Boris Kryzhanovsky, Scientific Research Institute for System Analysis, Russian Academy of Sciences, Moscow, Russia
Witali Dunin-Barkowski, Scientific Research Institute for System Analysis, Russian Academy of Sciences, Moscow, Russia
Vladimir Redko, Scientific Research Institute for System Analysis, Russian Academy of Sciences, Moscow, Russia
Yury Tiumentsev, Moscow Aviation Institute (National Research University), Moscow, Russia
Valentin Klimov, Moscow Engineering Physics Institute, Moscow, Russia
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-031-44864-5 ISBN 978-3-031-44865-2 (eBook) https://doi.org/10.1007/978-3-031-44865-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
The international conference "Neuroinformatics" is an annual multidisciplinary scientific forum dedicated to the theory and applications of artificial neural networks, the problems of neuroscience and biophysics systems, artificial intelligence, adaptive behavior and cognitive studies. The scope of the conference is wide, ranging from the theory of artificial neural networks, machine learning algorithms and evolutionary programming to neuroimaging and neurobiology.

Main topics of the conference cover theoretical and applied research in the following fields:

neurobiology and neurobionics: cognitive studies, neural excitability, cellular mechanisms, cognition and behavior, learning and memory, motivation and emotion, bioinformatics, adaptive behavior and evolutionary modeling, brain–computer interface;

neural networks: neurocomputing and learning, paradigms and architectures, biological foundations, computational neuroscience, neurodynamics, neuroinformatics, deep learning networks, neuro-fuzzy systems, hybrid intelligent systems;

machine learning: pattern recognition, Bayesian networks, kernel methods, generative models, information theoretic learning, reinforcement learning, relational learning, dynamical models, classification and clustering algorithms, self-organizing systems;

applications: medicine, signal processing, control, simulation, robotics, hardware implementations, security, finance and business, data mining, natural language processing, image processing and computer vision.

More than 120 reports were presented at the Neuroinformatics-2023 Conference. Of these, 52 papers were selected, for which articles were prepared and published in this volume.

Boris Kryzhanovsky
Witali Dunin-Barkowski
Vladimir Redko
Yury Tiumentsev
Valentin Klimov
Organization
Editorial Board

Boris Kryzhanovsky, Scientific Research Institute for System Analysis of Russian Academy of Sciences, Moscow
Witali Dunin-Barkowsky, The Moscow Institute of Physics and Technology (State University)
Vladimir Red'ko, Scientific Research Institute for System Analysis of Russian Academy of Sciences, Moscow
Yury Tiumentsev, Moscow Aviation Institute (National Research University)
Valentin Klimov, National Research Nuclear University (MEPhI), Moscow
Advisory Board

Alexander N. Gorban (Tentative Chair of the International Advisory Board), Department of Mathematics, University of Leicester, UK
Nicola Kasabov, Professor of Computer Science and Director KEDRI, Auckland University of Technology, New Zealand
Jun Wang, PhD, FIEEE FIAPR, Chair Professor of Computational Intelligence, Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong
Program Committee of the XXV International Conference "Neuroinformatics-2023"

General Chair

Kryzhanovskiy Boris, Scientific Research Institute for System Analysis, Moscow
Gorban Alexander Nikolaevich (Co-chair), University of Leicester, Great Britain
Dunin-Barkowski Witali (Co-chair), The Moscow Institute of Physics and Technology (State University)
Tiumentsev Yury (Co-chair), Moscow Aviation Institute (National Research University)
Program Committee

Ajith Abraham, Machine Intelligence Research Labs (MIR Labs), Scientific Network for Innovation and Research Excellence, Washington, USA
Baidyk Tatiana, The National Autonomous University of Mexico, Mexico
Balaban Pavel, Institute of Higher Nervous Activity and Neurophysiology of RAS, Moscow
Borisyuk Roman, Plymouth University, UK
Burtsev Mikhail, The Moscow Institute of Physics and Technology (State University)
Cangelosi Angelo, Plymouth University, UK
Chizhov Anton, Ioffe Physical Technical Institute, Russian Academy of Sciences, St Petersburg
Dolenko Sergey, Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University
Dosovitskiy Alexey, Albert-Ludwigs-Universität, Freiburg, Germany
Dudkin Alexander, United Institute of Informatics Problems, Minsk, Belarus
Ezhov Alexander, State Research Center of Russian Federation "Troitsk Institute for Innovation & Fusion Research", Moscow
Golovko Vladimir, Brest State Technical University, Belarus
Hayashi Yoichi, Meiji University, Kawasaki, Japan
Husek Dusan, Institute of Computer Science, Czech Republic
Izhikevich Eugene, Braincorporation, San Diego, USA
Jankowski Stanislaw, Warsaw University of Technology, Poland
Kaganov Yuri, Bauman Moscow State Technical University
Kazanovich Yakov, Institute of Mathematical Problems of Biology of RAS, Pushchino, Moscow Region
Kecman Vojislav, Virginia Commonwealth University, USA
Kernbach Serge, Cybertronica Research, Research Center of Advanced Robotics and Environmental Science, Stuttgart, Germany
Klimov Valentin, National Research Nuclear University (MEPhI), Moscow
Koprinkova-Hristova Petia, Institute of Information and Communication Technologies, Bulgaria
Litinsky Leonid, Scientific Research Institute for System Analysis, Moscow
Makarenko Nikolay, The Central Astronomical Observatory of the Russian Academy of Sciences at Pulkovo, Saint-Petersburg
Mishulina Olga, National Research Nuclear University (MEPhI), Moscow
Narynov Sergazy, Alem Research, Almaty, Kazakhstan
Oseledets Ivan, Skolkovo Institute of Science and Technology (Skoltech)
Panov Aleksandr, Federal Research Center "Informatics and Control" RAS, Moscow
Pareja-Flores Cristobal, Complutense University of Madrid, Spain
Vladimir Red'ko, Scientific Research Institute for System Analysis of Russian Academy of Sciences, Moscow
Samsonovich Alexei, George Mason University, USA
Sandamirskaya Yulia, Institute of Neuroinformatics, UZH/ETHZ, Switzerland
Shaposhnikov Dmitry, Scientific Research Center of Neurotechnologies, Southern Federal University, Rostov-on-Don
Shepelev Igor, Scientific Research Center of Neurotechnologies, Southern Federal University, Rostov-on-Don
Shumskiy Sergey, P.N. Lebedev Physical Institute of the Russian Academy of Sciences, Moscow
Terekhov Serge, JSC "Svyaznoy Logistics", Moscow
Tiumentsev Yury, Moscow Aviation Institute (National Research University)
Trofimov Alexander, National Research Nuclear University (MEPhI), Moscow
Tsodyks Misha, Weizmann Institute of Science, Rehovot, Israel
Tsoy Yury, Institut Pasteur Korea, Republic of Korea
Ushakov Vadim, National Research Centre "Kurchatov Institute", Moscow
Vvedensky Viktor, National Research Centre "Kurchatov Institute", Moscow
Wunsch Donald, Missouri University of Science and Technology
Yakhno Vladimir, The Institute of Applied Physics of the Russian Academy of Sciences, Nizhny Novgorod
Yudin Dmitry, The Moscow Institute of Physics and Technology (State University)
Contents
Neuroinformatics and Artificial Intelligence

Evolution of Efficient Symbolic Communication Codes . . . . . . 3
Anton Kolonin

Solving the Problem of Diagnosing a Disease by ECG on the PTB-XL Dataset Using Deep Learning . . . . . . 13
Vyacheslav Razin, Alexander Krasnov, Denis Karchkov, Viktor Moskalenko, Denis Rodionov, Nikolai Zolotykh, Lev Smirnov, and Grigory Osipov

Zero-Shot NER via Extractive Question Answering . . . . . . 22
Danil Tirskikh and Vasily Konovalov

TreeCurveNet - An improved CurveNet for Tree Species Classification . . . . . . 32
Che Zhang, Yaowen Huang, Elizaveta K. Sakharova, Anton I. Kanev, and Valery I. Terekhov

Dialogue Graphs: Enhancing Response Selection Through Target Node Separation . . . . . . 39
Grigory Minakov, Mumtozbek Akhmadjonov, and Denis Kuznetsov

Research Methods for Fake News Detection in Bangla Text . . . . . . 54
A. S. M. Humaun Kabir, Alexander Alexandrovich Kharlamov, and Ilia Mikhailovich Voronkov

On the Question of the Dynamic Theory of Intelligence . . . . . . 61
Yuriy T. Kaganov

Offline Deep Reinforcement Learning for Robotic Arm Control in the ManiSkill Environment . . . . . . 72
Huzhenyu Zhang and Dmitry Yudin

Neuromorphic Computing and Deep Learning

Spiking Neural Network with Tetrapartite Synapse . . . . . . 83
Sergey V. Stasenko and Victor B. Kazantsev

SAMDIT: Systematic Study of Adding Memory to Divided Input in the Transformer to Process Long Documents . . . . . . 93
Arij Al Adel
Towards Solving Classification Tasks Using Spiking Neurons with Fixed Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Alexander G. Sboev, Alexey V. Serenko, Dmitry E. Kunitsyn, Roman B. Rybka, and Vadim V. Putrolaynen A Spiking Neuron Synaptic Plasticity Model Optimized for Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Mikhail Kiselev, Alexander Ivanitsky, Dmitry Ivanov, and Denis Larionov Centre-Lateral Threshold Filtering as a Method for Neuromorphic Data Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Viacheslav E. Antsiperov and Elena R. Pavlyukova Neural Networks and Cognitive Sciences Permanent Sharp Switches in Brain Waves During Spoken Word Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Victor Vvedensky, Vitaly Verkhlyutov, and Konstantin Gurtovoy Cognitive Neuro-Fuzzy Control Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 Lev A. Stankevich Neurocognitive Processing of Attitude-Consistent and Attitude-Inconsistent Deepfakes: N400 Study . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Eliana Monahhova, Alexandra N. Morozova, Dmitry A. Khoroshilov, Dmitry O. Bredikhin, Anna N. Shestakova, Victoria V. Moiseeva, and Vasily A. Klucharev Real-Time Movement-Related EEG Phenomena Detection for Portable BCI Devices. Neural Network Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 A. Kh. Ekizyan, P. D. Shaposhnikov, D. V. Kostulin, D. G. Shaposhnikov, and V. N. Kiroy Recognition of Spoken Words from MEG Data Using Covariance Patterns . . . . . 165 Vitaly Verkhlyutov, Evgenii Burlakov, Victor Vvedensky, Konstantin Gurtovoy, and Vadim Ushakov Non-visual Eye-Movements Model During Performing Cognitive Tasks in Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Polina A. Lekhnitskaya
Mechanisms for Contribution of Modifiable Inhibition to Increasing Signal-to-Noise Ratio and Contrasted Representations of Sensory Stimuli in the Neocortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Isabella G. Silkis A Photostimuli Presenting Device for Customized SSVEP-based Brain-Computer Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 Alexey V. Kozin, Anton K. Gerasimov, Alexander V. Pavlov, and Maxim A. Bakaev Graph Neural Networks for Analysis of rs-fMRI Differences in Open vs Closed Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Tatiana Medvedeva, Irina Knyazeva, Ruslan Masharipov, Maxim Kireev, and Alexander Korotkov Does a Recurrent Neural Network Form Recognizable Representations of a Fixed Event Series? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 Galiya M. Markova and Sergey I. Bartsev Language Models Explain Recommendations Based on Meta-Information . . . . . 214 Olga Sofronova and Dilyara Zharikova Analysis of Text Data Reliability Based on the Audience Reactions to the Message Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Igor M. Artamonov and Yana N. Artamonova Adaptive Behavior and Evolutionary Simulation Analysing Family of Pareto Front-Based Evolutionary Algorithms for PINNs: A Case Study of Solving the Laplace Equation with Discontinuous Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 Tatiana Lazovskaya, Dmitriy Tarkhov, Maria Chistyakova, Egor Razumov, Anna Sergeeva, and Veronika Palamarchuk Unawareness as a Cause of Determinism Violation. A Metaphoric Model . . . . . . 247 Vladimir B. Kotov and Zarema B. Sokhova The Variable Resistor Under a High-Frequency Signal . . . . . . . . . . . . . . . . . . . . . . 257 Galina A. Beskhlebnova and Vladimir B. Kotov Modeling of Natural Needs of Autonomous Agents . . . . . . . . . . . . . . . . . . . . . . . . 267 Zarema B. Sokhova and Vladimir G. Red’ko Study of Modifications of Gender Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . 279 Gavriil Kupriyanov, Igor Isaev, and Sergey Dolenko
Modern Methods and Technologies in Neurobiology Mean-Field Model of Brain Rhythms Controlled by Glial Cells . . . . . . . . . . . . . . 293 Sergey V. Stasenko and Tatiana A. Levanova Modeling Neuron-Like Agents with a Network Internal Structure . . . . . . . . . . . . 300 Liudmila Zhilyakova Cognitive Functions of Cerebellum and Educational Neuroscience . . . . . . . . . . . . 308 Vladislav Dorofeev The Role of Pulvinar Nucleus as a Sinchronizer of Cortical Activity for Visual Target Detection is Caused by Its Functioning as Superior Colliculus–Cortex Intermediary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 I. A. Smirnitskaya The Influence of Anxiety and Exploratory Activity on Learning in Rats: Mismatch-Induced c-Fos Expression in Deep and Superficial Cortical Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 Alexandra I. Bulava, Zhanna A. Osipova, Vasiliy V. Arapov, Alexander G. Gorkin, Igor O. Alexandrov, Tatiana N. Grechenko, and Yuri I. Alexandrov Applications of Neural Networks Image Processing with Reservoir Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 337 Mikhail S. Tarkov and Victoria V. Ivanova Investigation of a Spike Segment Neuron in the Offline Multi-Object Tracking Task with Embeddings Constructed by a Convolutional Network . . . . . 346 Ivan Fomin, Anton Korsakov, Viktoria Ivanova, and Aleksandr Bakhshiev Realization of Super-Resolution Using Bicubic Interpolation and an Efficient Subpixel Model for Preprocessing Low Spatial Resolution Microscopic Images of Sputum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355 I. G. Shelomentseva An Intelligent Day Ahead Solar Plant’s Power Forecasting System . . . . . . . . . . . 362 Ekaterina A. Engel and Nikita E. Engel Determining the Significance of Input Features in Predicting Magnetic Storms Using Machine Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370 Roman Vladimirov, Vladimir Shirokiy, Oleg Barinov, and Irina Myagkova
Forest Damage Segmentation Using Machine Learning Methods on Satellite Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380 Natalya S. Podoprigorova, Grigory A. Savchenko, Ksenia R. Rabcevich, Anton I. Kanev, Andrey V. Tarasov, and Andrey N. Shikohov Binding Affinity Prediction in Protein-Protein Complexes Using Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389 Elizaveta A. Bogdanova, Valery N. Novoseletsky, and Konstantin V. Shaitan Domain Adaptation of Spacecraft Data in Neural Network Prediction of Geomagnetic Dst Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 Elvir Z. Karimov, Vladimir R. Shirokiy, Oleg G. Barinov, and Irina N. Myagkova LQR Approach to Aircraft Control Based on the Adaptive Critic Design . . . . . . . 406 Maxim I. Chulin, Yury V. Tiumentsev, and Ruslan A. Zarubin SNAC Approach to Aircraft Motion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420 Yury V. Tiumentsev and Roman A. Tshay Generating Generalized Abstracts Using a Hybrid Intelligent Information System for Analysis of Judicial Practice of Arbitration Courts . . . . . . . . . . . . . . . 435 Maria O. Taran, Georgiy I. Revunkov, and Yuriy E. Gapanyuk Integration of Data from Various Physical Methods in Solving Inverse Problems of Spectroscopy by Machine Learning Methods . . . . . . . . . . . . . . . . . . . 445 Artem Guskov, Igor Isaev, Sergey Burikov, Tatiana Dolenko, Kirill Laptinskiy, and Sergey Dolenko The Use of a priori Information in the Neural Network Solution of the Inverse Problem of Exploration Geophysics . . . . . . . . . . . . . . . . . . . . . . . . . . 455 Igor Isaev, Ivan Obornev, Eugeny Obornev, Eugeny Rodionov, Mikhail Shimelevich, and Sergey Dolenko Neural Network Theory, Concepts and Architectures Study of Rescaling Mechanism Utilization in Binary Neural Networks . . . . . . . . 467 Ilia Zharikov and Kirill Ovcharenko Estimating the Transfer Learning Ability of a Deep Neural Networks by Means of Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479 German I. Magai and Artem A. Soroka
Russian Language Speech Generation from Facial Video Recordings Using Variational Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489 Miron M. Leonov, Artem A. Soroka, and Alexander G. Trofimov Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
Neuroinformatics and Artificial Intelligence
Evolution of Efficient Symbolic Communication Codes

Anton Kolonin 1,2 (B)

1 Novosibirsk State University, Pirogova 1, Novosibirsk 630090, Russia
[email protected]
2 Aigents, Pravdy 6-12, Novosibirsk 630090, Russia
Abstract. The paper explores how the structure of human natural language can be seen as a product of the evolution of an inter-personal communication code, targeting maximization of culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score. The exploration is done as part of a larger unsupervised language learning effort, in which an attempt is made to perform meta-learning in a space of hyper-parameters, maximizing the F1 score based on the "ground truth" language structure by means of maximizing the metrics mentioned above. The paper presents preliminary results of a cross-lingual word-level segmentation (tokenization) study for Russian, Chinese and English, as well as a subword segmentation (morpho-parsing) study for English. It is found that language structure at the level of word segmentation or tokenization can be seen as driven by all of these metrics, with anti-entropy being more relevant to English and Russian while compression factor is more specific to Chinese. The study of subword segmentation or morpho-parsing on the English lexicon has revealed that segmentation accuracy is associated with compression factor, while, surprisingly, its connection with anti-entropy has turned out to be the inverse.

Keywords: Communication Code · Compression · Cross-lingual · Entropy · Unsupervised Language Learning · Natural Language · Meta-learning · Subword Segmentation · Tokenization
1 Introduction

The latest advances in natural language processing demonstrated by so-called large language models (LLM) [1] are typically based on tokenization relying on hardcoded punctuation rules and on subword segmentation based on so-called byte-pair encoding (BPE) [2] and its variations such as BPE with dropout [3] and dynamic programming encoding (DPE) [4]. Relying on hardcoded punctuation may be thought of as an unfair approach if the goal is to learn the language bottom-up from the ground. In turn, the known subword segmentation techniques do not appear to conform well to true language morphology. Moreover, LLMs are based on non-interpretable distributed representations whose interiors cannot be explicitly validated to conform with true knowledge or grammatical rules, even though the latest versions can approximate human language decently.
The computational resources required for that, given the number of parameters in the latest models, are enormous, which may be compensating for the inaccuracy of the subword tokenization schemes. An alternative language learning approach based on interpretable formal language models such as Link Grammar [5] has been explored in earlier works, with some promising results obtained for the English language [6]; however, this work has been based on hardcoded tokenization rules as well. Attempts to learn tokenization models in an unsupervised way have been made in other prior work such as [7] and then [8], where the latter work involved multiple languages. Unfortunately, the levels of accuracy presented in the latter works have been found to be well below the ones obtained with hardcoded tokenization schemes. The latest work [9] improves the accuracy of cross-lingual tokenization, reaching an F1 score of 0.99 for English, 1.0 for Russian, and 0.71 for Chinese. Such a level of accuracy has been achieved using the so-called "transition freedom" ("freedom of transition") metric, apparently relying on the fundamental ground of the so-called "free energy principle" suggested by the earlier fundamental work [10], where the minimization of uncertainty is posed as a key principle of brain function and hence may be applied to the nature of human language structure. However, the latest results have been achieved with a manual search for optimal hyper-parameters. The very latest work [11], applied to the same three languages, has attempted to find a way of meta-learning the optimal hyper-parameters, finding a connection between the target F1 score of tokenization itself and the introduced culture-agnostic metrics called normalized anti-entropy (1), compression factor and cross-split F1 score described in the paper.
S̃ = 1 − H / log2(L)    (1)
The normalized anti-entropy S̃ defined above is based on H, the Shannon entropy of the entire training set tokenized with a given tokenization model, where L is the size of the lexicon underlying the tokenization model. The compression factor C% is asserted as the ratio between the "compressed" size of the training set given the current tokenization model (numerator) and the uncompressed size of the training set (denominator). The compressed size is evaluated as the length of the sequence of token indexes in the entirely tokenized text corpus plus the size of the "dictionary", i.e. the sum of the lengths of all tokens in it. The uncompressed size of the training set is evaluated simply as the count of symbols in it. The cross-split F1 score, called CSF1 in [11], is defined as follows. First, we split the training set corpus into two pieces of the same size, call them set A and set B. Next, we create the graph traversal models across N-grams according to the previously cited work [9] for each of the corpora, call them M(A) and M(B). Then, we tokenize the test set with both models, so that T(M(A)) and T(M(B)) are obtained. Finally, we evaluate the cross-split F1 score of tokenization as CSF1 for T(M(A)) against T(M(B)), having one as a "ground truth" for the other. Unfortunately, the corpora in the latter work were not well aligned, so the conclusions drawn from that work might not be seen as quite reliable; therefore, in this work we have tried to reproduce the claimed results with different, aligned corpora for the same English, Russian and Chinese languages. Moreover, we have tried to expand the scope of the study to address the subword segmentation problem.
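A minimal sketch (in Python, not the authors' code) of how the two corpus-level metrics defined above can be computed for a training set tokenized by some candidate model; the exact normalization used in [11] may differ.

```python
import math
from collections import Counter

def normalized_anti_entropy(tokens):
    """S~ = 1 - H / log2(L): H is the Shannon entropy of the token stream,
    L is the size of the lexicon (number of distinct tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    lexicon_size = len(counts)
    if lexicon_size < 2:
        return 1.0  # degenerate case: a single token type carries no entropy
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1.0 - entropy / math.log2(lexicon_size)

def compression_factor(tokens, raw_text):
    """Ratio of the 'compressed' size (token index sequence plus the summed
    lengths of all distinct tokens) to the raw character count."""
    dictionary = set(tokens)
    compressed = len(tokens) + sum(len(t) for t in dictionary)
    return compressed / max(1, len(raw_text))

# toy example: whitespace splitting stands in for a learned tokenization
text = "the cat sat on the mat"
tokens = text.split()
print(normalized_anti_entropy(tokens), compression_factor(tokens, text))
```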
2 Approach

2.1 Tokenization or Word-Level Text Segmentation

The interpretable language learning approach developed through earlier works [5, 6, 9, 11] is based on the assumption that graph-based models can be learned through graph analysis of a hierarchy of linguistic entities. The basic parts of the learning process are a) segmentation of a stream of smaller linguistic entities into larger groups and b) clustering of these entities on different levels into categories. That is, the former enables learning word-level and punctuation-level tokenization, morphological parsing and then phrase and sentence boundary detection. In turn, the latter makes it possible to learn categories of letters and punctuation such as vowels, consonants, digits, delimiters and quoting symbols, or categories of words such as determiners, nouns and verbs. According to the studies in [7, 9, 11], an initial raw representation of the corpus can be built as a weighted graph of transitions through the training corpus based on N-grams of different arity. This makes it possible to build further probabilistic models of transitions from any path on the graph to the next segment of the path, with the probabilities and transition freedom computed for each possible transition. At the level of the character-based graph (N = 1 for N-gram to N-gram transitions), by clustering characters in the space of the forward and backward transitions on the graph, even this raw model makes it possible to learn impressive clusters of character categories and punctuation, as shown in Fig. 1, based on the RusAge corpus (https://www.kaggle.com/datasets/oldaandozerskaya/fiction-corpus-for-agebased-text-classification).
Fig. 1. Clustering characters in the space of forward and backward unigram-to-unigram transitions, based on a test subset of the RusAge corpus. Clusters that may be identified left-to-right: English vowels, English consonants, digits, Russian vowels, punctuation symbols, Russian consonants.
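As a toy illustration of the clustering described above, the sketch below builds forward and backward unigram-to-unigram transition count vectors for each character of a tiny text and clusters them with k-means; scikit-learn is used purely for illustration and is not implied to be the tooling behind Fig. 1.

```python
import numpy as np
from sklearn.cluster import KMeans

text = "the cat sat on the mat, the dog ate 12 fish."
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

fwd = np.zeros((len(chars), len(chars)))
bwd = np.zeros((len(chars), len(chars)))
for a, b in zip(text, text[1:]):
    fwd[idx[a], idx[b]] += 1   # counts of the following character
    bwd[idx[b], idx[a]] += 1   # counts of the preceding character

features = np.hstack([fwd, bwd])            # each row describes one character
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
for k in range(3):
    print(k, [c for c, l in zip(chars, labels) if l == k])
```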
The tokenization technique based on such representations, according to [7, 9, 11], can be implemented by detecting the peaks of the transition freedom profiles built on forward and backward traversal along the stream of N-grams.
The hyper-parameters of such a process, according to [9, 11], are N (the order of the N-gram), the threshold used to detect the peaks on the transition freedom profiles, and the threshold used to prune low-frequency transitions in the raw model before the tokenization process. The latest work [11] claims a correlation between the tokenization F1 score obtained with different combinations of the hyper-parameters and the suggested culture-agnostic metrics, as explored by means of grid search such as shown in Fig. 2, applied to the English Brown corpus as a train set (http://www.sls.hawaii.edu/bley-vroman/brown_nolines.txt) and the English MagicData corpus as a test set (https://magichub.com/datasets/chineseenglish-parallel-corpus-finance).
Fig. 2. Grid search for the maximized tokenization F1 score (top) along with the best compression factor (middle) and normalized anti-entropy (bottom), correlation is seen across all three.
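A rough, self-contained sketch of the peak-based segmentation idea with the hyper-parameters named above (N and the peak threshold); the exact peak criterion and the backward pass used in [9, 11] are not reproduced here, so this is an assumption-laden illustration rather than the published algorithm.

```python
from collections import defaultdict

def build_model(corpus, n=2):
    nexts = defaultdict(set)                 # n-gram context -> set of observed next characters
    for i in range(n, len(corpus)):
        nexts[corpus[i - n:i]].add(corpus[i])
    return nexts

def segment(text, nexts, n=2, threshold=1):
    # freedom[i - 1]: variety of continuations after the context ending at position i
    freedom = [len(nexts.get(text[max(0, i - n):i], set())) for i in range(1, len(text))]
    # cut before the character where the freedom profile rises above the threshold
    cuts = [i + 1 for i in range(1, len(freedom))
            if freedom[i] > freedom[i - 1] and freedom[i] > threshold]
    tokens, start = [], 0
    for c in cuts + [len(text)]:
        tokens.append(text[start:c])
        start = c
    return [t for t in tokens if t]

model = build_model("the cat sat on the mat and the rat ate", n=2)
print(segment("thecatsat", model, n=2, threshold=1))
```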
In this work we attempt to validate the findings mentioned above using the technique described in [9, 11], but applying it to better aligned, richer and more diverse train and test corpora.

2.2 Morphological Parsing or Subword-Level Text Segmentation

Moreover, we explore the possibility of subword segmentation or morphological parsing ("morpho-parsing") conducted following the same approach, applied to individual pre-segmented tokens or words in an attempt to detect word pieces conforming to the morphology known for a human language. The primary goal was to see if the described approach can be used to achieve subword segmentation emitting word pieces closer to known morphological units, compared to known approaches [2–4]. The secondary goal was to explore the connection between the accuracy of such a process and the culture-agnostic metrics used in the former study.
For the initial experiments we have used the Aigents English lexicon downloaded from https://raw.githubusercontent.com/aigents/aigents-java/master/lexicon_english.txt as a training corpus, where the N-gram-to-character transition graph model has been built for N in the range from 1 to 7 inclusively, with account taken of the word frequency count, so that the transition count per graph edge takes the word frequency into account. For the reference "ground truth" morphological parsing, against which the F1 score is computed, we have used a "greedy parser" based on morphological units, such as prefixes and suffixes, downloaded from https://github.com/aigents/pygents/tree/main/data/corpora/English/morphology. To control the quality of such a reference we have used the control set of words used for illustration in [4] and have had the result of its reference tokenization reviewed by a native English speaker with 100% acceptance.
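A minimal sketch of a greedy affix-stripping parser of the kind used as the reference; the affix lists below are short illustrative samples, not the actual dictionaries from the pygents repository.

```python
PREFIXES = ["un", "re", "dis", "pre"]
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]

def greedy_parse(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    pieces, core = [], word
    changed = True
    while changed:                      # strip the longest matching prefix, repeatedly
        changed = False
        for p in sorted(prefixes, key=len, reverse=True):
            if core.startswith(p) and len(core) > len(p) + 1:
                pieces.append(p); core = core[len(p):]; changed = True
                break
    tail = []
    changed = True
    while changed:                      # strip the longest matching suffix, repeatedly
        changed = False
        for s in sorted(suffixes, key=len, reverse=True):
            if core.endswith(s) and len(core) > len(s) + 1:
                tail.insert(0, s); core = core[:-len(s)]; changed = True
                break
    return pieces + [core] + tail

print(greedy_parse("unhappiness"))   # ['un', 'happi', 'ness']
print(greedy_parse("replaying"))     # ['re', 'play', 'ing']
```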
3 Results

3.1 Tokenization or Word-Level Text Segmentation

The first round of unsupervised cross-language tokenization experiments was performed against two completely different lines of data sets.
Fig. 3. Scatter plots indicating connections of hyper-parameters of unsupervised tokenization with F1 score (vertical axis) and culture-agnostic metrics such as anti-entropy, compression factor, cross-split F1 and average of all three (four plots left to right on each halves of the figure) – for two lines of data sets on the left (non-aligned) and on the right (completely aligned). Results are presented for different languages: English (top), Russian (middle), and Chinese (bottom).
The first line of experiments involved the same corpora as in [9, 11] – the English Brown, Russian RusAge and Chinese CLUE Benchmark News 2016 data sets (https://github.com/brightmart/nlp_chinese_corpus), with 1000 text rows from every set randomly selected for test, as presented on the left side of Fig. 3. Notably, all three corpora were quite different and not aligned in any way, so the obtained results might not seem comparable across the languages.
The second line of experiments, presented on the right of Fig. 3, involved the same English Brown corpus, machine-translated by Google into Chinese and Russian for training, and the independent parallel MagicData corpus for test, so both train and test sets were 100% aligned. The figure shows a linear dependency between the F1 score and all four culture-agnostic metrics for Russian and English regardless of the data set; however, for Chinese this correlation appears less consistent and obvious.
Fig. 4. Scatter plots indicating connections of hyper-parameters of unsupervised tokenization with F1 score (vertical axis) and culture-agnostic metrics such as anti-entropy, compression factor, cross-split F1 and average of all three (four plots left to right on each halves of the figure) – for the same Brown data set used for training, but different test sets and testing methodologies and test set sizes. Different test sets: MagicData and CLUE News (left half) and sighan2005/as_test_gold (right half). In the first case (left) reference Jieba tokenizer was used and in the second case (right) manual tokenization markup from the test corpus was used. Different numbers of text lines were used: 100 lines (top), 1000 lines (middle), and 10000 lines (bottom).
Another experiment, presented in Fig. 4, has been run on the graph model learned from the same Chinese version of the Brown corpus, with different numbers of lines in the test sets (100, 1000, and 10000) sparsely selected from the original test sets. Also, two different test sets and testing methodologies were used. First, we have used the Jieba tokenizer applied to the MagicData and CLUE News test sets as in [9, 11], and next we have used the reference manual tokenization coming as part of sighan2005/as_test_gold downloaded from https://github.com/hankcs/multi-criteria-cws/tree/master/data/sighan2005. The finding of the latter experiment is that the correlation between the F1 score of tokenization and the culture-agnostic metrics for Chinese becomes obvious at larger volumes of testing data, at 10000 lines as shown in Fig. 4. Specifically, anti-entropy and compression factor appear the most clearly correlated with the F1 score. In turn, the cross-split F1 score appears the least clearly correlated, being obscured by multiple dots on the right side of each plot, likely corresponding to local extrema in the space of hyper-parameters.
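For illustration, tokenization F1 against a reference segmentation (for example, Jieba output or the sighan2005 gold markup) can be computed by comparing sets of token boundary positions, as in the sketch below; this conveys the scoring idea but is not necessarily the exact scoring code of [9, 11].

```python
def boundaries(tokens):
    pos, cuts = 0, set()
    for t in tokens[:-1]:
        pos += len(t)
        cuts.add(pos)       # character position of the boundary after this token
    return cuts

def tokenization_f1(predicted, reference):
    p, r = boundaries(predicted), boundaries(reference)
    if not p or not r:
        return float(p == r)
    tp = len(p & r)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(r)
    return 2 * precision * recall / (precision + recall)

print(tokenization_f1(["the", "cat", "sat"], ["the", "cats", "at"]))  # 0.5
```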
Summarizing the experiments above, extending the earlier works [9, 11], along with those presented in Figs. 2, 3 and 4, we can conclude that the structure of a language from the tokenization perspective, regardless of the specific linguistic culture, at least in the context of written English, Russian and Chinese, is optimized to maximize both anti-entropy and compression factor for the overall volume of communication. The above can be explained as follows. For any language, we have the same large corpus of texts, which can be tokenized in different ways, relying on N-gram-to-character transition models built from the corpus, given the hyper-parameters discussed in [9, 11]. The set of hyper-parameters making it possible to get the most accurate tokenization with a superior F1 score is assumed to correspond to some cognitive settings in the human brain making it possible for humans to comprehend languages in the way we all do. We can find these hyper-parameters by trying tokenizations with different combinations of them in the space of hyper-parameters, referring to test sets with known tokenizations as a "ground truth". But let us explore whether the same hyper-parameters can be found without knowing the "ground truth" as a reference. Can we just pretend that the speech and text segmentations are made not to make the discovered tokens identical to known words given as a reference, but to make the compression of the information stored in the texts efficient and have its entropy minimized? To check this, we can do the tokenization of the same text test sets in the same space of hyper-parameters, computing the compression factor C% and anti-entropy S̃ on the tokenized test set [11]. When we try this, we do find that the same hyper-parameters that correspond to the highest compression factor and anti-entropy also correspond to the highest F1 score of tokenization. Moreover, we find the same connection for English, Russian and Chinese. The other culture-agnostic metric, after compression factor and anti-entropy, is the cross-split F1 score described above, referring to [11]. The nature of this metric is the following. Let us say we split the same language corpus into two pieces. We can pretend the two pieces correspond to different groups of people using the same language to communicate in different patterns and on different aspects, so think of them as two sub-cultures of the same linguistic culture. Then let us build a tokenization model from the first corpus and use the model to tokenize a test set from the second corpus. After that, we do the opposite – build a tokenization model from the second corpus and use it to tokenize a test set from the first corpus. For each tokenization of the test sets we compute the F1 score and take the average of the two – that is what we call the cross-split F1 score or CSF1. We assume that the ability to get higher CSF1 scores corresponds to a higher ability of a linguistic sub-culture learned on one subset of the language to understand texts from another subset of the language attributed to another linguistic sub-culture. We can do that with different hyper-parameters as we did for C% and S̃ against the F1 score. On the path of this exploration, we have found that the cross-split F1 score CSF1 has a connection to the tokenization F1 similar to its connection with the compression factor and anti-entropy.
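A high-level sketch of the cross-split F1 (CSF1) procedure described above; train_tokenizer and tokenize are hypothetical stand-ins for the model-building and segmentation routines of [9, 11], and tokenization_f1 is the boundary-based scorer sketched earlier.

```python
def cross_split_f1(corpus_lines, test_lines, train_tokenizer, tokenize, tokenization_f1):
    half = len(corpus_lines) // 2
    model_a = train_tokenizer(corpus_lines[:half])   # "sub-culture" A
    model_b = train_tokenizer(corpus_lines[half:])   # "sub-culture" B
    scores = []
    for line in test_lines:
        tokens_a = tokenize(model_a, line)
        tokens_b = tokenize(model_b, line)
        # score each tokenization against the other as if it were ground truth
        scores.append(tokenization_f1(tokens_a, tokens_b))
    return sum(scores) / max(1, len(scores))
```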
So we conclude that this cross-lingual metric, representing the ability of members of different linguistic sub-cultures to understand each other, can be considered as another objective for unsupervised identification of the optimal tokenization hyper-parameters for any language. It should be noted that for Chinese this connection appears not as reliable as it is for English and Russian.
The difference between the three languages is that the expression of the connection between the metrics apparently depends on the size of the alphabet. English, with 26 letters, has the most expressive connection, see the top row of Fig. 3. The expression is fuzzier for Russian, with 33 letters in the alphabet, see the middle row of Fig. 3. The most blurred connection, for Chinese, is seen in the bottom row of Fig. 3. We can attribute this to the huge size of the Chinese alphabet. Still, the connection between the metrics gets clearer with the increase of the size of the Chinese test set, as seen in Fig. 4, where a linear correlation between C%, S̃ and the F1 score across different sets appears with a test set size of 10000 lines, see the bottom row of Fig. 4.
Fig. 5. Scatter plots representing the distribution of F1 scores of sub-word segmentation compared to greedy morphological parsing relying on a dictionary of English suffixes and prefixes, against culture-agnostic metrics such as (left to right) anti-entropy, compression factor, average of the two, and product of the two. The top row corresponds to the experiment based on English lexicon words longer than 10 characters, the bottom row corresponds to the use of all words in the lexicon.
3.2 Morphological Parsing or Subword-Level Text Segmentation

A preliminary cursory study of morphological parsing or subword segmentation has been performed for the English lexicon. The goal of the study was to see if the accuracy of such segmentation, validated on real English morphology, can be associated with the culture-agnostic metrics described earlier. We have trained the graph model of N-gram-to-character transitions with N in the range from 1 to 10 based on the English lexicon mentioned in Sect. 2.2, with account taken of the relative word frequency known from the lexicon data. Then we have performed subword segmentation with different hyper-parameters, such as N and the threshold on the transition freedom peak value used to detect the text segment boundary, according to [9, 11], using the same lexicon as a test set.
The F1 score of segmentation was computed on a word-by-word basis referring to the "greedy" morphological parser relying on reference dictionaries of English prefixes and suffixes. The overall F1 score across words was computed as a weighted average with account taken of the word frequency in the lexicon. Similarly, the anti-entropy S̃ and compression factor C% metrics were computed with account taken of the word frequency. In addition to the basic metrics we also used two derived metrics: the average of S̃ and C%, and their product S̃ * C%. Results of the experiment in the space of hyper-parameters, presented in Fig. 5, show a strong positive correlation between the morpho-parsing F1 score and the compression factor, as was found for the case of tokenization earlier. However, surprisingly, the anti-entropy has been found to show a rather strongly negative correlation with the F1 score, quite opposite to what we have learned in the case of tokenization, as shown in Fig. 6.
Fig. 6. Illustration of the similarly positive connection between text segmentation F1 score and compression factor C%, with the opposite nature of the connection between F1 score and anti-entropy S̃, in the case of tokenization (left two scatter plots) and morpho-parsing (right two scatter plots).
4 Conclusions

We have confirmed a strongly positive connection between the culture-agnostic information metrics (anti-entropy, compression factor and cross-cultural consistency, i.e. cross-split F1 score) and the language model structure at the level of word-level segmentation or tokenization, across the Russian, English and Chinese languages. That suggests the nature of language evolution as a generic process of development of symbolic communication codes efficient from multiple perspectives and respective measures. We have also found a similar connection in respect to sub-word level segmentation or morphological parsing in the case of English, but that was limited to the compression factor measure, while the association of the language structure at this level with anti-entropy has been found to be the opposite: the most accurate morphological segmentation of a text corresponds to a maximized compression factor and maximized entropy (minimized anti-entropy) at the same time, which appears surprising and needs further study.
Acknowledgments. We are grateful to Sergey Terekhov and Nikolay Mikhaylovskiy for valuable questions, critique, recommendations and suggestions during the course of work.
References

1. Zhao, W., et al.: A Survey of Large Language Models. arXiv:2303.18223 [cs.CL] (2023)
2. Gage, P.: A new algorithm for data compression. In: The C Users Journal, vol. 12, issue 201, pp. 23–38 (1994)
3. Provilkov, I., Emelianenko, D., Voita, E.: BPE-dropout: simple and effective subword regularization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1882–1892, Online. Association for Computational Linguistics (2020)
4. He, X., Haffari, C., Norouzi, M.: Dynamic programming encoding for subword segmentation in neural machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3042–3051, Online. Association for Computational Linguistics (2020)
5. Vepstas, V., Goertzel, B.: Learning language from a large (unannotated) corpus. In: Computing Research Repository, arXiv:1401.3372 [cs.CL] (2014)
6. Glushchenko, A., Suarez, A., Kolonin, A., Goertzel, B., Baskov, O.: Programmatic link grammar induction for unsupervised language learning. In: Artificial General Intelligence, pp. 111–120. Springer International Publishing, Cham (2019)
7. Wrenn, J., Stetson, P., Johnson, S.: An unsupervised machine learning approach to segmentation of clinician-entered free text. In: Proceedings of the AMIA Annual Symposium, pp. 811–815 (2007)
8. Kearsley, L.: A hybrid approach to cross-linguistic tokenization: morphology with statistics. Brigham Young University, Theses and Dissertations, no. 5984 (2016)
9. Kolonin, A., Ramesh, V.: Unsupervised tokenization learning. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3649–3664 (2022)
10. Friston, K.: The free-energy principle: a unified brain theory? Nature Reviews Neuroscience 11(2), 127–138 (2010)
11. Kolonin, A.: Self-tuning hyper-parameters for unsupervised cross-lingual tokenization. arXiv:2303.02427 [cs.CL] (2023)
Solving the Problem of Diagnosing a Disease by ECG on the PTB-XL Dataset Using Deep Learning

Vyacheslav Razin (B), Alexander Krasnov, Denis Karchkov, Viktor Moskalenko, Denis Rodionov, Nikolai Zolotykh, Lev Smirnov, and Grigory Osipov

Lobachevsky State University of Nizhny Novgorod, 603022 Nizhny Novgorod, Russia
[email protected]
http://www.unn.ru/
Abstract. Diagnosis by electrocardiogram (ECG) is an extremely urgent and important task, on the quality, timeliness and speed of which people's lives and health depend. To date, a large number of researchers treat neural networks as a panacea, hoping that any task can be solved quickly and without problems. Often, this approach does not lead to the best results. The article explores the use of deep learning as a universal tool for solving the problem of determining pathological ECG signals with markers of myocardial infarction, hypertrophy, conduction disturbances, and changes in ST segment morphology. During the experiments, the positive impact of using thresholding and replacements to increase the predictive ability of the network was established, as well as that of using various ensembles built on the trained deep learning models. The addition of artificial models also improves the classifying ability of ensembles. Returning a random number in the absence of a single mode also makes it possible to increase the accuracy of the ensemble.

Keywords: deep learning · ECG diagnostics · recurrent neural network · convolutional neural network · ensemble · artificial intelligence · multilabel classification
1 Introduction
The wide spread of cardiovascular diseases (CVD) worldwide in the second half of the 20th century served as a reason to consider them an "epidemic of CVD". CVD will remain the most pressing health problem in most countries of the world in the 21st century. Mortality from cardiovascular diseases occupies one of the leading places in the overall structure of mortality. Diseases of the cardiovascular system can truly be considered the main factor shaping the health of the nation. According to statistics compiled by the World Health Organization, the percentage of deaths from diseases of the heart and blood vessels, although steadily decreasing, still remains prohibitively high: as of 2020, the share associated with various pathological conditions of the heart muscle is 31%.
Most cardiologists agree that this type of disease is easier to stop in the early stages of development. Among the pathologies that attract increased attention among cardiologists are myocardial infarction, impaired conduction of the heart tissue, hypertrophy, and changes in the ST segment of the ECG signal. Each of the presented diseases is critically dangerous for humans and may indicate malnutrition of the heart muscle, the occurrence of necrosis, the appearance of heart tissues with impaired conduction of nerve impulses, an increase in workload, and so on. In recent years, there has been a relative decrease in mortality rates from cardiovascular diseases, but Russia, according to this criterion, is in first place in comparison with the countries of the European Union. It should be noted that from 2008 to 2018, CVD mortality in Russia decreased by 32%. Improving the quality of medical care makes it possible to influence the dynamics of mortality from cardiovascular diseases. To date, the mechanism of universal, comprehensive screening of the population is being actively integrated into Russian healthcare. This approach will identify predictors of early degradation of the cardiovascular system in a patient. However, due to the critically large imbalance in the number of doctors and patients, it is impossible to implement full control of the condition of each patient. One of the solutions to this problem is the use of decision support systems based on a combination of the diagnostic capabilities of mathematical algorithms and machine learning methods. Also, over the past few years, significant progress has been made in the field of automating ECG diagnostics; on a number of subtasks the computer has already surpassed cardiologists, and a large number of studies are being carried out in this area. Thanks to the data accumulated over the years of cardiology, today researchers can use machine learning algorithms to their full potential, analyzing both the quality of the original data and the capabilities of the applied methods. It is also important that the data are freely distributed and anonymized, which allows researchers to explore the original signal without restrictions. The most popular tools for signal analysis are various neural networks, which have shown excellent results in the processing of two-dimensional signals, in particular images, and are suboptimal, but almost always working, solutions, including in the analysis of biologically active signals. The article discusses the analysis of twelve-channel ECG obtained from ten electrodes of the cardiograph in order to determine normal cardiograms (without obvious deviations from the norm), as well as signals with markers of conduction disturbances, hypertrophy, infarction and ST segment elevation. Within the framework of this work, the procedure for diagnosing an electrocardiogram by software is considered. Typically, ECG datasets are very small, but that is not the case for the PTB-XL dataset [1] hosted by PhysioNet [2]. On this data set, the problem of diagnosing an electrocardiogram by classifying the main subclasses of diagnoses is solved. Deep learning is used as the ECG analysis tool, as it demonstrates good performance in solving classification problems of various kinds, as well as in ECG diagnostic problems.
Approaches for the classification of time series are described in many works [3,4] using various architectures of neural networks and methods [5–7].
2 Data and Methods

2.1 PTB-XL Data Set
This section briefly introduces the PTB-XL [1] dataset that underlies the experiments presented below. The PTB-XL dataset contains 21,837 clinical 12-channel ECG recordings of 10 s duration at a sampling rate of 100 Hz. There are also diagnostic statements consisting of 5 subclasses (NORM: normal ECG, CD: conduction abnormality, MI: myocardial infarction, HYP: hypertrophy, and STTC: ST/T changes). Additional and all basic information about the dataset is contained in the original publication [1].
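For orientation, a hedged sketch of loading a 100 Hz PTB-XL record with pandas and wfdb is shown below; the file and column names (ptbxl_database.csv, filename_lr, scp_codes) follow the dataset description on PhysioNet and should be verified against the downloaded copy.

```python
import ast
import pandas as pd
import wfdb

PATH = "ptbxl/"  # local path to the extracted dataset (assumption)
meta = pd.read_csv(PATH + "ptbxl_database.csv", index_col="ecg_id")
meta["scp_codes"] = meta["scp_codes"].apply(ast.literal_eval)  # stored as dict strings

# read one 10-second, 12-channel record sampled at 100 Hz
signal, fields = wfdb.rdsamp(PATH + meta.iloc[0]["filename_lr"])
print(signal.shape, fields["sig_name"])   # expected: (1000, 12) and the lead names
```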
2.2 Time Series Classification Algorithms
For a comparative analysis of different classification algorithms, this work is based on algorithms that work with raw multivariate time series data. This work uses 3 main categories of deep neural network architectures:
– convolutional neural networks;
– recurrent neural networks;
– a combination of convolutional and recurrent networks.
The models were trained on the original time series data without any additional preprocessing or using mean normalization.
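As an illustration of the convolutional category, a minimal Keras baseline over raw (1000, 12) inputs with a 5-unit sigmoid head for multilabel output is sketched below; it is not one of the ten architectures evaluated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(1000, 12)),          # 10 s at 100 Hz, 12 leads
    layers.Conv1D(32, 7, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="sigmoid"),   # NORM, MI, HYP, STTC, CD
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```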
2.3 Multilabel Classification Metrics
As the metric for evaluating classification models, preference was given to the ABS metric (the percentage of absolutely accurately predicted sets of diagnoses), which clearly shows the proportion of exact predictions of a particular model. This metric reflects the absolute accuracy of predictions and distinguishes well between various machine learning methods according to their ability to predict the desired set of subclasses.
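A short sketch of the ABS metric as described: the share of samples whose entire predicted label vector matches the ground truth exactly (exact-match ratio, or subset accuracy, in multilabel terminology).

```python
import numpy as np

def abs_score(y_true, y_pred):
    """Percentage of samples whose full label vector is predicted exactly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)) * 100.0)

print(abs_score([[1, 0, 0, 0, 0], [0, 1, 0, 1, 0]],
                [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]))  # 50.0
```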
3 Experimental Results

3.1 Description of the Task
The problem being solved is multilabel classification: predicting the presence of the classes (NORM, MI, HYP, STTC, CD) from an input 12-channel 10-s ECG recording with a sampling frequency of 100 Hz. This task is relevant, since at the moment the selected dataset is the largest open dataset of its kind in the world, and it periodically receives updates.
This data size makes it possible to expect growth of the predictive ability as the training sample increases. As a result of the experiments, the top 10 architectures that demonstrated the greatest predictive ability compared to the others were selected. The suggested ABS score is applied equally to each experiment. Top 10 predictive models:
– conventional convolutional network [8] (hereinafter CNN);
– separable convolutional network [9] (hereinafter SCNN);
– AlexNet-based network [10] (hereinafter AlexNet);
– combination of a convolutional network with bidirectional LSTM and GRU [11] (hereinafter CBB);
– Inception-based network [12] (hereinafter Inception);
– LSTM recurrent network [13] (hereinafter LSTM);
– GRU recurrent network [14] (hereinafter GRU);
– combination of LSTM and GRU (hereinafter LG);
– VGG16-based network [15] (hereinafter VGG16);
– Xception-based network [16] (hereinafter Xception).
3.2 The Ensemble with the Best Predictive Power
As a result of the experiments, an ensemble was found consisting of the multilabel models AlexNet, CBB, LSTM, LG and CNN, the multiclass models CBB, LSTM, SCNN and LG, and 4 artificial models that always return the following numbers of diagnosis sets: 10, 12, 24, 28. The ABS of this ensemble is 73.05% (rounded to the nearest hundredth).
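A toy sketch of mode voting over diagnosis-set numbers with constant "artificial" participants added; the actual winning ensemble mixes multilabel and multiclass networks as listed above, so this only illustrates the voting mechanics, with random tie-breaking as mentioned in the abstract.

```python
from collections import Counter
import random

def mode_vote(predicted_set_numbers, artificial=(10, 12, 24, 28), rng=random.Random(0)):
    votes = list(predicted_set_numbers) + list(artificial)  # network votes + constant models
    counts = Counter(votes)
    top = counts.most_common()
    best, best_count = top[0]
    tied = [v for v, c in top if c == best_count]
    # returning a random choice among tied modes is one of the tricks
    # reported to improve ensemble accuracy
    return best if len(tied) == 1 else rng.choice(tied)

print(mode_vote([12, 12, 7, 24]))  # 12 wins with three votes (two models plus one artificial)
```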
3.3 Experimental Details
An experimental method was chosen as the research method; its essence lies in developing new neural network architectures and checking their classification performance after the learning process. Based on the results of training the selected neural networks, all possible ensembles of various types are built and the best one in terms of classifying ability is identified. The data is split into training and test samples in such a way that there is an equal distribution of sets of diagnoses within each sample, with a ratio of 9:1. A change in this ratio negatively affects the predictive ability of the ensembles. The use of cross-validation also does not improve the accuracy of model predictions. Mean normalization is used for the input data. The learning process involves callback functions that keep the weights of the networks with the lowest loss on the test set, reduce the learning rate, or stop learning altogether if there is no decrease in the loss value on the test set for a long time. Using the KerasTuner optimizer for selecting neural network hyperparameters with all available methods (RandomSearch, GridSearch, BayesianOptimization, Hyperband, Sklearn) also does not improve their predictive ability.
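A sketch of the callback setup described above (keep the best weights by validation loss, reduce the learning rate on plateau, stop early); the patience values and file name are placeholders, not the paper's settings.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True, save_weights_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                     restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)
```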
to as multilabel). In addition, this paper considers an alternative approach in which the set of diagnoses is encoded as a vector of length 32, where 1 marks the presence of exactly one specific set of diagnoses and 0 the absence of that set (hereinafter referred to as multiclass). The networks are trained with both encodings. The ABS scores of the neural networks are shown in Table 1 (highest ABS for each case in bold).

Table 1. ABS for multilabel and multiclass models (as a percentage, rounded to hundredths)

Model name   ABS multilabel   ABS multiclass
AlexNet      67.62            69.18
VGG16        67.80            69.14
Inception    66.05            68.08
CBB          68.13            70.06
LSTM         67.25            68.45
GRU          66.65            68.40
Xception     66.65            69.05
SCNN         67.16            68.72
LG           67.39            69.74
CNN          67.53            68.82
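The two encodings are interchangeable: a length-5 binary vector corresponds to one of 2^5 = 32 possible diagnosis sets. A minimal sketch of such a conversion (the particular numbering of the 32 sets is an assumption made for illustration; the paper does not specify it):

```python
def multilabel_to_multiclass(label_vec):
    """Map a length-5 binary label vector to a diagnosis-set index in [0, 31]."""
    return sum(bit << i for i, bit in enumerate(label_vec))

def multiclass_to_multilabel(set_idx, n_classes=5):
    """Inverse mapping: diagnosis-set index -> length-5 binary vector."""
    return [(set_idx >> i) & 1 for i in range(n_classes)]

print(multilabel_to_multiclass([1, 0, 1, 0, 0]))  # 5
print(multiclass_to_multilabel(5))                # [1, 0, 1, 0, 0]
```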
Of these, the models whose ABS was greater than or equal to 68.1% (rounded to hundredths) were selected, and 9 different ensembles were built on them:

– The mla ensemble takes as input the probabilities of the multilabel models (vectors of length 5) and returns the arithmetic mean over the participants, which is then mathematically rounded.
– The fmla ensemble takes as input the mathematically rounded probabilities of the multilabel models (vectors of length 5) and returns the arithmetic mean over the participants, which is then mathematically rounded.
– The mca ensemble takes as input the probabilities of the multiclass models (vectors of length 32) and returns the arithmetic mean over the participants, which is then mathematically rounded.
– The fmca ensemble takes as input the number of the diagnosis set with the highest predicted probability and returns the arithmetic mean over the participants, which is then mathematically rounded.
– The mlk ensemble takes as input the probabilities of the multilabel models and returns the sum of the forecasts multiplied by coefficients that are higher the higher the ABS of the participant, which is then mathematically rounded.
– The fmlk ensemble takes as input the mathematically rounded probabilities of the multilabel models and returns the sum of the forecasts multiplied by coefficients
that are higher the higher the ABS of the participant, which is then mathematically rounded.
– The mck ensemble takes as input the probabilities of the multiclass models and returns the sum of the forecasts multiplied by coefficients that are higher the higher the ABS of the participant, which is then mathematically rounded.
– The fmck ensemble takes as input the number of the diagnosis set with the highest predicted probability and returns the sum of the forecasts multiplied by coefficients that are higher the higher the ABS of the participant, which is then mathematically rounded.
– The fmcd ensemble takes as input the number of the diagnosis set with the highest predicted probability and returns the mode among the participants; if several numbers are equally frequent, the first of them is returned.

For all ensembles, the set of participants that yields the largest ABS is found by exhaustive enumeration. Table 2 shows the ABS scores for each ensemble; the highest score is shown in bold.

Table 2. ABS of ensembles built from models whose ABS is greater than or equal to 68.1% (as a percentage, rounded to the nearest hundredth)

Ensemble name   ABS
mla             70.80
mlk             70.80
fmla            70.94
fmlk            70.94
mca             71.21
mck             71.21
fmca            70.06
fmck            70.06
fmcd            71.17
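A minimal sketch of the averaging-style ensembles (mla/mlk) under stated assumptions: probabilities are averaged, or combined with normalized ABS-based weights (the exact weighting scheme is not given in the paper), and then thresholded at 0.5 as the "mathematical rounding" step.

```python
import numpy as np

def mla_predict(model_probs):
    """mla-style ensemble: mean of the length-5 probability vectors
    of the participants, then rounded to a 0/1 vector."""
    mean = np.mean(np.stack(model_probs), axis=0)
    return (mean >= 0.5).astype(int)

def mlk_predict(model_probs, model_abs):
    """mlk-style ensemble: ABS-weighted combination of the forecasts
    (weights normalized to sum to 1 -- an assumption), then rounded."""
    w = np.asarray(model_abs, dtype=float)
    w /= w.sum()
    weighted = np.tensordot(w, np.stack(model_probs), axes=1)
    return (weighted >= 0.5).astype(int)

p1 = np.array([0.9, 0.2, 0.1, 0.6, 0.1])   # hypothetical model outputs
p2 = np.array([0.8, 0.4, 0.2, 0.3, 0.1])
print(mla_predict([p1, p2]))                  # [1 0 0 0 0]
print(mlk_predict([p1, p2], [68.13, 67.25]))  # [1 0 0 0 0]
```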
It can be seen that the mca and mck ensembles show the highest ABS. Next, for each trained model we go through its mistakes on the training-sample forecasts and try to find a threshold at which the model changes its forecast to the correct one. In the worst case the predictive ability remains the same; otherwise, thresholds and replacement values are found at which the model changes its forecasts, thereby increasing its predictive ability. Table 3 shows the ABS for each network and each approach, with the highest score for each case in bold. All models increase their ABS after the replacement thresholds and the replacements themselves are selected. CBB still gives the highest scores in both the multilabel and the multiclass setting. After these changes, the VGG16, Inception, and CNN models score higher
in multilabel than in multiclass, whereas before the changes every model scored higher in multiclass than in multilabel.

Table 3. ABS for multilabel and multiclass models after improvement (as a percentage, rounded to hundredths)

Model name   ABS multilabel   ABS multiclass
AlexNet      70.11            70.20
VGG16        69.88            69.78
Inception    69.74            68.45
CBB          70.34            70.75
LSTM         69.09            69.28
GRU          68.72            69.00
Xception     68.86            69.69
SCNN         69.18            69.46
LG           68.95            70.15
CNN          70.11            69.32
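The exact threshold-selection procedure is not spelled out in the text, so the following is only one plausible reading: a greedy per-class search over decision thresholds that maximizes the exact-match (ABS) score of a multilabel model on its training forecasts.

```python
import numpy as np

def tune_thresholds(probs, y_true, grid=None):
    """Greedy per-class threshold search (a hedged interpretation of the
    replacement procedure): for each of the 5 classes, pick the probability
    cut-off that maximizes exact-match accuracy on the training forecasts."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    probs, y_true = np.asarray(probs), np.asarray(y_true)
    thresholds = np.full(probs.shape[1], 0.5)
    for c in range(probs.shape[1]):
        best = -1.0
        for t in grid:
            trial = thresholds.copy()
            trial[c] = t
            score = np.all((probs >= trial).astype(int) == y_true, axis=1).mean()
            if score > best:
                best, thresholds[c] = score, t
    return thresholds
```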
Next, the best fmcd ensemble is searched for among all 20 models. The experiment yields an ensemble consisting of the multilabel models AlexNet, CBB, LSTM, LG, and CNN and the multiclass models CBB, LSTM, SCNN, and LG; its ABS is 72.59% (rounded to the nearest hundredth). We then introduce the fmcdr ensemble, a modification of fmcd in which, when several numbers are equally frequent, a randomly selected one is returned. The fmcdr participants are the models that formed the best fmcd ensemble. A complete enumeration of the random number generator seed yields an ensemble with an ABS of 72.82% (rounded to hundredths). After that, we introduce artificial models, as was done in previous work. There are 32 of them, each always returning one specific diagnosis set number. To the models that showed the best result in the fmcd ensemble, we add one artificial model at a time and check whether the fmcd score increases with it. The experiments show that adding the models returning the diagnosis set numbers 10, 12, 24, 28, and 30 has a positive effect on the ABS of the ensemble. Next, these 5 artificial models are added to the models that showed the best result in the fmcd ensemble. As a result, we obtain an ensemble consisting of the 9 models of the best fmcd ensemble and 4 artificial models (always returning the numbers 10, 12, 24, and 28), with an ABS of 72.87% (rounded to hundredths). Finally, we search for the seed that gives the highest ABS for the fmcdr
ensemble built from these fmcd models. The resulting fmcdr ensemble reaches an ABS of 73.05% (rounded to hundredths).
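A minimal sketch of the fmcd/fmcdr voting rules described above; the only difference is how ties between equally frequent diagnosis-set numbers are broken.

```python
import random
from collections import Counter

def fmcd(predictions):
    """Mode of the participants' predicted set numbers;
    ties are broken by taking the first most frequent value."""
    counts = Counter(predictions)
    top = max(counts.values())
    return next(p for p in predictions if counts[p] == top)

def fmcdr(predictions, seed=0):
    """fmcdr modification: ties are broken by a random choice
    controlled by the enumerated seed."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [p for p in counts if counts[p] == top]
    return random.Random(seed).choice(tied)

print(fmcd([10, 12, 10, 24]))    # 10
print(fmcdr([10, 12, 24], 42))   # one of the tied numbers, fixed by the seed
```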
4 Conclusions
The experiments established the positive impact of thresholding and forecast replacements on the predictive ability of the networks, as well as the benefit of building various ensembles over the trained deep learning models. Adding artificial models further improves the classifying ability of ensembles, and returning a random number in the absence of a single mode also increases ensemble accuracy. The trained model weights can be used for transfer learning on smaller datasets (as is done with ImageNet) and for extracting features for classical machine learning models. In the future, it is planned to consider other, more complex neural network architectures and new, as yet untested machine learning methods, as well as other types of ensembles. It is also planned to address more complex diagnostic tasks and to introduce explainable artificial intelligence for solving them. Acknowledgements. Results obtained in numerical experiments with convolutional neural networks are supported by the Ministry of Science and Education of the Russian Federation, project 0729-2021-013. Results obtained in numerical experiments with recurrent neural networks are supported by the Federal academic leadership program Priority 2030.
References 1. Wagner, P., et al.: PTB-XL, a large publicly available electrocardiography dataset. Sci. Data 7(1), 1 (2020). https://doi.org/10.1038/s41597-020-0495-6 2. Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000). https://doi.org/10.1161/01.CIR.101.23.e215 3. Nikolsky, A.V., Levanov, V.M., Karchkov, D.A., Moskalenko, V.A.: Efficiency of diagnosing cardiovascular diseases in the format of a specialized service of automatic telemonitoring using the cyberheart software and hardware complex. Ural Med. J. (7), 64 (2020). https://doi.org/10.25694/URMJ.2020.07.39 4. Kalyakulina, A.I., et al.: Finding morphology points of electrocardiographic-signal waves using wavelet analysis. Rad. Quan. Electron. 61(8), 689 (2019). https://doi. org/10.1007/s11141-019-09929-2 5. Rodionov, D., Karchkov, D., Moskalenko, V., Nikolsky, A., Osipov, G., Zolotykh, N.: Possibility of using various architectures of convolutional neural networks in the problem of determining the type of rhythm. In: Kryzhanovsky, B., DuninBarkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 362–370. Springer, Cham (2022). https://doi.org/10.1007/9783-031-19032-2 38
6. Moskalenko, V., Zolotykh, N., Osipov, G.: Deep learning for ECG segmentation. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2019. SCI, vol. 856, pp. 246–254. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-30425-6 29 7. Rodionov, D.M., Karchkov, D.A., Moskalenko, V.A., Nikolsky, A.B., Osipov, G.V., Zolotykh, N.Yu.: Diagnostics of sinus rhythm and fibility by artificial intelligence. Probl. Inform. 1(54), 77 (2022). https://doi.org/10.24412/2073-0667-2022-1-77-88 8. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET), pp. 1–6. IEEE (2017) 9. Hu, Z., Hu, Y., Liu, J., Wu, B., Han, D., Kurfess, T.: 3D separable convolutional neural network for dynamic hand gesture recognition. Neurocomputing 318, 151– 161 (2018) 10. Ismail Fawaz, H., et al.: InceptionTime: finding AlexNet for time series classification. Data Min. Knowl. Discov. 34(6), 1936–1962 (2020) 11. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015) 12. Wang, C., Chen, D., Hao, L., Liu, X., Zeng, Y., Chen, J., Zhang, G.: Pulmonary image classification based on inception-v3 transfer learning model. IEEE Access 7, 146533–146541 (2019) 13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 14. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014) 15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 16. Jinsakul, N., Tsai, C.F., Tsai, C.E., Wu, P.: Enhancement of deep learning in image classification performance using xception with the swish activation function for colorectal polyp preliminary screening. Mathematics 7(12), 1170 (2019)
Zero-Shot NER via Extractive Question Answering

Danil Tirskikh and Vasily Konovalov

ITMO University, St. Petersburg, Russia
Moscow Institute of Physics and Technology, Dolgoprudny, Russia
[email protected]
Abstract. Although the task of named entity recognition (NER) is usually solved as a sequence tagging problem via traditional supervised learning approaches, which require a substantially sized annotated dataset, recent works aiming to utilize pretrained extractive question-answering (QA) models have shown significant few- and zero-shot capabilities. This work aims to further investigate their applicability in the zero-shot setting, i.e., without explicit fine-tuning. We construct a test dataset and conduct a series of experiments to determine the shortcomings inherent in using QA models for the NER task. Our findings demonstrate such problems as weak prompt robustness, a high false positive rate, trouble distinguishing between semantically close entity types, and more. We suggest that addressing these problems is a crucial first step in improving the extractive QA NER approach. We conduct most of our experiments using DeepPavlov's models within their framework.
Keywords: BERT · NER · Zero-shot

1 Introduction
The task of named entity recognition (NER) consists of extracting from a piece of text all entities belonging to one of the predefined classes. Usually a named entity is a real-world object or an abstract concept that can be denoted with a proper name. The common approach to solving the NER task is to treat it as a sequence labeling problem, where each token is assigned a corresponding class through token-level classification. Most NER datasets follow the BIO annotation scheme, in which each token is assigned a special tag together with the appropriate entity type: B followed by the entity type is assigned to the first token of an entity, I is assigned in a similar fashion to all subsequent entity tokens, and tokens that do not belong to any of the defined entities get the O tag. A typical example of a tagged sequence is shown in Table 1. Though effective, supervised sequence labeling methods require substantial amounts of annotated data to train the models, which is less than ideal when few annotated examples are available. Since labeling a large corpus of
text requires the work of many human annotators with specific domain knowledge, creating a new dataset becomes a time-consuming and labour-intensive endeavour.

Table 1. BIO annotated NER example

Bill       Gates      is   the   co-founder   of   Microsoft   Corporation
B-PERSON   I-PERSON   O    O     O            O    B-ORG       I-ORG
Moreover, a model created for one particular use case cannot be easily extended to work with new entities, so each domain requires its own separate model. This problem is further exacerbated by the existence of multiple languages, though some work towards creating language-agnostic NER systems has been done in recent years [4]. All these shortcomings have led to a continuous effort to develop new few-shot approaches that do not require much data to be adapted to new use cases and domains. The problem of few-shot NER involves classifying novel entity types based on a small amount of annotated data. A typical few-shot problem is said to be N-way K-shot, where N denotes the number of novel classes that need to be recognized and K stands for the number of annotated examples available per class. When K is equal to 0, we speak of zero-shot learning, i.e., adapting a model to novel classes without presenting it with any training examples. Recent works aiming to solve the few-shot NER problem have adopted a variety of approaches involving large pretrained language models, ranging from standard token classification to combined span extraction and classification methods [18]. However, while they solve the few-shot NER problem with varying degrees of success, most of them are not suited for the zero-shot setting. One of the recent developments that aims to address the poor zero-shot capabilities of previous methods is the use of pretrained extractive question answering models. The idea is that by leveraging the knowledge such models obtain from question-answering data during the pretraining phase, we can greatly improve low-resource NER performance [13]. An approach in which the target model benefits from knowledge gained on another problem is called sequential transfer learning. Transfer learning techniques have proved effective in QA [1,9], Dialogue State Tracking [3,6], and other tasks. Furthermore, the nature of extractive question answering lends itself perfectly to solving the NER task. Nevertheless, despite being a promising solution to a long-standing problem, this approach requires further development and research that would allow us to look objectively at its strong and weak points. Therefore, in this work we aim to establish the drawbacks and caveats that come with the usage of question answering models and to define areas of further research. We want to know the extent of applicability of such models to the NER task.
2 Related Work

2.1 Extractive Question Answering
Extractive question answering is the reading comprehension task of extracting a continuous span of text from the provided context based on a query. Unlike generative question answering, the answer span is selected strictly from the context and is not modified in any way. With the emergence of the transformer architecture, transformer-based models have become a popular choice for solving the extractive QA task [15]. The span prediction in models like BERT is usually done by computing start and end probability scores for each token in the context sequence and choosing a span i to j, j ≥ i, with the maximum sum of start and end scores [5]. One of the most popular datasets used to train QA models right now is the Stanford Question Answering Dataset (SQuAD), comprising more than 100,000 questions based on Wikipedia. One thing that was not addressed in the original SQuAD dataset was the case of a context sequence containing no viable answer to the proposed question; this was rectified in SQuAD2.0 with the addition of approximately 50,000 unanswerable questions written adversarially to closely resemble the answerable ones [16]. That allowed models trained on this dataset to work well in the general case, when the presence of an answer is not guaranteed.
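A minimal sketch of this span-selection step, assuming per-token start and end scores are already available; the brute-force search and the span-length cap are simplifications for illustration:

```python
import numpy as np

def best_span(start_scores, end_scores, max_len=30):
    """Return the span (i, j), j >= i, maximizing start_scores[i] + end_scores[j]."""
    best, best_ij = -np.inf, (0, 0)
    for i in range(len(start_scores)):
        for j in range(i, min(i + max_len, len(end_scores))):
            if start_scores[i] + end_scores[j] > best:
                best, best_ij = start_scores[i] + end_scores[j], (i, j)
    return best_ij

start = np.array([0.1, 2.3, 0.2, -1.0])
end = np.array([-0.5, 0.4, 1.9, 0.0])
print(best_span(start, end))  # (1, 2): tokens 1..2 give the highest combined score
```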
2.2 Zero-Shot NER via Extractive QA
Though NER is traditionally solved as a sequence-labeling task through token-wise classification, and there exist few-shot solutions that follow the same approach, they usually suffer from a massive imbalance towards non-entity tokens. This has given rise to methods that rely on span extraction, such as SpanProto, which proposes a two-phase approach with separate span extraction and mention classification [18]. Query-based span extraction approaches have been used for a variety of tasks such as relation extraction [21] and coreference resolution [20]. In decaNLP, the authors transform ten common NLP tasks, such as semantic role labeling and sentiment analysis, into the question answering format for multitask learning [14]. For the NER task, a unified machine reading comprehension (MRC) framework was proposed that can handle both flat and nested NER [12]. However, most early works utilizing the QA approach did not focus on few- and zero-shot settings. One of the methods based on the MRC framework, QaNER, proposes to use a large language model pretrained on an extractive question-answering task in order to extract the necessary spans and then classify them with the appropriate entity types [13]. Unlike previous works, the authors demonstrate that by leveraging the knowledge gained during pretraining they were able to achieve much higher results in the zero-shot setting. That is why we chose a variation of their proposed method as the basis of our research.
More specifically, to transform the NER task into a question-answering task, for each context C we generate N prompts based on a predefined template, where N is the number of entity types we are trying to extract; therefore, the time complexity per context is O(N). Afterwards, the BIO markup for each context is reconstructed from the predicted spans. This approach is shown in Fig. 1.
Fig. 1. NER to QA conversion scheme as described in the QaNER paper
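A minimal sketch of this NER-to-QA conversion, using the "What is the [ENTITY]?" template introduced in Sect. 3; the entity names below are illustrative:

```python
def ner_to_qa(context, entity_types, template="What is the {}?"):
    """Generate one extractive-QA query per entity type for a given context."""
    return [{"question": template.format(ent), "context": context}
            for ent in entity_types]

for ex in ner_to_qa("Bill Gates is the co-founder of Microsoft Corporation",
                    ["person", "organization"]):
    print(ex["question"], "|", ex["context"])
```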
3 Datasets
Since most existing reading comprehension datasets, including SQuAD2.0, focus only on single-span answers, QA models trained on them are not able to extract more than one entity of each type. In the original QaNER paper this was combated by adding repeating examples, i.e. multiple instances of context-question pairs with different answers corresponding to each entity type instance. Then the top-N spans with the highest scores were chosen as predictions for each entity type. This method cannot be used with existing pretrained QA models, as it requires fine-tuning. Moreover, according to a recent multi-span QA study, such approaches yield poor results in detecting multiple answer spans [11]. That is why for our test benchmark we decided to include only single-span NER examples, i.e. texts that do not include any of the desired entity types more than once. We construct our benchmark based on the OntoNotes Release 5.0 English NER data, which includes 18 entity types [19]. The analysis shows that approximately 85% of all sentences in OntoNotes are single-span. We guarantee that each entity type has at least 200 positive examples representing it, as this is the upper bound for some of the entities. We arrive at 3440 different contexts with 18 questions for each. The questions were created based on the "What is the [ENTITY]?" template, where [ENTITY] is either a name or a short description of an entity. For one of our experiments we fine-tune a model on a different dataset with fewer entity types to see how it affects the score on entities not present during fine-tuning. We construct this dataset from the CoNLL-2003 English NER dataset, leaving out the MISC entity as it encompasses the named entities
not represented by the other classes [17]. The three remaining classes, LOC, PER, and ORG, mirror the similarly named ones in our test dataset. The dataset is then split into training and validation parts. More information on these datasets is presented in Table 2. Note that the negative-to-positive ratio is affected by the number of classes in a dataset: the more entity types we are trying to extract, the more negative examples per context we generate. For a context containing only one entity, we generate N − 1 negative examples, where N is the number of classes.

Table 2. Data used in our research

Split              Entities   Examples   Positive   Negative
Test (OntoNotes)   18         61,920     0.12       0.88
Train (CoNLL)      3          25,302     0.43       0.57
Valid (CoNLL)      3          5,500      0.56       0.44

4 Experimental Setup
We choose a BERT-Base model pretrained on the SQuAD2.0 dataset as the base model used in all our subsequent experiments. We use the same training configuration for all experiments where we perform fine-tuning. The fine-tuning is performed until we reach the early-stopping criterion of no metric improvement on the validation split for 10 consecutive evaluations.

To reconstruct the BIO-annotated text from our predictions we employ the following algorithm. Mark all the tokens in the context sequence with the outside entity type O. Arrange all predictions in descending order of their confidence scores. Starting from the prediction with the highest score, mark all tokens from the predicted span with the appropriate class if all of them currently belong to the O type; otherwise, if one of the tokens in the predicted span already belongs to another entity, skip the prediction altogether. This way we ensure that predictions with higher confidence take priority.

Different Prompt Schemes Comparison. In our first experiment we take our base model and evaluate it on the test dataset. We do not do any fine-tuning in this experiment, as it is meant to test the zero-shot capabilities of the model. We also set out to test the prompt robustness of this approach by utilizing three different sets of prompts: a) entity descriptions taken directly from the OntoNotes paper without any modifications; b) prompts created using the template described previously; c) optimal handcrafted prompts created in order to maximize metrics on our test dataset.

Different Models Comparison. Our hypothesis is that the better the model performs on a question-answering dataset such as SQuAD2.0, the better the
results it will show in solving the zero-shot NER task following our approach. To test this hypothesis we compare our base model to a range of models that report higher scores on the SQuAD2.0 dataset.

QA Pretraining Importance. In this experiment we take a clean BERT-Base model not pretrained on a question answering task and fine-tune it on our train dataset. As only three of the 18 entity types in our test dataset are covered in training, this lets us analyze the influence that pretraining on a general QA dataset has on the zero-shot performance of the model. Our hypothesis is that this will not be enough to transfer knowledge to other entity types, and none of the relevant spans will be extracted.

Fine-Tuning Effects. In our final experiment we fine-tune our base model on the train dataset. Other than the expected rise in metrics on the overlapping entity types, we also expect to see improvement in those entities that were not present during the fine-tuning process, due to the model adapting to our prompt template.
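A minimal sketch of the greedy BIO reconstruction described in this setup; the tuple layout (start token, end token, entity type, confidence) is an assumption for illustration:

```python
def reconstruct_bio(n_tokens, predictions):
    """Greedy BIO reconstruction: higher-confidence spans take priority,
    and spans overlapping an already labeled token are skipped."""
    tags = ["O"] * n_tokens
    for start, end, etype, _ in sorted(predictions, key=lambda p: -p[3]):
        if all(tags[t] == "O" for t in range(start, end + 1)):
            tags[start] = f"B-{etype}"
            for t in range(start + 1, end + 1):
                tags[t] = f"I-{etype}"
    return tags

preds = [(0, 1, "PERSON", 0.95), (6, 7, "ORG", 0.90), (1, 2, "ORG", 0.40)]
print(reconstruct_bio(8, preds))
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']
```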
5 Results
Different Prompt Schemes Comparison. The effect different prompt schemes have on the model's results is shown in Table 3. Unlike in the few-shot setting, where the QA approach quickly adapts to the provided prompts during fine-tuning and slight deviations in wording or structure do not influence the score significantly [13], in the zero-shot setting even a slight change in a prompt can influence the model's predictions, leading to vastly different performance across prompt creation schemes. Also, since predictions are made independently for each entity, the overall quality of the resulting markup can drop even when metrics improve for each entity separately, due to conflicts caused by overlapping entities. Since prompts are formulated in natural language, it becomes harder to differentiate between closely related entities because of word synonymy and polysemy. All this makes selecting optimal queries for the model more difficult. Even so, the results that can be achieved with this approach are promising, with our best prompts yielding an F1 score above 40.

Table 3. Base model results with different prompt creation schemes

Prompt type   Precision   Recall   F1
A             21.88       16.55    18.85
B             28.75       38.80    33.03
C             42.32       43.09    42.70
Let us look more closely at the difference in the prompt types based on the TIME entity. Type A prompt is taken directly from the entity description and
not adapted to the question answering format – "Times smaller than a day". The type B prompt is created using a generic template – "What is the time?". The type C prompt is handcrafted based on the entity description and the data present in the dataset, through repeated tests on a subset where only the desired entity is present – "What hour or minute?".

Different Models Comparison. To assess a model, the SQuAD2.0 dataset employs two metrics: EM (exact match), the percentage of predicted answers that match the ground truth completely, and F1, calculated from the percentage of matching tokens. We assume that a model with a higher EM will be able to find the boundaries of named entity spans more precisely, reducing the number of false positive predictions in the final markup. A high false positive rate can be attributed to the fact that QA models are prone to predicting longer spans that contain the answer, resulting in all the neighboring tokens being labeled as parts of the same entity. We chose six models with higher EM scores than the base model to test our hypothesis, as shown in Table 4.

Table 4. Different models comparison

Model             SQuAD2.0 EM   SQuAD2.0 F1   Test Precision   Test Recall   Test F1
BERT-Base         75.54         83.56         42.32            43.09         42.70
BERT-Large        80.88         83.87         40.45            44.02         42.16
ALBERT-XXLarge    86.11         89.35         38.04            40.47         39.22
RoBERTa-Large     86.82         89.79         41.97            26.93         32.81
ALBERT-XLarge     84.41         87.46         26.17            29.84         27.88
DeBERTaV3-Base    83.83         87.41         22.29            26.17         24.08
RoBERTa-Base      79.93         82.95         44.43            10.95         17.58
We use the unchanged type C prompts from the previous experiment for all models. It is easy to see that higher scores on the SQuAD2.0 dataset or a larger model size do not indicate increased performance in the QaNER case; therefore, our initial hypothesis is not confirmed. Such behavior may be due to overfitting of the network to the features of a particular dataset, which negatively affects the case when the test data and training data differ strongly. The SQuAD2.0 dataset and the OntoNotes dataset our test benchmark is based on are vastly different from each other, both in terms of contexts (the median length of a context sequence is 107 words for SQuAD2.0 versus 22 words for OntoNotes) and in terms of questions. The prompts used in the SQuAD2.0 dataset are not only longer, but also more specific, often alluding to the exact part of the context where the answer is contained. However, since the prompts in our test benchmark were created independently of the contexts they are applied to, the model has less information to
rely on when making a prediction. The poorest performance is shown by the smaller models, which cannot confidently identify the needed entities based on the generic prompts; this results in a drastic decrease in recall, though a slight increase in precision is also observed because only the more obvious cases are labeled as their corresponding entities. While larger models are generally better at picking out the required spans, the results indicate that simply picking a larger model without changing any of the prompts is not a sure-fire way of improving performance.

QA Pretraining Importance. As per our hypothesis, when fine-tuning a base language model not pretrained on the question-answering task, we are unable to extract any entities that were not present in the training data, which is clearly demonstrated in Table 5. Looking closely at the results, something unusual stands out: the comparatively low score for the LOC entity and the presence of the FAC entity. This can be attributed to the fact that different NER datasets define similar entities differently. The LOC entity in the CoNLL-2003 dataset the training data was based on is much broader and actually encompasses several entities from the more fine-grained OntoNotes dataset our test benchmark was based on, which in turn leads to poor precision, as similar entity types get classified as belonging to a broader class.

Table 5. Results after training a new model on the train split (only the entity types with F1 score substantially above zero are listed)

Metric      PERSON   LOC     FAC     ORG     Total
Precision   89.53    19.11   50.34   58.56   53.21
Recall      94.22    59.83   11.46   74.91   22.13
F1          91.81    28.96   18.67   65.73   31.26
Fine-Tuning Effects. After we fine-tune the base model, we see the expected rise in metrics for the entities mirrored in the training data (type A), as well as a significant increase in precision for the entities that only appear during testing (type B), as shown in Table 6. Such results can be attributed to a decrease in false positives due to the model learning to work better with our prompt template, as well as to an increase in metrics for the entities that are semantically close to those in type A.

Table 6. Difference in metrics after fine-tuning the base model

Entity type   Recall   Precision   F1
A             22.31    41.74       35.61
B             −4.45    20.69       3.08
6 Conclusion
In this paper we explore the extractive question-answering approach to the zero-shot NER task proposed in the QaNER paper [13]. We construct a test benchmark based on the OntoNotes English NER dataset and set up a series of experiments to determine the shortcomings of this approach in the zero-shot setting, testing prompt robustness, performance across different models, and the effects of fine-tuning on different entities. We find that QA models demonstrate impressive zero-shot capabilities but suffer from difficulties in differentiating between semantically close entities and from high prompt sensitivity. The nature of the QA objective makes it hard to create prompts that define clear boundaries between entities, causing high false positive rates. The analysis of six different models shows that the drastic difference between the pretraining QA data and the NER task formulated as a QA problem makes it hard to choose the best model beforehand, as an increase in metrics on one objective does not directly transfer to the other. Despite this, the proposed model can be used as part of a dialogue system [8] to perform slot filling, or standalone. For future work, we will extend this work by applying the proposed approach in the multi-task learning paradigm [7] to further improve model quality. Another potential direction is to balance the training data [10] in order to improve performance. We integrated our model into the DeepPavlov [2] framework.
References 1. Bornea, M., Pan, L., Rosenthal, S., Florian, R., Sil, A.: Multilingual transfer learning for QA using translation as data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 12,583–12,591 (2021) 2. Burtsev, M., et al.: DeepPavlov: an open source library for conversational AI. In: NIPS (2018). https://openreview.net/pdf?id=BJzyCF6Vn7 3. Campagna, G., Foryciarz, A., Moradshahi, M., Lam, M.S.: Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. arXiv preprint arXiv:2005.00891 (2020) 4. Chizhikova, A., Konovalov, V., Burtsev, M.: Multilingual case-insensitive named entity recognition. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. SCI, pp. 448–454. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2 46 5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 6. Gulyaev, P., Elistratova, E., Konovalov, V., Kuratov, Y., Pugachev, L., Burtsev, M.: Goal-oriented multi-task BERT-based dialogue state tracker (2020). https:// doi.org/10.48550/ARXIV.2002.02450, https://arxiv.org/abs/2002.02450 7. Karpov, D., Konovalov, V.: Knowledge transfer between tasks and languages in the multi-task encoder-agnostic transformer-based models. In: Computational Linguistics and Intellectual Technologies, vol. 2023 (2023). https:// doi.org/10.28995/2075-7182-2023-22-200-214, https://www.dialog-21.ru/media/ 5902/karpovdpluskonovalovv002.pdf
8. Konovalov, V., Artstein, R., Melamud, O., Dagan, I.: The negochat corpus of human-agent negotiation dialogues. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 3141–3145. European Language Resources Association (ELRA), Portoroˇz, Slovenia (2016). https://aclanthology.org/L16-1501 9. Konovalov, V., Gulyaev, P., Sorokin, A., Kuratov, Y., Burtsev, M.: Exploring the BERT cross-lingual transfer for reading comprehension. In: Computational Linguistics and Intellectual Technologies, pp. 445–453 (2020). https://doi. org/10.28995/2075-7182-2020-19-445-453, http://www.dialog-21.ru/media/5100/ konovalovvpplusetal-118.pdf 10. Konovalov, V., Melamud, O., Artstein, R., Dagan, I.: Collecting better training data using biased agent policies in negotiation dialogues. In: Proceedings of WOCHAT, the Second Workshop on Chatbots and Conversational Agent Technologies, Zerotype, Los Angeles (2016). http://workshop.colips.org/wochat/ documents/RP-270.pdf 11. Li, H., Tomko, M., Vasardani, M., Baldwin, T.: MultiSpanQA: a dataset for multispan question answering. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1250–1260 (2022) 12. Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., Li, J.: A unified MRC framework for named entity recognition. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5849–5859. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main. 519, https://aclanthology.org/2020.acl-main.519 13. Liu, A.T., Xiao, W., Zhu, H., Zhang, D., Li, S.W., Arnold, A.: QaNER: prompting question answering models for few-shot named entity recognition. arXiv preprint arXiv:2203.01543 (2022) 14. McCann, B., Keskar, N.S., Xiong, C., Socher, R.: The natural language decathlon: multitask learning as question answering. CoRR abs/1806.08730 (2018). http:// arxiv.org/abs/1806.08730 15. Pearce, K., Zhan, T., Komanduri, A., Zhan, J.: A comparative study of transformer-based language models on extractive question answering. arXiv preprint arXiv:2110.03142 (2021) 16. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for squad. arXiv preprint arXiv:1806.03822 (2018) 17. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: languageindependent named entity recognition. arXiv preprint cs/0306050 (2003) 18. Wang, J., et al.: SpanProto: a two-stage span-based prototypical network for fewshot named entity recognition. arXiv preprint arXiv:2210.09049 (2022) 19. Weischedel, R., et al.: OntoNotes release 4.0. LDC2011T03. Linguistic Data Consortium, Philadelphia (2011) 20. Wu, W., Wang, F., Yuan, A., Wu, F., Li, J.: CorefQA: coreference resolution as query-based span prediction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6953–6963. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main. 622, https://aclanthology.org/2020.acl-main.622 21. Yang, H., Li, D.W., Li, Z., Yang, D., Qi, J., Wu, B.: Open relation extraction via query-based span prediction. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds.) KSEM 2022. LNCS, vol. 13369, pp. 70–81. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-10986-7 6
TreeCurveNet - An improved CurveNet for Tree Species Classification

Che Zhang, Yaowen Huang, Elizaveta K. Sakharova, Anton I. Kanev, and Valery I. Terekhov

Bauman Moscow State Technical University (National Research University), Moscow, Russia
[email protected]
Abstract. Nowadays, remote sensing is widely used for large-scale forest surveys. The use of LiDAR (Light Detection and Ranging) has made it possible to obtain detailed 3D point clouds of scanned areas, significantly increasing the efficiency of identification. The breakthrough in 3D object classification has opened new opportunities for the practical application of deep learning methods to identify forest tree species, which is a key task for forest management. In this paper, we propose a CurveNet-based model better suited for tree species classification, TreeCurveNet. TreeCurveNet uses a deterministic algorithm to generate sampling curves. The results of tree classification experiments show that the TreeCurveNet model achieves the highest accuracy, 86.5%, compared to the PointNet, PointNet++, and CurveNet models.

Keywords: Tree species classification · LiDAR · taxation · deep learning · artificial intelligence
1 Introduction

A detailed analysis of the composition of forest species can provide useful information for assessing the ecological and economic value of forests, monitoring forest resources, planning and designing artificial forests, and counteracting the effects of climate change. At the same time, correctly identifying tree species is important for the study of forest ecosystems [1]. Although traditional methods of studying forest biomass can provide highly accurate vegetation parameters, data collection is time-consuming and labor-intensive, so alternative methods are needed to overcome these disadvantages [2]. Consequently, the development of an effective method for classifying tree species is an urgent task. Various remote sensing data are increasingly being used to classify tree species, including LiDAR, a laser-based distance measurement technology. Combining laser scanning data with artificial intelligence algorithms can provide better results in classifying tree species and increase the efficiency of forest resource monitoring. Therefore, the authors propose to use point-wise deep learning. We propose an efficient method for classifying tree species based on existing point cloud recognition methods.
2 Related works

Since the 2000s, there has been increasing interest in the use of terrestrial laser scanning (TLS) to measure individual trees and forest areas [3, 4]. However, classification of tree species using TLS data has been addressed in only a few studies [5]. For example, a random forest algorithm was used to recognize the three-dimensional geometric bark texture of 75 trees of five species (hornbeam, oak, spruce, beech, and pine) [6]; the resulting average classification accuracy was 85% ± 5%. Lin and Herold [7] used a support vector machine to classify a dataset of 40 trees of four species, obtaining a maximum overall accuracy of 90.0% and a reliable overall accuracy of 77.5%. A study with a larger sample of 1200 trees and a more automated solution was presented by Åkerblom et al. [8], where the quantitative structure model (QSM) was applied to obtain tree structural features used to classify tree species. The results of this work show that when single-species forest patches are used for training and testing, an average classification accuracy above 93% can be achieved. However, because the parameters of the classification method and the combination of features were not optimized, the accuracy was much lower when testing on mixed-species forest plots.

Depending on the type of input data of the neural network, existing methods for classifying three-dimensional objects can be divided into multiview, voxel, and point methods [9]. Multiview methods project an unstructured point cloud onto two-dimensional images, while voxel methods convert the point cloud into a three-dimensional volumetric representation called voxels. Proven 2D or 3D convolutional networks are then used to classify shapes. Point methods, in turn, process raw point clouds directly, without voxelization or projection. Point methods are becoming increasingly popular because no information is lost in this kind of data handling. PointNet [10] uses several multilayer perceptrons to model each point independently and achieves permutation invariance by combining global features through symmetric aggregation functions. In particular, PointNet learns point features independently in several MLP layers and extracts global features in a max aggregation layer. The PointNet algorithm cannot capture local structural information among points because the attributes of each point are examined independently. Therefore, Qi et al. [11] proposed the hierarchical PointNet++ to capture fine geometric structure from the vicinity of each point. The core of the PointNet++ hierarchy is the set abstraction layer, which consists of three parts: a sampling layer, a grouping layer, and a PointNet-based learning layer. PointNet++ learns the features of the local geometric structure by stacking several set abstraction layers and abstracts the local features layer by layer. Local aggregation of features is a basic operation that has been widely studied in recent years. Although the above-mentioned algorithms help to some extent to describe local patterns, the relations between distant points are not taken into account. The authors of [12] argue that a global comparison of points may not be sufficient to extract certain patterns inherent in the point cloud. They propose a new model, CurveNet, to improve the learning of point cloud geometry by generating continuous sequences of point segments. The basic idea is to group and aggregate a
sequence of points (curves). For this purpose, a curve module has been built into the model architecture (Fig. 1) to provide state-of-the-art object classification results.
Fig. 1. The CurveNet architecture.
3 Methods

Due to the specificity of tree point clouds, we optimized the curve grouping module of CurveNet and propose TreeCurveNet. TreeCurveNet uses a deterministic algorithm instead of an MLP for curve generation. Our model is named TreeCurveNet because the curve generation path follows the direction of tree branch growth. We use the same definition of a curve in the point cloud as in the literature. Given P, F and an isomorphic graph G = (F, E) with connectivity E computed by the KNN algorithm on P, a curve c of length l in feature space is generated as a sequence of point features in F, such that c = {s_1, ..., s_l | s ∈ F}. To group curves, consider a wandering policy π defined on the isomorphic graph G, which starts a curve from an initial point s_1 and takes l steps. Given the intermediate state of the curve s_i, obtained after i steps, we need to find the selection policy π(s_i), which determines the state of the curve at step i + 1. Given π, the curve c = {s_1, ..., s_l} can finally be grouped by iterating the following equation l times:

s_{i+1} = π(s_i),  1 ≤ i ≤ l,  i ∈ Z^+.   (1)
Based on the characteristics of a single-tree point cloud, we developed a selection algorithm π that avoids loops and obtains the curves that best describe the features of the point cloud.
The morphological structure of a tree consists of branches branching off from the main trunk, which in turn branch out further, producing more branches. Different kinds of trees have different bifurcation structures, so a good set of curves should reflect this bifurcation. From the main trunk of a tree one can always get to any of its branches, so the starting points of our selection algorithm lie on the main trunk. To select the starting points of the curves, we first find in the point cloud P the lowest point along the z-axis (i.e. the axis parallel to the direction of tree growth) and n − 1 equally spaced levels along the z-axis, which together form a set of base planes B = {{z = b_1}, ..., {z = b_n}} perpendicular to the z-axis. Then, for each reference plane {z = b_i} ∈ B, we find the m nearest points p_1^{b_i}, ..., p_m^{b_i} in the point cloud P. To avoid noise interference in the point cloud, we take their center R_i = (x^{b_i}, y^{b_i}, z^{b_i}); thus, for each base plane {z = b_i} ∈ B we obtain the corresponding root point R_i. Finally, for each root point R_i we find a set of k nearest points N_i = {p_1^{R_i}, ..., p_k^{R_i}} in the point cloud P. N = {N_1, ..., N_n} is the set of all starting points of curves. Starting points obtained this way are located on the main trunk of the tree at different heights. Our selection algorithm π produces curves from the trunk towards the tops of the tree.

In our method, for a curve c, the state s_i consists of the coordinates of the last point p_i and the penultimate point p_{i−1} of the curve, the feature f_i corresponding to point p_i, and the k nearest neighbors p_1^{s_i}, ..., p_k^{s_i} of point p_i. The neighbors p_1^{s_i}, ..., p_k^{s_i} are evaluated, and the point with the highest score becomes the new last point of the curve. The score is divided into three parts: vertical, horizontal, and inertial.

In order to lead the curve in the direction of tree growth, the scores of the points above point p_i must be raised. Therefore, the vertical score of p_j^{s_i} ∈ {p_1^{s_i}, ..., p_k^{s_i}} is chosen as the cosine of the angle between the vector starting at the last point p_i of the curve c and ending at p_j^{s_i}, and the unit vector in the positive direction of the z-axis. The trunk of the tree splits after a certain height and the branches grow horizontally. So that the curve can follow the branches, we weight the vertical score by a coefficient (l − i)/l, which decreases as the length of the curve grows. With this coefficient, the influence of the vertical score on the overall score decreases as the curve spreads higher up the tree.

If the distances between the curves are too small or the curves overlap, then, on the one hand, some features are extracted repeatedly and, on the other hand, some features are not extracted at all. Therefore, a horizontal score is designed to spread the curves evenly from the trunk in each horizontal direction and to reduce back loops. The horizontal score of p_j^{s_i} ∈ {p_1^{s_i}, ..., p_k^{s_i}} is the cosine of the angle between the vector starting at the last point p_i of curve c and ending at p_j^{s_i}, and the unit vector of the initial horizontal direction v. Initial horizontal directions with angles (1/k)·2π, (2/k)·2π, ..., 2π (k = |N_i|) are randomly assigned to the starting points of N_i ∈ N that lie at the same height. We weight the horizontal score by a coefficient i/l, which increases as the curve length grows, because the tree has fewer forks in the lower positions, and in the upper positions the branches grow more horizontally.

The inertial part was designed to reduce loops and smooth the curve.
The inertial score of p_j^{s_i} ∈ {p_1^{s_i}, ..., p_k^{s_i}} is the cosine of the angle between the vector starting at the
last point p_i of the curve c and ending at p_j^{s_i}, and the vector starting at the penultimate point p_{i−1} and ending at the last point p_i. In summary, the score of one of the k nearest neighbors of the point p_i (p_j^{s_i} ∈ {p_1^{s_i}, ..., p_k^{s_i}}) can be described by Eq. 2:

Score = ((l − i)/l)·cos θ1 + (i/l)·cos θ2 + cos θ3,   (2)

where θ1 is the angle between the vector p_i p_j^{s_i} and the positive direction of the z-axis; θ2 is the angle between p_i p_j^{s_i} and v (the initial horizontal direction); and θ3 is the angle between p_i p_j^{s_i} and the vector p_{i−1} p_i (Fig. 2).
Fig. 2. The three angles used to calculate the score
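A minimal sketch of the scoring rule in Eq. (2) and of the greedy step of the deterministic walk; the KNN search, the starting-point selection, and the feature aggregation are omitted here:

```python
import numpy as np

def cos_angle(u, v, eps=1e-9):
    """Cosine of the angle between two 3D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def neighbor_score(p_prev, p_i, p_j, v_init, i, l):
    """Score of candidate neighbor p_j for a curve ending at p_i (Eq. 2):
    length-weighted vertical and horizontal terms plus an inertial term."""
    d = p_j - p_i
    z_axis = np.array([0.0, 0.0, 1.0])
    return (((l - i) / l) * cos_angle(d, z_axis)   # vertical term
            + (i / l) * cos_angle(d, v_init)       # horizontal term
            + cos_angle(d, p_i - p_prev))          # inertial term

def next_point(p_prev, p_i, neighbors, v_init, i, l):
    """Greedy step: pick the highest-scoring of the k nearest neighbors."""
    scores = [neighbor_score(p_prev, p_i, q, v_init, i, l) for q in neighbors]
    return neighbors[int(np.argmax(scores))]
```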
The purpose of curve aggregation is to enrich the intra-channel diversity of relative encoding features and ultimately provide a better description of graph G. As in CurveNet [12], we use a similar aggregation method: the curve grouping (CG) and curve aggregation (CA) blocks are embedded into a Curve Intervention Convolution (CIC) block, and 8 CIC blocks are stacked to build the network, which we call TreeCurveNet. In each CIC block, the curves are first grouped and then combined with all the pointwise features.
4 Experiments

For the experiments we used four deep learning models: PointNet, PointNet++, CurveNet, and TreeCurveNet. We trained them from scratch on a dataset consisting of 354 trees collected in Central Russia. The data can be divided into three categories based on tree species: spruce – 242, pine – 160, birch – 104. To decrease the computational cost, we used Farthest Point Sampling (FPS) to reduce the number of points. For training each of the models we chose the following hyperparameters: batch size – 32, number of points – 2048, number of epochs – 200, optimizer –
SGD, learning rate – 0.1, weight decay – 10^-4, momentum – 0.9. The PyTorch library (1.10.0 + CUDA 11.3) was used to implement the methods. The GPU used in this study was an NVIDIA GeForce RTX 3080 (12 GB).

Table 1. Multiclass accuracy of the algorithms

Methods        Accuracy   Precision
PointNet       71.5%      73.2%
PointNet++     75.7%      76.5%
CurveNet       84.6%      85.9%
TreeCurveNet   86.5%      86.7%
Based on the experimental results obtained after training and testing the models on the training and test data, respectively, we conclude that the TreeCurveNet model recognizes tree species better than the other algorithms, with a maximum accuracy of 86.5%. In addition, CurveNet and TreeCurveNet are much more accurate than PointNet and PointNet++ (Table 1).
Fig. 3. Error matrices
The error matrices show that TreeCurveNet is able to identify the undersampled categories (birch) more accurately than CurveNet. This may be due to the reduced number of parameters when using the deterministic curve generation algorithm (Fig. 3).
5 Conclusion

In this paper, four deep point cloud learning methods were used to study the point cloud classification of individual trees. One of these methods, TreeCurveNet, is our improved method based on CurveNet. According to the results, TreeCurveNet can identify tree species in individual tree point clouds more accurately than CurveNet. This is because TreeCurveNet uses a curve grouping method that is suitable for extracting tree features and avoids the performance drop of CurveNet when training on small samples. Both methods have significantly higher classification accuracy than PointNet and PointNet++, since curve-based feature extraction can better account for long-range and local features.
References 1. Sakharova, E.K., Nurlyeva, D.D., Fedorova, A.A., Yakubov, A.R., Kanev, A.I.: Issues of tree species classification from LiDAR data using deep learning model. In: Advances in Neural Computation, Machine Learning, and Cognitive Research V: Selected Papers from the XXIII International Conference on Neuroinformatics, October 18–22, 2021, pp. 319–324. Russia. Springer, Moscow (2022) 2. Grishin, I.A., Sakharova, E.K., Ustinov, S.M., Kanev, A.I., Terekhov, V.I.: Tree Inventory with LiDAR Data. In: Advances in Neural Computation, Machine Learning, and Cognitive Research VI: Selected Papers from the XXIV International Conference on Neuroinformatics, October 17–21, 2022, pp. 3–11. Russia. Springer, Moscow (2022) 3. Newnham, G.J., et al.: Terrestrial laser scanning for plot-scale forest measurement. Current Forestry Reports 1, 239–251 (2015) 4. Li, J., Hu, B., Noland, T.L.: Classification of tree species based on structural features derived from high density LiDAR data. Agric. For. Meteorol. 171, 104–114 (2013) 5. Puttonen, E., et al.: Tree species classification from fused active hyperspectral reflectance and LIDAR measurements. For. Ecol. Manage. 260, 1843–1852 (2010) 6. Othmani, A., Voon, L.F.L.Y., Stolz, C., Piboule, A.: Single tree species classification from terrestrial laser scanning data for forest inventory. Pattern Recogn. Lett. 34, 2144–2150 (2013) 7. Lin, Y., Herold, M.: Tree species classification based on explicit tree structure feature parameters derived from static terrestrial laser scanning data. Agric. For. Meteorol. 216, 105–114 (2016) 8. Åkerblom, M., Raumonen, P., Mäkipää, R., Kaasalainen, M.: Automatic tree species recognition with quantitative structure models. Remote Sens. Environ. 191, 1–12 (2017) 9. Guo, Y., Wang, H., Hu, Q., et al.: Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4338–4364 (2020) 10. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660 (2017) 11. Qi, C.R., Yi, L., Su, H., Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017) 12. Xiang, T., Zhang, C., Song, Y., et al.: Walk in the cloud: Learning curves for point clouds shape analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 915–924 (2021)
Dialogue Graphs: Enhancing Response Selection Through Target Node Separation

Grigory Minakov, Mumtozbek Akhmadjonov, and Denis Kuznetsov

Moscow Institute of Physics and Technology, Dolgoprudny, Russia
{minakov.ga,akhmadzhonov.mk,kuznetsov.den.p}@phystech.edu
Abstract. This paper proposes a new method called Target Node Separation to address the problem of accurately performing the response selection task in dialogue systems. The proposed method enhances the performance of response selection by refining the graph structure through improving edge ends. The authors compare the proposed method to other state-of-the-art methods on the MultiWoZ dataset and find that the proposed approach outperforms other graph-based methods and SBERTMap on recall metrics. Furthermore, the authors observed that increasing the number of clusters results in an improvement in the performance of the dialogue graph.

Keywords: Dialogue systems · Response selection · Dialogue graph · Clustering · Intents · Graph neural network

1 Introduction
Dialogue agents are widely used nowadays [6]: from virtual assistants that provide a convenient interface to smart devices, to chat-bots offering different services. The main purpose of using such conversational agents is to automate dialogues [20]. There are two main paradigms in building intelligent dialogue systems: task-oriented (a.k.a. goal-oriented) dialogue systems [8,27] for task-specific functions, and open-domain dialogue systems for non-goal-oriented chitchat [1,16,29]. Along with these two groups, QA systems are of high research interest [5,10] as well. All approaches used for building conversational agents are now almost fully data-driven, supported by modular or end-to-end machine learning frameworks [15,21]. Large conversational corpora are used to train and evaluate a variety of models for conversational response generation [3,30] or selection [12,18]. Response selection is the main task investigated in this research: the task is to identify a correct response to a given conversational context from a set of candidates. The study in [19] focused on solving this task better than other modern approaches on goal-oriented dialogues by constructing dialogue graphs automatically. The success of dialogue graphs in this task is due to the regular structure of such dialogues and their representation as chains of intents. The
main assumption was that intents formed Markov chains, which were modelled using dialogue graphs. The primary contributions of this paper can be summarized as follows:
– We introduced a novel approach called Target Node Separation, which enhances the performance of the dialogue graph in the response selection task.
– We conducted a thorough analysis of the impact of the number of clusters on performance and demonstrated that increasing the cluster count is beneficial for the graph.
All code is available here (anonymized link, see submitted archive) and distributed under the Apache 2.0 License.
2 Related Work
Dialogue Graph Auto Construction. The main assumption of the research in this paper is that dialogues have a regular structure in most cases, and this structure can be successfully embodied in a dialogue graph. There are several ways to create and represent dialogue graphs. The obvious way is to store whole dialogue states with all the received information in graph nodes, interpreting edges as actions (or intents) [9]. Another approach is Graph2Bots [4], where the utterances are grouped into nodes using bi-clustering, and the edges are the transitions between the nodes in the dialogues from the dataset. The method introduced in [19] is quite similar to Graph2Bots, but it employs a different clustering approach which involves using K-means [24] on utterance embeddings to create nodes. The dialogue graph construction process can be extended to several stages using a modified node2vec algorithm [22]. Eventually, the nodes contain utterances that are semantically or contextually similar and have the same intents. The graph from this method can be bipartite or have more lobes: for each speaker the graph can contain a lobe whose nodes consist of only that speaker's utterances. In our studies, the dialogue graph has only one lobe.
Linking Prediction. The simplest way to draw edges between the nodes of a dialogue graph is to count the frequencies of transitions. However, by treating a dialogue as a path in a dialogue graph, as in the method described above, predicting the next intent is equivalent to predicting the next node in the graph. The problem can be formulated as a multi-class or a binary classification problem [2].
Response Selection. There are several techniques available for solving response selection tasks in dialogues. One effective approach is to use pretrained language models to create embedded vectors that capture the overall meaning of both the dialogue context and the potential responses. This technique is utilized in ConveRT [12], a lightweight transformer network [26] that generates embedding vectors suitable for scoring responses. Another option is to utilize the sentence-based embedding network known as SentenceBERT [23].
DGAC [19] can also be used to solve the task of best response selection, and it has shown a performance boost on task-oriented dialogues from MultiWOZ 2.2 [28]. The dialogue graph is used to restrict the set of possible candidates for the next response by collecting them greedily from a specified number of top next clusters. The problem of response selection therefore heavily depends on the quality of linking prediction in the dialogue graph. However, the method has not been tested on other domains or dialogue types, so we have explored the abilities of dialogue graphs on different datasets. In order to effectively assess the accuracy of response selection models, it is necessary to employ appropriate datasets. Specifically, datasets that are goal-oriented, like MultiWOZ, are advantageous due to the extensive contextual information and comprehensive understanding of user objectives that are necessary for generating accurate responses. On the other hand, question-answering datasets, such as AmazonQA [10], can also benefit from the methods discussed.
3 Datasets
Dialogue corpora are not the primary focus of this research. Dialogue graphs are constructed from conversational datasets, and, generally, the graph contains all of the utterances of the corpus. The datasets used in this study are described below. It is important to note that our method treats dialogue datasets as sequences of utterances; therefore any additional annotations or metadata were not necessary for our work, as we focused on the probabilistic structure of conversations rather than utterance-level attributes. To assess the method's generalizability, we tested it on open-domain and QA datasets; a detailed description of the datasets used is provided in Appendix A.1.
4 Response Selection
This section details the proposed approach for response selection, which involves constructing dialogue graphs automatically. Specifically, we will discuss the process of creating dialogue graphs, as well as constructing them using TNS to enhance the performance of response selection.
4.1 Embedders
In order to effectively process textual data, it is essential to create an appropriate numerical representation. This representation can either be sparse and discrete, or dense. For our purposes, we utilized dense vector representations generated by Sentence Transformers [23].
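A minimal sketch of how such dense utterance vectors can be obtained with the Sentence Transformers library is shown below; the specific checkpoint name is an illustrative assumption, since the text only states that SentenceBERT-style embeddings [23] are used.

```python
# Sketch: dense utterance embeddings with Sentence Transformers.
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint choice; any SentenceBERT-style model would do here.
model = SentenceTransformer("all-MiniLM-L6-v2")

utterances = [
    "I need a cheap restaurant in the city centre.",
    "There are several options. Do you have a cuisine preference?",
]
# encode() returns one dense vector per utterance (a NumPy array of shape [n, dim]).
embeddings = model.encode(utterances)
print(embeddings.shape)
```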
4.2 Non-graph Models
For the response selection task, a method called MAP, as described in [11], can be employed using single utterance embeddings obtained from BERT [7]. This approach involves learning a linear mapping on top of the utterance vector and performing selection by comparing the suggested utterance vectors with the mapped ones. However, Sentence BERT [23], a more accurate sentence embedder, has been shown to be highly applicable to NLI [25] tasks. It was trained in a siamese manner to create vector representations of semantically similar sentences that have a larger similarity in a given measure. In this paper, we utilize Sentence BERT instead of BERT and refer to this method as SBERTMap.
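The rough sketch below illustrates this "map then compare" idea: a linear map is fitted from context embeddings to response embeddings, and a candidate is selected by similarity to the mapped context vector. The least-squares fit stands in for the actual training procedure, which is not spelled out here, so treat the details as assumptions.

```python
# Sketch of a linear-mapping selection scheme in the spirit of SBERTMap.
import numpy as np

def fit_linear_map(ctx_emb: np.ndarray, resp_emb: np.ndarray) -> np.ndarray:
    # Solve W such that ctx_emb @ W approximates resp_emb (plain least squares).
    W, *_ = np.linalg.lstsq(ctx_emb, resp_emb, rcond=None)
    return W

def select_response(context_vec: np.ndarray, candidates: np.ndarray, W: np.ndarray) -> int:
    target = context_vec @ W
    # Cosine similarity between the mapped context and every candidate embedding.
    sims = candidates @ target / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(target) + 1e-9
    )
    return int(np.argmax(sims))
```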
4.3 Graph Models
Dialogue Graph. This subsection provides a brief overview of the original method for automatically constructing dialogue graphs, with detailed information available in [19]. The core concept of the approach is to represent a dialogue as a chain of unsupervised intents connected probabilistically. The original method constructs a Markov chain dialogue graph with states and transitions. A state of the graph is represented as a cluster of utterance embeddings that is induced by clustering and is considered an intent. Graph edges are established by counting the frequencies of transitions between clusters in the dialogue dataset used to construct the graph. The process of building a graph is either a single stage or can be split into several stages. In the one-stage method, all utterances in the dialogue corpus are embedded into a latent vector space using the embedders described in Sect. 4.1. The embedding vectors are then clustered into a specified number of groups, and, finally, the probabilistic transitions are built. In the two-stage method, the first stage is identical to the one-stage method; however, a significantly larger number of clusters is formed. The clusters are then clustered into groups using cluster2vec, also described in the original paper, and all vertices within large clusters are merged. Transitions are then recomputed for the newly formed vertices. For our research we only considered the one-stage method as a baseline and introduce our enhancements for this variant. However, we hypothesize that the provided improvements can boost the two-stage method as well.
Dialogue Graph with TNS. The primary contribution of our paper is the proposal of a new method to enhance the quality of response selection tasks, which we have named Target Node Separation. This section provides a comprehensive explanation of the method. The Target Node Separation method is a modification of the single-stage baseline approach proposed in [19]. In this method, we augment the baseline constructed dialogue graph with additional data by computing the averaged target utterance embedding for all consecutive utterance pairs in the training set that belong to each edge. This modification can be expressed using the following
formula. Let the new vector associated with the edge from the i-th cluster to the j-th cluster be denoted as v_ij. Let c(x) denote the cluster of utterance x and e(x) the embedding vector of the utterance. Let the count of edges from cluster i to cluster j be denoted as N_ij. Then the formula for the new vector v_ij can be written as follows:

v_ij = (1/N_ij) Σ_{c(x)=i, c(y)=j} e(y).   (1)
Since only the most probable edge from any cluster is used during evaluation for response selection, it is sufficient to compute additional vectors for those edges only. This significantly reduces memory consumption and allows for the utilization of more clusters in computational experiments. Essentially, this reduction transforms the whole graph into a mapping of clusters to target vectors. When two source clusters are connected to vectors belonging to the same target cluster, at least one cluster remains unused in the target space while another is used at least twice. Thus, the method provides a technique for eliminating unimportant clusters in the target space by substituting them with ones obtained from separating others. It should be noted that the baseline dialogue graph approach can also be expressed in terms of mappings, but it differs from the Target Node Separation method in that the latter no longer maps to the cluster centroids. Figure 1 visualizes the difference between this method and the original dialogue graph approach. This method can be viewed as constructing a more representative target feature space for the task.
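A minimal sketch of the one-stage dialogue graph with Target Node Separation is given below, following Eq. (1): utterance embeddings are clustered with K-means, the most probable next cluster is kept for every source cluster, and the averaged embedding of the target utterances on that edge is stored as v_ij. The candidate-scoring step and several details are simplified here and should be treated as assumptions rather than the authors' exact implementation.

```python
# Sketch: one-stage dialogue graph with Target Node Separation (TNS).
import numpy as np
from collections import Counter, defaultdict
from sklearn.cluster import KMeans

def build_tns_graph(embeddings, dialogue_turns, n_clusters):
    """embeddings[i] is the vector of utterance i; dialogue_turns lists utterance ids per dialogue."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    labels = km.labels_

    # Count transitions between the clusters of consecutive utterances
    # and collect the target-side embeddings for every edge.
    edge_counts = Counter()
    edge_targets = defaultdict(list)
    for turns in dialogue_turns:
        for x, y in zip(turns, turns[1:]):
            edge = (labels[x], labels[y])
            edge_counts[edge] += 1
            edge_targets[edge].append(embeddings[y])

    # For every source cluster keep only its most probable outgoing edge and
    # store the averaged target embedding v_ij on that edge (Eq. (1)).
    target_vectors = {}
    for i in range(n_clusters):
        outgoing = [(cnt, j) for (src, j), cnt in edge_counts.items() if src == i]
        if outgoing:
            _, j = max(outgoing)
            target_vectors[i] = np.mean(edge_targets[(i, j)], axis=0)
    return km, target_vectors

def select(km, target_vectors, context_vec, candidate_vecs):
    cluster = int(km.predict(context_vec[None, :])[0])
    target = target_vectors.get(cluster, context_vec)
    sims = candidate_vecs @ target / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(target) + 1e-9
    )
    return int(np.argmax(sims))
```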
4.4 Metrics
In order to assess the accuracy of the proposed methods, it is necessary to devise an appropriate metric that meets certain requirements. Specifically, the metric should be clustering independent, such that the method does not directly benefit from increasing or decreasing the number of utterance clusters. Furthermore, the metric should enable comparison with methods that do not involve clustering at all. Additionally, the metric must exclude any test exploitation, meaning that it should not utilize responses from the training phase during evaluation. If such information were to be utilized, the model might attempt to dishonestly simplify the selection task. Moreover, the metric should be embedding independent, meaning that it should not rely explicitly on any particular method of embedding utterances into a vector space. This type of dependency, while potentially simplifying the scoring approach, could bind the scoring results to the quality of the selected embedding and unreasonably benefit the use of similar embeddings in the model. We employed the Recall_n@k metric for our evaluation, which comprises the following steps: for each successive pair of utterances in the training set, the model is provided with the first utterance and prompted to choose the subsequent utterance from a constructed set. To create this set, the entire test utterance set is randomly sampled n − 1 times, and the gold utterance is included in the set as
well. Then, the model selects the k most relevant utterances from this randomly permuted set. If the gold utterance is part of the selected set, the result for this pair is considered to be 1; otherwise, it is 0. By aggregating these results, Recall_n@k is computed as the average result among all successive pairs. This Recall_n@k metric has been utilized in prior research [11,12].
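A compact sketch of this evaluation loop is shown below; the `score` callable is an assumed interface (higher means more relevant), and the possibility of sampling duplicates is ignored for brevity.

```python
# Sketch: Recall_n@k over consecutive (context, gold response) pairs.
import random

def recall_n_at_k(pairs, test_utterances, score, n=100, k=1, seed=0):
    rng = random.Random(seed)
    hits = 0
    for context, gold in pairs:
        # Mix the gold response with n-1 randomly sampled test utterances.
        candidates = rng.sample(test_utterances, n - 1) + [gold]
        rng.shuffle(candidates)
        ranked = sorted(candidates, key=lambda cand: score(context, cand), reverse=True)
        hits += int(gold in ranked[:k])
    return hits / len(pairs)
```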
Fig. 1. The comparison between the Dialogue Graph without (a) and with Target Node Separation (b) is presented in the figure. The main differences between the two approaches are illustrated, where the baseline approach maps the entire source cluster to its target centroid, while Target Node Separation maps it to the averaged target vector. In the original Dialogue Graph approach, if two different clusters are linked to the same cluster, they will be mapped to the same point (centroid). However, in Target Node Separation, different points may be assigned based on the source cluster
5 Experiments
5.1 Setup
We base our training on the MultiWoZ dataset [28], which was divided into two subsets. The evaluation stage involved 1000 dialogues, while the training stage used 8432 dialogues. In our study, we analyzed the performance of three different methods, namely SBERTMap, the baseline one-stage dialogue graph, and Target Node Separation. To evaluate the methods, multiple deterministic random seeds were used, affecting the KMeans clustering [24], the initialization of SBERTMap weights, and the random sampling of candidate responses. However, it should be noted that the provided dialogue turn embeddings are not modified or regenerated in any way.
5.2 Dialogue Graph and Dialogue Graph with TNS
The Recall100@1 metric was implemented to evaluate the effectiveness of the models. This metric mandates that the model select a single optimal utterance. The graph-based methods were evaluated across different cluster quantities, while the SBERTMap approach is independent of clustering. SBERTMap was included to compare graph-based approaches to the most basic approach of utilizing utterance embeddings. We trained SBERTMap using a batch size of 256. Graphs are constructed using SentenceBERT [23] embedding vectors. In order to select an appropriate utterance, all the methods under investigation follow a similar procedure. First, an expected point in embedding space is generated, and then the closest point to the query is selected. While the ConveRT embeddings utilized in the baseline approach [19] have demonstrated improved performance, they are susceptible to test exploitation, which we sought to avoid in our evaluation. Instead, we utilized SentenceBERT embeddings in both graph-based methods and SBERTMap to ensure a fair comparison across approaches. We also scored the pretrained ConveRT network on the task.
5.3 Optimal Cluster Count on MultiWoZ
In order to determine the optimal number of clusters for the response selection task, we utilized a more advanced method due to the high computational costs associated with experiments involving a large number of clusters. Our approach involved reducing the training set by a factor that increased incrementally. We observed that when the number of clusters was equal to the number of utterances in the training set, the scenario was similar to the one-nearest-neighbor approach. However, the one-nearest-neighbor approach was found to be more effective. Therefore, we evaluated the Target Node Separation approach on the original training set and on training sets reduced by powers of two, as well as the one-nearest-neighbor approach, which involved extending the number of clusters in the original experiment. The number of clusters was limited to the number of utterances in the training set, which is a natural limitation arising from the clustering constraints.
5.4 Consecutive Utterance Embedding for Enhanced Clustering
To explore the potential advantages of incorporating all consecutive utterance information in clustering, which may be a contributing factor to the good performance of the Target Node Separation method, we carried out an experiment. Specifically, for each pair of consecutive utterances in the training set, denoted as an edge, we constructed a new embedding using a convex combination of the embeddings of the two utterances obtained from the SentenceBERT [23] embedding. During evaluation, we obtained the edge embedding using the following procedure: for the first utterance in the edge, we found the nearest utterance in the training set and used its edge embedding. This allowed us to utilize edges in clustering while excluding test set exploitation. We used these edge embeddings
instead of utterance embeddings in the one-stage Dialogue Graph [19] building procedure and evaluated the new Dialogue Graph for response selection. We evaluated the method on the MultiWoZ [28] dataset using various values of α. Notably, when α = 1, the proposed method is equivalent to the original Dialogue Graph approach, since the edge embedding then reduces to the embedding of the current utterance.
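A small sketch of this edge-embedding construction (Eq. (2) in Appendix A.2) and of the nearest-training-utterance lookup used at evaluation time is given below; the brute-force nearest-neighbor search is an illustrative simplification.

```python
# Sketch: edge embeddings as a convex combination of two utterance embeddings.
import numpy as np

def edge_embedding(e1: np.ndarray, e2: np.ndarray, alpha: float) -> np.ndarray:
    return alpha * e1 + (1.0 - alpha) * e2

def eval_edge_embedding(query_vec, train_vecs, train_edge_vecs):
    # At evaluation time, reuse the edge embedding of the closest training
    # utterance so that no test responses are exploited.
    nearest = int(np.argmin(np.linalg.norm(train_vecs - query_vec, axis=1)))
    return train_edge_vecs[nearest]
```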
6 Results and Discussion
The performance of the one-stage Dialogue Graph approach, Target Node Separation, and SBERTMap is depicted in Fig. 2. Each method was evaluated three times, and the figure shows the two-standard-deviation intervals for each evaluation. In terms of performance, both graph-based methods outperform SBERTMap for any sufficiently large cluster count. Target Node Separation outperforms SBERTMap when the number of clusters is greater than 50, whereas the original Dialogue Graph approach requires 150 clusters to surpass SBERTMap. Across all cluster counts, the Target Node Separation approach outperforms the baseline approach in the response selection task, indicating the superiority of the proposed method.
Fig. 2. The performance of the Dialogue Graph approach, Target Node Separation (proposed in this paper), and SBERTMap on MultiWoZ dataset. Notably, SBERTMap is not dependent on clustering, whereas Target Node Separation enhances selection by predicting per-link averaged centers instead of cluster centroids.
The performance of the proposed methods was also evaluated on additional datasets, and the results are presented in Table 1, where Recall100@1 metric values are reported relative to the SBERTMap approach. On the MultiWoZ,
OpenSubtitles, and PersonaChat datasets, the Dialogue Graph approach demonstrated superior performance compared to the SBERTMap approach. Furthermore, the incorporation of Target Node Separation further enhanced the performance of the Dialogue Graph approach on all of the evaluated datasets. Results on the other datasets are provided in Appendix A.3.

Table 1. The Recall100@1 metric values for the response selection performance of Baseline DG and Target Node Separation relative to SBERTMap. Positive values indicate cases where the approach outperforms SBERTMap.

Dataset         Baseline DG   Target Node Separation
MultiWoZ           0.013          0.019
AmazonQA          −0.177         −0.175
FoCus             −0.022          0.001
DailyDialog       −0.049         −0.043
OpenSubtitles      0.047          0.056
PersonaChat        0.040          0.048
In Fig. 3, we present the response selection performance of the method trained on the reduced training set. The results show that the one nearest neighbour method outperforms all evaluated methods, indicating that increasing the cluster count is generally beneficial for this approach. This observation is supported by the fact that the performance increases with the number of clusters for every reduced training set. Interestingly, in this scenario, the original Dialogue Graph approach is equivalent to the Target Node Separation method. Given that these methods share the same nature, the Target Node Separation method requires fewer clusters, i.e., fewer dimensions in the feature space to build meaningful representations. Thus, the representative dialogue graph could still be used for other purposes and perform selection tasks requiring significantly less space and time than the one nearest neighbour approach. The results for the enhanced clustering method through the convex combination of utterance embedding are provided in Appendix A.2. The baseline approach outperforms the edge clustering methods for all α parameters. Furthermore, increasing the α value, which implies relying more on the edge source, results in higher accuracy. This suggests that the dialogue graph should be constructed on utterance clusters rather than edge clusters. Therefore, the good performance of the Target Node Separation method may be attributed to utilizing the structure of consecutive utterances in a different way.
Fig. 3. Response selection performance of the Target Node Separation method trained on a reduced training set, compared to the one nearest neighbor approach which can be viewed as a dialogue graph with the same number of clusters as the total number of utterances in the original training set.
7 Conclusion
In this research, we introduced a new method called Target Node Separation, which utilizes a one-stage dialogue graph for response selection tasks. This approach not only enhances the performance of response selection compared to other methods that use the same number of clusters, but also improves the graph structure by refining edge ends. Except for AmazonQA, the Dialogue Graph with TNS achieved better performance than all other methods on all datasets. Furthermore, we conducted experiments to compare the performance of different methods for response selection tasks. The results showed that the one-nearest-neighbor approach outperforms the other methods, including the Dialogue Graph and our proposed method. We also observed that increasing the number of clusters leads to better performance for the dialogue graph, indicating that the use of more clusters can capture more detailed information and enhance the representation of the dialogues. Through our experiments with edge clustering, we have demonstrated that clustering directly on the utterances results in better performance for the Dialogue Graph. Overall, our study demonstrates the effectiveness of Target Node Separation in response selection tasks and provides insights into the impact of cluster count on the performance of dialogue graphs.
A Appendix
A.1 Description of Datasets
MultiWOZ [28]. This is a collection of task-oriented dialogues from 7 different domains, comprising over 110,000 utterances. The dataset was collected in a
human-to-human manner in the Wizard-of-Oz (WOZ) framework [14]. Originally, the dataset was constructed for dialogue-state tracking tasks and has high-quality annotations for dialogue acts. This dataset was chosen because it is large-scale, goal-oriented and covers many domains. Additionally, the baseline method used this dataset, so we leveraged it in our approach for a fair comparison.
DailyDialog [16]. Unlike MultiWOZ, DailyDialog is an open-domain chitchat dataset. This is a manually labeled multi-turn dialogue dataset containing approximately 13,000 daily conversations on 10 different topics, with annotations for emotion and dialogue acts. The language of the conversations is free of slang and more formal than in other dialogue corpora, as the dataset was collected from different websites for learning English. The dataset was chosen for its simplicity and non-goal-orientedness.
PersonaChat [29]. This is an open-domain, crowd-sourced dataset containing over 10,000 conversations between two speakers, with 1155 possible personalities for each speaker. A personality consists of 5–6 sentences. The main purpose of the dataset is to build dialogue agents that leverage their own personality or the personality of the other speaker, thus increasing engagement and dialogue consistency. However, we did not consider persona sentences and treated this dataset as just an open-domain conversational corpus. We attempted to extract regular structures from its dialogues.
FoCus. Introduced in [13], FoCus is a set of about 14,000 human-machine conversations regarding geographical landmark guidance. Each dialogue includes a persona of the user consisting of 5 sentences, and a knowledge base relevant to the conversation, which is an article from Wikipedia. The majority of the dialogues are goal-oriented, where a user asks the bot about a specific geographical location, so we assume that all conversations follow probabilistic templates, which we investigated in this work.
AmazonQA [10]. The dataset is composed of questions and answers scraped from Amazon product pages. The full dataset has approximately 3.6 billion examples split into different categories. We tested our method on the "automotive" and "video games" QA categories. In cases where a question has multiple answers in the dataset, we only considered questions with a single answer. Additionally, we only considered examples where the answer contains at least four words, to make the question answering more informative. We chose this dataset because it contains real-world data, mostly with descriptive answers. As QA pairs cannot logically form full dialogues, the dialogue graph constructed from such corpora becomes very sparse. However, it is interesting to consider whether such graphs can enhance best response selection.
OpenSubtitles [17]. This is a growing online collection of subtitles for movies and television shows in multiple languages. Conversations are formed from consecutive lines of subtitles, but there is no mapping between different speakers and lines. Consecutive lines may not belong to the same scene or the show as
well. The English dataset has about 440 million lines, split into chunks of 100,000 lines each. We used 2 chunks and split every 10 consecutive lines into a single dialogue. Then we split those dialogues into train and test sets in a 9:1 ratio. Despite its non-standard structure, this dataset can still be used to model the mapping from conversational contexts to responses.
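A small sketch of this preparation step is given below; whether the dialogues are shuffled before the 9:1 split is not stated in the text, so the shuffle here is an assumption.

```python
# Sketch: chunk subtitle lines into 10-turn "dialogues" and split 9:1.
import random

def subtitles_to_dialogues(lines, turns_per_dialogue=10, test_ratio=0.1, seed=0):
    dialogues = [
        lines[i:i + turns_per_dialogue]
        for i in range(0, len(lines), turns_per_dialogue)
    ]
    random.Random(seed).shuffle(dialogues)  # assumed; not specified in the text
    split = int(len(dialogues) * (1 - test_ratio))
    return dialogues[:split], dialogues[split:]
```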
A.2 Edge Clustering
(See Fig. 4).
Fig. 4. Performance of response selection on Dialogue Graph with edge clustering, where low alpha values indicate clustering based on the accumulation of target utterances, and values approaching one indicate the baseline clustering using only the current utterance. Each method was evaluated twice, and the figure shows the two standard deviation intervals for each evaluation.
Formally, edge embeddings are constructed as follows. Let e(·) be the embedding function, (u1, u2) an edge, and α ∈ [0, 1] an arbitrary parameter. We denote the edge embedding as

e(u1, u2) = e(u1) · α + e(u2) · (1 − α).   (2)

A.3 Performance on All Datasets
(See Fig. 5).
Fig. 5. Performance on different datasets
References 1. Adiwardana, D., et al.: Towards a human-like open-domain chatbot. CoRR abs/2001.09977 (2020). https://arxiv.org/abs/2001.09977 2. Banerjee, S., Khapra, M.: Graph convolutional network with sequential attention for goal-oriented dialogue systems. Trans. Assoc. Comput. Linguist. 7, 485–500 (2019). https://doi.org/10.1162/tacl_a_00284 3. Bi, B., Li, C., Wu, C., Yan, M., Wang, W.: PALM: pre-training an autoencoding&autoregressive language model for context-conditioned generation. CoRR abs/2004.07159 (2020). https://arxiv.org/abs/2004.07159
4. Bouraoui, J.L., Le Meitour, S., Carbou, R., Barahona, L.M.R., Lemaire, V.: Graph2bots, unsupervised assistance for designing chatbots. In: Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 114–117 (2019) 5. Bouziane, A., Bouchiha, D., Doumi, N., Malki, M.: Question answering systems: survey and trends. Procedia Comput. Sci. 73, 366–375 (2015). https://doi. org/10.1016/j.procs.2015.12.005, https://www.sciencedirect.com/science/article/ pii/S1877050915034663. International Conference on Advanced Wireless Information and Communication Technologies (AWICT 2015) 6. Burtsev, M.S., et al.: DeepPavlov: Open-source library for dialogue systems. In: ACL (4), pp. 122–127 (2018) 7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). https://doi.org/10. 48550/ARXIV.1810.04805, https://arxiv.org/abs/1810.04805 8. El Asri, L., et al.: Frames: a corpus for adding memory to goal-oriented dialogue systems. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 207–219. Association for Computational Linguistics, Saarbrücken (2017). https://doi.org/10.18653/v1/W17-5526, https://aclanthology.org/W17-5526 9. Gritta, M., Lampouras, G., Iacobacci, I.: Conversation graph: data augmentation, training, and evaluation for non-deterministic dialogue management. Trans. Assoc. Comput. Linguist. 9, 36–52 (2021). https://doi.org/10.1162/tacl_a_00352 10. Gupta, M., Kulkarni, N., Chanda, R., Rayasam, A., Lipton, Z.C.: AmazonQA: a review-based question answering task. CoRR abs/1908.04364 (2019). http://arxiv. org/abs/1908.04364 11. Henderson, M., et al.: A repository of conversational datasets. CoRR abs/1904.06472 (2019). http://arxiv.org/abs/1904.06472 12. Henderson, M., Casanueva, I., Mrkšić, N., Su, P.H., Wen, T.H., Vulić, I.: ConveRT: efficient and accurate conversational representations from transformers, pp. 2161– 2174 (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.196 13. Jang, Y., et al.: Call for customized conversation: customized conversation grounding persona and knowledge. CoRR abs/2112.08619 (2021). https://arxiv.org/abs/ 2112.08619 14. Kelley, J.F.: An iterative design methodology for user-friendly natural language office information applications. ACM Trans. Inf. Syst. 2(1), 26–41 (1984). https:// doi.org/10.1145/357417.357420 15. Li, X., Panda, S., Liu, J., Gao, J.: Microsoft dialogue challenge: building end-toend task-completion dialogue systems. CoRR abs/1807.11125 (2018). http://arxiv. org/abs/1807.11125 16. Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: a manually labelled multi-turn dialogue dataset. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 986– 995. Asian Federation of Natural Language Processing, Taipei (2017). https:// aclanthology.org/I17-1099 17. Lison, P., Tiedemann, J.: OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 923–929. European Language Resources Association (ELRA), Portorož (2016). https://aclanthology.org/ L16-1147 18. Lowe, R., Pow, N., Serban, I., Pineau, J.: The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. CoRR abs/1506.08909 (2015). http://arxiv.org/abs/1506.08909
19. Nagovitsin, M., Kuznetsov, D.: DGAC: dialogue graph auto construction based on data with a regular structure. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 508– 529. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19032-2_52 20. Patlan, A.S., Tripathi, S., Korde, S.: A review of dialogue systems: from trained monkeys to stochastic parrots. CoRR abs/2111.01414 (2021). https://arxiv.org/ abs/2111.01414 21. Ramadan, O., Budzianowski, P., Gašić, M.: Large-scale multi-domain belief tracking with knowledge sharing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 432–437. Association for Computational Linguistics, Melbourne ( 2018). https://doi.org/10. 18653/v1/P18-2069, https://aclanthology.org/P18-2069 22. Ramage, D., Rafferty, A.N., Manning, C.D.: Random walks for text semantic similarity. In: Proceedings of the 2009 workshop on graph-based methods for natural language processing (TextGraphs-4), pp. 23–31 (2009) 23. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). https://arxiv.org/abs/1908.10084 24. Steinley, D.: K-means clustering: a half-century synthesis. Br. J. Math. Stat. Psychol. 59, 1–34 (2006). https://doi.org/10.1348/000711005X48266 25. Storks, S., Gao, Q., Chai, J.Y.: Commonsense reasoning for natural language understanding: a survey of benchmarks, resources, and approaches. CoRR abs/1904.01172 (2019). http://arxiv.org/abs/1904.01172 26. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 27. Williams, J.: A belief tracking challenge task for spoken dialog systems. In: NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012), pp. 23–24. Association for Computational Linguistics, Montréal (2012). https://aclanthology.org/W12-1812 28. Zang, X., Rastogi, A., Sunkara, S., Gupta, R., Zhang, J., Chen, J.: MultiWOZ 2.2: a dialogue dataset with additional annotation corrections and state tracking baselines. In: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020, pp. 109–117 (2020) 29. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: i have a dog, do you have pets too? CoRR abs/1801.07243 (2018). http://arxiv.org/abs/1801.07243 30. Zhang, Y., et al.: DialoGPT: large-scale generative pre-training for conversational response generation. CoRR abs/1911.00536 (2019). http://arxiv.org/abs/ 1911.00536
Research Methods for Fake News Detection in Bangla Text A. S. M. Humaun Kabir(B) , Alexander Alexandrovich Kharlamov, and Ilia Mikhailovich Voronkov Department of Intelligent Information Systems and Technologies, Moscow Institute of Physics and Technology, Dolgoprudny, Russia [email protected]
Abstract. This research work focuses on fake news classification in the Bangla language using deep neural networks and machine learning classification algorithms, processing the text data with prominent natural language processing techniques. Bangla is the fifth most spoken native language in the world, with approximately over 300 million native speakers and another 50 million second-language speakers. In this work, news collected from different online and print newspapers is classified into authentic and fake news classes. Using prominent natural language processing techniques and data preprocessing, and applying different deep neural networks and machine learning classification algorithms, a maximum of 81% accuracy is achieved for the fake minority class and 99% accuracy for classifying overall fake and authentic news.

Keywords: Fake News · Bangla · Natural Language Processing · Neural Networks · Low Resource Language
1 Introduction
News is a piece of highlighted information about events that take place around us. News is generally published through printed newspapers and broadcast on television or radio. Previously, the reach was lower due to limited transmission. Nowadays, alongside printed or broadcast news, news is shared on the internet, where it can persist for a lifetime and spread over the world in a second. This faster transmission of news has given a new dimension to the news world. Online versions of newspapers are crucial and popular among readers due to their easy accessibility. While this revolution has been happening, a major issue has arisen: faking or misleading information through news. Misleading news or fake information is generally tagged as fake news. Fake news has been an issue all over the world because of its impact on social, political, economic and diplomatic life. Defamation, communal violence, and biased elections are common negative impacts of fake news. The accessibility of technology has accelerated the spread of news over the world in seconds. In terms of information transmission this is a very good thing, but on the other side of the coin the news sometimes turns out to be fake, which strongly affects human life in different domains. The negative impact
of fake news can be reduced by flagging and categorizing it for mass readers, so that they can be aware and not be influenced by it.
Problem Statement. Many incidents have happened because of fake news, such as suicide, health risks, communal violence, political bias and economic imbalance. Fake news published by a few newspapers spreads faster over the internet because of its online accessibility and social media channels. The authenticity of news cannot be identified by a reader instantly due to a lack of data sources, and the news is manipulated with data in such a way that it seems to be real news. A news article can be flagged as fake or real for the betterment of the ethics of the newspaper, as well as for the readers, who are heavily impacted by fake news. To address this situation we need a lot of data from real newspapers and from news which is tagged as fake by incidents and has no existence in the real world. To address this issue, classifying fake news is a crucial domain of research and development.
2 Related Work
Researchers have come up with ways to reduce the fabrication and spread of fake news, which negatively impacts human life. Different approaches have been taken to identify fake news, detect it automatically, and filter it out from real information. The work has progressed well in the past few years; earlier works mostly used Support Vector Machines (SVM) [1] to detect fake or satirical news from the sources, with an accuracy greater than ninety percent. There are also works on Convolutional Neural Networks (CNN) and BiLSTM, which have good accuracy on datasets collected from different newspapers [2]. More recently, Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) classifiers were applied to detect Bangla fake news, with CountVectorizer and Term Frequency - Inverse Document Frequency Vectorizer used for feature extraction [3]. A few researchers presented a comprehensive review of the identification of fake news on social media and in newspapers, including social theories, existing algorithms from the perspective of data mining, evaluation metrics and dataset representation [4]. Fact checking and cross verification of information to detect fake news using CNN, along with the effectiveness of TI-CNN, has also been introduced [5]. A fake news dataset called BanFakeNews was recently created by a group of researchers; it contains around 50K unprocessed samples of authentic news and fake news [6]. BERT by Google provides a pretrained multilingual cased model which is trained for 104 languages [7]. The work in [8] focused on a naive Bayes classifier to detect fake news from data collected from Facebook and other social networks; the work avoided punctuation errors, and an overall accuracy of 74% was acquired. The researchers in [9] followed various machine learning techniques to detect fake news; the accuracy of different predictive models such as gradient boosting and support vector machines was measured, and an overall accuracy of 85% was acquired. In [10], the researchers took data from social media and proceeded with a naive Bayes classifier to classify fake news, but the accuracy was very low because of the unauthentic sources of data. In [11], machine learning methods were used to classify fake news; the authors tried naive Bayes, neural networks and support vector machines, and achieved the highest accuracy of 99% for support vector
machines on an English-language dataset. The work in [12] tried to classify fake news using n-grams and a support vector machine and achieved 92% accuracy. The mentioned works are mostly on English-language datasets; for Bangla there are only a few works, which we review below. Apart from the mentioned state-of-the-art works in fake news detection, we want to mention [13], which achieved an accuracy of 85% using a random forest classifier for Bangla fake news detection. The authors of [14] worked on the classification of Bangla fake news using a support vector machine (SVM) and achieved an accuracy of 73.02%; they used sentiment as a feature for Bangla text. The work in [3] used Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) for the classification of Bangla fake news and achieved an accuracy of 96.64% for SVM and 93.2% for MNB. Finally, [15] used a real-time rumors and news dataset from Kaggle and novelty-based characteristics to detect them, achieving an accuracy of 74.5%, though the authors mention that the unreliable source of the data is the reason for the low accuracy.
3 Methodology
In this work, to classify fake news for the Bangla language, neural networks and machine learning classification models are trained following natural language processing techniques. The data preprocessing is done with natural language processing techniques so that the data fed into the models yield better performance. In [16], researchers found that Fasttext word embeddings work better for non-English languages than Word2Vec; taking into account that Bangla is a non-English language, Fasttext is considered. In word embeddings, words are represented by a real-valued vector with hundreds of dimensions. As a dataset, a hybrid dataset is assembled from different sources of data; mainly the BanFakeNews [6] dataset is used for training the models. The validation and testing datasets are completely different from the training dataset, as they are split and processed separately. Also, from the collected open-source datasets, several files containing both fake news and real news data were merged and shuffled to perform the training, validation and testing. The concatenated dataset contains the columns (articleID, domain, date, category, headline, content, label), but during training all columns except content and label were omitted. As part of the preprocessing, punctuation was removed. Punctuation marks, such as the comma and semicolon, help to elicit the correct meaning of a sentence, but in terms of lexical and sentiment use they have no important role here. Tokenization of the Bangla text has been performed with the assistance of keras_preprocessing and the Bangla Natural Language Processing Toolkit [17], also known as the BNLP Toolkit. Sequencing has also been performed with the help of keras_preprocessing, and post padding is applied, which chops off longer sentences from the end (Figs. 1 and 2). Deep neural networks and machine learning classification models are used to classify fake news in the large dataset containing real news and fake news. Supervised learning is followed throughout the work. Figure 3 shows how the work has been done step by step. Collecting a dataset for this task is an important issue, as Bangla is a low-resource language and it is hard to find datasets for it. Along with that, processing the dataset is crucial, as most of the available resources are not up to the mark for non-English languages.
Fig. 1. SkipGram + Subword Information used in Fasttext for Predicting Target Word where n = 2.
Fig. 2. SkipGram Architecture for Predicting Target Word in Word2Vec.
Fig. 3. Process of Fake News Detection of Bangla Newspaper Data.
Considering programming languages, libraries and pretrained models, everything is targeted mostly at English and only later progresses to non-English languages. After the preprocessing is done, the models are implemented, trained, validated and tested to obtain the expected results.
Machine Learning Classification Models. 1. Support Vector Machine (SVM), 2. Logistic Regression (LR), 3. Random Forest (RF), 4. Decision Tree (DT), 5. K-Nearest Neighbours (KNN), 6. Gradient Boosting Classifier (GBC), 7. Gaussian Naive Bayes (GNB).
Neural Network Models. 1. Convolutional Neural Network (CNN), 2. Long Short-Term Memory (LSTM), 3. Bidirectional Encoder Representations from Transformers (BERT). In the case of BERT, a pretrained model for Bangla is fine-tuned [18]. State-of-the-art works and prominent literature reviews are applied to the proposed solution for Bangla natural language processing. Given that Bangla is a low-resource language, Bangla natural language processing techniques are performed to classify the fake news, and the models are trained, tested and validated.
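A minimal sketch of the neural pipeline described above (Keras tokenization, post padding, and an LSTM classifier) is shown below. The vocabulary size, sequence length, layer sizes and number of epochs are illustrative assumptions, not the values used in this work.

```python
# Sketch: Keras tokenization + post padding + LSTM for fake/authentic classification.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def train_lstm_classifier(texts, labels, vocab_size=50000, max_len=300):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(texts)
    seqs = tokenizer.texts_to_sequences(texts)
    # Post padding/truncation: longer sequences are chopped off from the end.
    x = pad_sequences(seqs, maxlen=max_len, padding="post", truncating="post")
    y = np.asarray(labels)  # 1 = fake, 0 = authentic

    model = Sequential([
        Embedding(vocab_size, 128),
        LSTM(64),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=3, validation_split=0.1)
    return tokenizer, model
```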
4 Result Analysis
In this work, to classify fake news for the Bangla language, neural networks and machine learning classification models are used, following natural language processing techniques. Micro-F1 scores are used to evaluate the models. We report the precision (P), recall (R), and F1 score for the minority class (fake), because the dataset is imbalanced. Precision, also known as positive predictive value, is the fraction of relevant instances among the retrieved instances, and recall, also known as sensitivity, is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.

Precision = True Positive / Total Predicted Positive

Recall = True Positive / Total Actual Positive

In the statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Research Methods for Fake News Detection in Bangla Text
59
Table 1. Result of different models Model name
Precision (P)
Recall (R)
F1 Score (Fake Minor Class)
F1 Score
CNN
0.98
1.00
0.44
0.99
LSTM
0.98
1.00
0.65
0.99
BERT
0.99
0.99
0.81
0.99
SVM
0.98
1.00
0.54
0.99
LR
0.97
1.00
0.06
0.99
RF
0.98
1.00
0.54
0.99
DT
0.99
0.98
0.48
0.99
KNN
0.98
1.00
0.50
0.99
GBC
0.99
0.96
0.31
0.97
GNB
1.00
0.82
0.20
0.90
5 Conclusions
In this era of frequent transmission of information, this work tried to classify fake news within the vast amount of news information online. As the accessibility of the internet increases day by day, so does the manipulation of information. Manipulated information harms the social, economic, political and diplomatic ground overall. If fake news is classified and detected for the audience of online newspapers, its negative impact will be reduced. This work focused on news data from various sources, followed the prominent literature, and achieves good accuracy in classifying fake news with the help of different deep learning and machine learning models, following natural language processing (NLP) techniques. After all the work was done, a decent accuracy was achieved with the different models. As the fake class is the minority and the data is imbalanced, in future work more fake-labeled news will be introduced, along with collaboration, to make the dataset balanced. A consistent model architecture, deployment of the models to verify fake news, and a fake news verification system for end users will also be introduced later on as an extension of this work.
References 1. Rubin, V., Conroy, N., Chen, Y., Cornwell, S.: Fake news or truth? using satirical cues to detect potentially misleading news. In: Proceedings of the Second Workshop on Computational Approaches to Deception Detection, pp. 7–17 (2016). https://doi.org/10.18653/v1/W16-0802 2. Hossain, E., Kaysar, M., Jalal Uddin Joy, A.Z., Rahman, Md.M., Rahman, M.: A Study Towards Bangla Fake News Detection Using Machine Learning and Deep Learning, pp. 79–95 (2021). https://doi.org/10.1007/978-981-16-5157-1_7 3. Hussain, M.G., Hasan, M.R., Rahman, M., Protim, J., Hasan, S.A.: Detection of Bangla Fake News using MNB and SVM Classifier (2020) 4. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake News Detection on Social Media: A Data Mining Perspective (2017). arXiv. https://doi.org/10.48550/ARXIV.1708.01967
5. Yang, Y., Zheng, L., Zhang, J., Cui, Q., Li, Z., Yu, P.S.: TI-CNN: Convolutional Neural Networks for Fake News Detection (2023) 6. Hossain, M.Z., Rahman, M.A., Islam, M.S., Kar, S.: BanFakeNews: A Dataset for Detecting Fake News in Bangla (2020) 7. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019) 8. Gilda, S.: Notice of Violation of IEEE Publication Principles: Evaluating machine learning algorithms for fake news detection. In: 2017 IEEE 15th Student Conference on Research and Development (SCOReD), pp. 110–115 (2017). https://doi.org/10.1109/SCORED.2017.830 5411 9. Jain, A., Kasbe, A.: Fake News Detection. 2018 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), 1–5 (2018). https://doi.org/10. 1109/SCEECS.2018.8546944 10. Yumeng, Q., Wurzer, D., Cunchen, T.: Predicting Future Rumours. In: Chinese Journal of Electronics, Vol. 27, p. 514 (2018). https://doi.org/10.1049/cje.2018.03.008 11. Kaur, P., et al.: Hybrid text classification method for fake news detection. Int. J. Eng. Adv. Technol. (IJEAT) 2388–2392 (2019) 12. Looijenga, M.S.: The detection of fake messages using machine learning. 29 Twente Student Conference on IT, Jun. 6th, 2018, Enschede, The Netherlands. Netherlands: essay.utwente.nl. (2018) 13. Islam, F., et al.: Bengali fake news detection. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 281–287. Varna, Bulgaria (2020). https://doi.org/10.1109/IS4 8319.2020.9199931 14. Tohabar, M., Nasrah, N. Mohammed, A.: Bengali Fake News Detection Using Machine Learning and Effectiveness of Sentiment as a Feature, pp. 1–8 (Aug 2021). https://doi.org/ 10.1109/ICIEVicIVPR52578.2021.9564138 15. Khanam, Z.: Analyzing refactoring trends and practices in the software industry. Int. J. Adv. Res. Comput. Sci. 10, 0976–5697 (2018) 16. Gromann, D., Declerck, T.: Comparing Pretrained Multilingual Word Embeddings on an Ontology Alignment Task (2018 May). [Online]. Available: https://aclanthology.org/L181034 17. Sarker, S.: BNLP: Natural language processing toolkit for Bengali language (2021) 18. Sarker, S.: BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding (2020). https://github.com/sagorbrur/bangla-bert
On the Question of the Dynamic Theory of Intelligence Yuriy T. Kaganov(B) Bauman Moscow State Technical University, Moscow, Russia [email protected]
Abstract. The paper considers an approach based on nonlinear and symbolic dynamics for the analysis of cognitive processes. The formation of semantic structures as a result of self-organization processes in complex nonlinear systems is investigated. To implement the proposed approach, the theory of metagraphs and granular computations is used. The possibility of using the proposed approach for the study of cognitive processes of intelligent systems and the further development of artificial intelligence systems is shown. Keywords: Nonlinear Dynamic Systems · Metagraph Approach · Cognitive Processes · Artificial Intelligence
1 Introduction
In the 1970s, a new paradigm of science was formed, which took shape in a scientific concept based on the notion of the complexity of systems. This concept was based on the ideas of self-organization of complex nonlinear dynamical systems [1–5]. One of the main problems of the theory of self-organization of complex systems is the problem of information formation and the emergence of meaning [6, 7]. This formulation of the problem translates it into a plane related to semiotics [8]. The possibility of such a transition is due to the development of symbolic dynamics methods, which are a powerful tool for analyzing complex nonlinear dynamical systems [9–11]. Further development of this approach was associated with the use of the concept of metagraphs and granular calculations [12–14]. This allowed us to analyze the dynamic processes that supposedly determine the activity of the brain and the emergence of consciousness [15, 16]. Further research in this direction can give a new impetus to the theory of artificial intelligence.
2 Basic Concepts and Models of Complex Systems A complex system is a multi-level system of interacting dynamic subsystems, each of which has a certain autonomy and at the same time an integral system is not reducible to the totality of its subsystems. Such a system can be represented as a system of autonomous agents, usually having some expediency of behavior. The expediency of the behavior of a complex system is manifested in the ability of the system to maintain its qualitative © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 61–71, 2023. https://doi.org/10.1007/978-3-031-44865-2_7
certainty. This property determines the main goal of the system, which, in turn, forms the private goals of the subsystems. The main features of the organization of a complex system are the following:
• Holonic (holarchic) organization of the structures of a complex system and the relative autonomy of each level of the corresponding structure. At the same time, each level of the system, being autonomous, exhibits the properties of a part of the whole.
• Complex nonlinear dynamics of interaction of subsystems at each homogeneous structural level.
• Emergence of new properties of subsystems at each higher level.
3 Dynamic System
A very important characteristic of complex systems is the dynamics of their behavior. In this regard, the basic concepts associated with a nonlinear dynamical system can be defined as follows. Attractor (attracting set). An attractor is a limiting set (a singular point, a cycle, a torus, etc.) such that the set of all trajectories asymptotically approaching it is open. Bifurcation (branching of the solution of a dynamic equation). As the parameters of a dynamic system change, its attractors may change as well: variations of some parameters can cause qualitative changes and a significant restructuring of the attractors. For example, a stable focus can be replaced by a limit cycle. Such parameter values are called bifurcation values, and the restructuring itself is a bifurcation. Stability and the Boundary Between Regular and Chaotic Dynamics. The criterion separating regular dynamics from dynamic chaos is the stability of the system to small perturbations. This approach leads to defining chaotic behavior through the sensitive dependence of the system on the initial conditions and to using Lyapunov exponents and entropy as criteria of dynamic chaos. A dynamic system can be described by a system of ordinary differential equations (in this case, the process described by such a system is called a flow):

dx_i/dt = f_i(x_1, x_2, ..., x_N; μ_1, μ_2, ..., μ_k).   (1)
A discrete map can also be used:

x^i_{n+1} = f_i(x^1_n, x^2_n, ..., x^N_n; μ_1, ..., μ_k).   (2)
This process is called a cascade. Here x_i(t) or x^i_n are the variables that determine the state of the system (the phase coordinates of the dynamic system); f_i(x, μ) are, in general, nonlinear functions; μ are the parameters of the dynamic system. According to Cauchy's theorem, the solution of a dynamical system exists for given initial conditions x_i(0) = x^0_i and smoothly
depends on the initial state. For the study of dynamical systems, the mathematical apparatus proposed at the end of the XIX century by the great mathematician and thinker H. Poincaré proved to be extremely effective. His idea was to use the so-called sequence (return) functions to study the phase trajectories of a dynamic system. First of all, this apparatus made it possible to study dynamical systems from the point of view of the existence of attractors. The sequence function is used to study closed trajectories: the set of points S is the sequence of points at which the phase trajectory of a dynamical system in N-dimensional space crosses an (N-1)-dimensional Poincaré hypersurface as the trajectory evolves in time. For a limit cycle, the set of intersections converges to a point (Fig. 1).
Fig. 1. The sequence function
The study of the dynamics of dissipative systems using the sequence function makes it possible to identify bifurcation processes of transition from one attractor to another. Depending on the type of attractor, a dynamic system can exhibit either a regular mode of motion or a chaotic one, characteristic of strange attractors. Let us consider this process for a dynamical system described by the ordinary differential equations dx_i/dt = f_i(x_1, x_2, ..., x_N; μ_1, μ_2, ..., μ_k) or, in general, by the equation ẋ = f(x, μ)
(3)
[9, 10]. The control parameter μ determines the possibility of transition from one attractor to another (Fig. 2). Each attractor can be interpreted as a certain element of information. Therefore, when the vector of parameters μ changes, a transition from one attractor to another (i.e., a bifurcation) is possible, and new elements of information can thus be obtained. In particular, the E. Lorenz attractor can be cited as an example of a strange attractor to which a dynamic system consisting of only three nonlinear ODEs converges when the parameter μ changes.
Fig. 2. Bifurcation of the system of differential equations dx/dt = f(x, μ*): the panels show the potential U(x, μ) on either side of the critical value μ* of the parameter.
E. Lorenz equations:

ẋ = σ(y − x),
ẏ = x(r − z) − y,   (4)
ż = xy − bz.

For the parameter values (vector μ) σ = 10, r = 28, b = 8/3, the system of equations describes a strange attractor with two main properties (Fig. 3):
• fractal structure of the phase space;
• stochasticity and unpredictability of the behavior of the phase trajectory.
Fig. 3. The Lorenz attractor
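As a simple numerical illustration (not part of the original text), the Lorenz system (4) can be integrated with the parameter values given above; the Python sketch below uses SciPy and also demonstrates the sensitive dependence on initial conditions mentioned earlier.

```python
# Minimal sketch: numerical integration of the Lorenz system (4) with
# sigma = 10, r = 28, b = 8/3 (the strange-attractor regime).
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (r - z) - y, x * y - b * z]

t_span = (0.0, 50.0)
t_eval = np.linspace(*t_span, 20000)
sol = solve_ivp(lorenz, t_span, [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-8)

# Sensitive dependence on initial conditions: a tiny perturbation of the
# initial state produces a trajectory that diverges from the first one.
sol2 = solve_ivp(lorenz, t_span, [1.0 + 1e-6, 1.0, 1.0], t_eval=t_eval, rtol=1e-8)
print("final separation:", np.linalg.norm(sol.y[:, -1] - sol2.y[:, -1]))
```

Projecting the computed trajectory onto any pair of coordinates reproduces the butterfly-shaped attractor of Fig. 3.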
4 Symbolic Dynamics
Further development of research in nonlinear dynamics led to tools that made it possible to study the behavior of complex dynamical systems in more detail. Such tools turned out to be the methods of symbolic dynamics [11]. In 1935, G. Birkhoff applied symbolic dynamics for the first time to encode trajectories near a homoclinic orbit. S. Smale applied the same technique in constructing the so-called "horseshoe", a simple model of chaotic dynamics. Smale's horseshoe had a significant impact on chaos theory, since this example is typical, and the methods of symbolic dynamics turned out to be the tool that allows one to describe the nature of deterministic chaos [12].
• The main idea of the method is to describe the dynamics of the system using admissible sequences of characters (admissible words) from a finite set of characters (an alphabet).
• The set of possible states of the system (the phase space) is divided into a finite number of cells. Each cell corresponds to a "symbol", and at each moment of time the symbol corresponding to the current state of the system is produced.
Thus, the phase trajectory can be represented as a sequence of symbols, and the attractor as a certain word in the corresponding semiotic system (Fig. 4 and Fig. 5). In the works of G.S. Osipenko [13, 14], it is shown how the transition from a symbolic image to a weighted graph mapping is possible. This approach allows us to move from analyzing the nature of attractors based on Lyapunov exponents to a discrete graph representation of a complex dynamical system. The main idea of this approach is as follows. The phase space of a dynamical system is divided into cells M_i, and an isomorphism is established between the cells of the phase space and the vertices v_i of a planar multigraph G = (V, E), v_i ∈ V, e_jk ∈ E, where the vertices belong to the set of vertices V and the edges e_jk belong to the set of directed edges E. Vertices j and k are connected by an oriented edge e_jk only if the trajectory of the system in phase space admits a transition from cell j to cell k. Thus, in the monograph [13] a two-level model is proposed and an isomorphism between the upper and lower
Fig. 4. Sequence of Poincaré mappings
Fig. 5. Graph mapping of symbolic dynamics
levels is given. The lower level corresponds to the phase space of the dynamical system, and the upper level is a planar multigraph G, which contains information about possible transitions between the cells of the phase space of the dynamical system. The dynamic system under consideration may be hierarchical; such systems are increasingly attracting the attention of researchers, and the term "hierarchical symbolic dynamics" is used to denote this approach [15]. The dynamic system under consideration may carry several categorical features, which requires the use of several graphs or a more complex graph model that can include several graphs. The construction of a graph model is based on the following considerations. Hypothesis. The shape of the phase trajectory adequately reflects (reveals) the key internal properties of nonlinear dynamical systems. Definition. The shape of the trajectory of a dynamical system is understood as the set of its invariants that persist under homogeneous and smooth shifts and dilations. The coding of the shape of the trajectories of dynamic systems can be represented as follows. Formation of a symbolic image of a dynamic system: an oriented graph G_aφ = (V_aφ, E_aφ) is constructed, where V_aφ are the vertices of the graph G_aφ (symbols of states) and E_aφ are the edges of the graph G_aφ (transitions between states). Statement. The symbolic G_aφ-image reflects the global structure of the dynamical system {f_k(x, μ), k ∈ K ⊂ Z} [13]. There is a correspondence between the trajectories of the system and the paths on the symbolic G_aφ-image. Thus, it is possible to relate the topological properties of the energy, i.e. of the Hamiltonian H(q, p), to the symbolic image represented by the oriented graph G_aφ (Fig. 6).
Fig. 6. Topology of the energy surface
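To make the construction of a symbolic image concrete, the following Python sketch (an illustration added here, not taken from [13]; the uniform grid of cells and the cell count are assumptions) codes a sampled trajectory as a sequence of cell symbols and collects the admissible transitions as directed edges.

```python
# Minimal sketch: building a symbolic image of a dynamical system. The phase
# space is covered by a uniform grid of cells; each cell is a "symbol", and
# observed cell-to-cell transitions of the trajectory become directed edges.
import numpy as np

def cell_index(point, lo, hi, n_cells):
    """Index of the grid cell containing a phase-space point."""
    ratio = (point - lo) / (hi - lo)
    idx = np.clip((ratio * n_cells).astype(int), 0, n_cells - 1)
    return tuple(idx)

def symbolic_image(trajectory, n_cells=10):
    lo, hi = trajectory.min(axis=0), trajectory.max(axis=0)
    symbols = [cell_index(p, lo, hi, n_cells) for p in trajectory]
    # Directed edges (j, k): the trajectory passes from cell j to cell k.
    edges = {(symbols[t], symbols[t + 1])
             for t in range(len(symbols) - 1) if symbols[t] != symbols[t + 1]}
    return symbols, edges

# Example usage with any sampled trajectory, e.g. sol.y.T from the Lorenz sketch:
# symbols, edges = symbolic_image(sol.y.T, n_cells=12)
# print(len(set(symbols)), "visited cells,", len(edges), "admissible transitions")
```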
5 Metagraph Models and Information
The next stage of formalization is the transition to a metagraph model that admits emergence, which allows us to describe the appearance of new levels of information formation. A metagraph is a holonically organized graph

MG = ⟨V, MV, E⟩,   (5)

where V is the set of vertices, MV is the set of metavertices, and E is the set of edges.
A vertex of the metagraph is characterized by a set of attributes: v_i = {atr_k}, v_i ∈ V, where v_i is a vertex of the metagraph and atr_k is an attribute. An edge of the metagraph is characterized by a set of attributes, a source vertex and a destination vertex: e_i = ⟨v_S, v_E, {atr_k}⟩, e_i ∈ E, where e_i is an edge of the metagraph, v_S is the initial vertex (metavertex) of the edge, v_E is the final vertex (metavertex) of the edge, and atr_k is an attribute (Fig. 7). Such a representation makes it possible to consider dynamic processes from the point of view of information formation. According to the definition of D.S. Chernavsky [7], information is a memorized choice of one option from several possible and equivalent ones. The process of memorization can be interpreted as the formation of attractors in the course of the functioning of a dynamic system. The amount of information in the sense of K. Shannon can be obtained from the famous formula for a message containing N characters:

I_N = −N Σ_{i=1}^{M} p_i log_2 p_i,   (6)
Fig. 7. Metagraph model
where M is the number of letters in the alphabet and p_i is the probability of occurrence of the i-th character in the message. In turn, information is closely related to the purposefulness of the behavior of a dynamic system. Such a goal may be a homeostatic state that allows the system to maintain its qualitative certainty when interacting with the external environment, that is, the formation of a global attractor, which in turn governs the formation of attractors of lower hierarchical levels. Information related to the purposefulness of behavior can be evaluated by introducing semantic information. Semantic information is closely related to the concept of a goal. This concept was first introduced by A.A. Harkevich, who connected the value of information with the purpose of activity, suggesting that the Shannon entropy measure be considered as a measure of the probability of hitting the target, i.e. as a measure of purposefulness. In turn, Yu.A. Shreider proposed to build a theory of semantic information on the concept of diversity rather than on the removal of uncertainty, and in particular on such a property of information as the dependence of the received information on the a priori information. This can be expressed as follows:

V = log_2(P/q),   (7)
where P is the probability of achieving the goal after receiving the information, q is the probability of achieving the goal before receiving it, and V is the value of the received information with respect to achieving the goal. The a priori probability q depends on the total amount of information, i.e. q = 2^I. The a posteriori probability P is determined by how much closer the system gets to the goal; it can be either larger or smaller than q. The Kullback-Leibler divergence can be interpreted in the same sense:

D_KL(p, q) = Σ_{i=1}^{N} p_i log(p_i / q_i).   (8)

This quantity is often used as an error estimate in the theory of artificial neural networks. The purposefulness of complex dynamic systems can be represented as a hierarchy of goals.
This is especially evident in biological systems. The following system of hierarchical levels of purposefulness can be formulated:
1) Survival of the species: formation and fixation of goals at the genetic level.
2) Survival of the organism (individual, agent): meeting the needs for nutrition, reproduction and adaptation to the external environment, and formation of the corresponding goals.
3) Preservation of homeostasis and learning as a means of improving interaction with the external environment.
4) Formation of particular goals that arise in the course of the vital activity of the organism.
From the general point of view of dynamic systems, such a hierarchical system of purposefulness can be interpreted as follows. Attractors are stable structures that can be interpreted as words built from the symbols that make up the phase space. Their dependence on the influence of the external environment is determined by changes in the parameters of the dynamic system, which entails bifurcation processes and the formation of new attractors. This approach makes it possible to understand the essence of the processes of information formation and the emergence of semantics. The hierarchy of goals that shape dynamic processes defines the hierarchy of different types and levels of information. The goal is a functional that determines how much the picture of the world held by an intelligent system differs from the real state of the world. It can be represented as a hypersurface with many extrema, and it can change in accordance with changes in the environment. The upper-level attractor is formed as a stable correspondence (a minimum of the objective function) between the real world and the worldview of the intelligent system. The search for such a minimum can be carried out on the basis of a stochastic gradient, which takes into account not all variables but only a partial set of them. In this way the problem of the emergence of semantics (i.e. meaning) is addressed. Semantics arises when the impact of the external environment is assessed by the system on a certain scale determined by the goal [19]. As a result, a system of attractors is formed, which determines the behavior of the system. In turn, the attractors shape the system's response to external influences, which characterizes its pragmatics. The hierarchical structure of the attractors forms, in turn, a system of levels of languages consisting of various types of symbols. Each level has its own symbolism, determined by the environment in which the system of attractors is immersed and, accordingly, by the semantics determined by this environment. Interaction between languages occurs through the formation of a hierarchy of structures of the attractor system. The structure of the attractors at the highest level determines the nature of the future reactions necessary for an adequate response to the impact of the external environment. This is the meaning of constructing a "picture of the world" in a complex nonlinear dynamic system. It is done by creating top-level language structures that reflect, on the one hand, the properties of the environment most important for the system and, on the other hand, allow the system to form the reactions necessary for its "survival" in the environment. The concept of information can be related to the concept of granules, introduced by L. Zadeh in 1996.
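To make the information measures (6)-(8) concrete, a minimal numerical sketch is given below; the example probability distributions are arbitrary and chosen only for illustration.

```python
# Minimal sketch: Shannon information (6), value of information (7) and the
# Kullback-Leibler divergence (8). The example distributions are arbitrary.
import numpy as np

def shannon_information(p, n_chars):
    """I_N = -N * sum_i p_i * log2(p_i) for a message of N characters."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -n_chars * np.sum(p * np.log2(p))

def information_value(p_after, q_before):
    """V = log2(P / q): value of information with respect to a goal, Eq. (7)."""
    return np.log2(p_after / q_before)

def kl_divergence(p, q):
    """D_KL(p, q) = sum_i p_i * log(p_i / q_i), Eq. (8)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]        # probabilities of the alphabet characters (example)
q = [1 / 3, 1 / 3, 1 / 3]    # a priori distribution (example)
print(shannon_information(p, n_chars=100))  # information in a 100-character message
print(information_value(0.8, 0.2))          # goal becomes more reachable -> V > 0
print(kl_divergence(p, q))
```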
6 Conclusions
One of the most exciting problems of our time is understanding the processes of cognition. It has deep roots both in philosophy and in scientific discourse. Cognitive processes are currently becoming a central topic of neurophysiology, psychology and the theory of artificial intelligence. Modern systems based on traditional methods of the connectionist theory of artificial neural networks have risen rapidly in the last 4-5 years. Convolutional neural networks have made significant progress in image recognition systems that surpass human capabilities. Recurrent neural networks have made adequate translation from one language to another possible. Attention mechanisms and transformers have brought even more significant success to artificial neural networks, and deep neural networks such as BERT and GPT built on them have become a revolution in this field. Deep neural networks based on reinforcement learning have become another area that has transformed artificial intelligence methods. However, it is becoming increasingly clear that the actual work of the brain is based on somewhat different principles [21, 22]. The brain works differently from the models of modern neural networks. First, the brain is a complex dynamic hierarchical system [23]. Second, each level of this system is a relatively autonomous semiotic structure [24]. Third, the dynamic structures that arise as a phenomenon of cognitive processes are strange attractors with a fractal structure; as a result, the corresponding cognitive structures are fuzzy [23]. Fourth, at each level its own objective function is formed, which determines the corresponding level of semantics.
7 Definition of the Concept of "Intelligence"
Intelligence can be defined as the ability of highly organized systems to build models of the surrounding world, to predict future events on this basis (to create models of the future) taking into account the hierarchy of goals that these systems generate, and, in turn, to form an adequate response to these events. A highly organized system is understood as a biological system of any level (from a living cell to a multicellular organism), as well as an artificial system possessing these properties.
References
1. Haken, H.: Synergetics. Introduction. Nonequilibrium phase transitions and self-organization in physics, chemistry and biology, p. 375. Springer (1977) 2. Haken, H.: Advanced synergetics. Instability of hierarchies of self-organizing systems and devices, p. 384. Springer (1983) 3. Prigogine, I.: Introduction to the thermodynamics of irreversible processes, p. 254. John Wiley, New York (1967) 4. Nicolis, G., Prigogine, I.: Self-organization in nonequilibrium systems. From dissipative structures to order through fluctuations, p. 520. John Wiley, New York (1977)
5. Eigen, M., Schuster, P.: Hypercycle. The principle of natural self-organization, p. 275. Springer (1977) 6. Volkenstein, M.V.: Entropy and information. – M.: Nauka, p. 192 (1986) 7. Chernavsky, D.S.: Synergetics and information. Dynamic information theory. - M.: Editorial URSS, p. 288 (2004) 8. Haken, H., Portugali, J.: Information adaptation: the interaction of Shannon and semantic information in cognition. Springer, Berlin, Germany (2015) 9. Mainzer, K.: Complex system thinking. Matter, mind, humanity. Novy sintez. – M.: Book House “LIBROCOM”, p. 464 (2009) 10. Loskutov, A.Yu., Mikhailov, A.S.: Fundamentals of the theory of complex systems. – M.Izhevsk: SIC RCD, p. 620 (2007) 11. Williams, S.: Symbolic dynamics and its applications. American Mathematical Society (2002) 12. Bowen, R.: Methods of symbolic Dynamics. M.: Mir (1979) 13. Osipenko, G.: Dynamical systems, graphs and algorithms. Springer-Verlag (2007) 14. Osipenko, G.S.: Evaluation of Lyapunov exponents by symbolic analysis methods. Dynamic Systems 6(34), 15–35 (2016) 15. Akintayo, A., Sarkar, S.: Hierarchical symbolic dynamic filtering of streaming non-stationary time series data (2017). http://arxiv.org/pdf/1702.01811v1 16. Johnson, J.: Hypernetwork in the science of complex systems. Imperial College Publishing House. London, p. 330 (2013) 17. Basu, A., Blanning, R.: Metagraphs and their applications, p. 18. Springer, New York (2007) 18. Chernenky, V.M., Gapanyuk, Yu.E., Revunkov, G.I., Terekhov, V.I., Kaganov, Yu.T.: Metagraph approach for the description of hybrid intelligent information systems. Applied Informatics 3(69), 12 (2017) 19. Nikolis, J.S.: Dynamics of hierarchical systems. Springer Verlag, An evolutionary approach (1986) 20. Tarasov, V., Kaganov, Y., Gapanyuk, Y.: A metagraph model for complex networks: definition, calculus, and granularity problems. In: Proceedings of the XIX International Conference on Data Analysis and Management in Data-Intensive Areas (DAMDID/RCDL 2017), pp. 342– 349. Moscow, Russia (2017) 21. Hawkins, J., Blakesley, S.: About intelligence – M.: I.D. Williams LLC, p. 240 (2007) 22. Hawkins, J.C.: A Thousand Brains. A new theory of intelligence, p. 243. Basic Books, New York (2021) 23. Kaganov, Y.T., Gapanyuk. Y.E.: Nonlinear dynamics and the origin of cognitive processes of intelligent systems. Collection of articles of the First International Scientific and Practical Conference “Bionics-60 years. Results and prospects”, pp. 41–52 24. Bodyakin, V.I.: Neurosemantics. Information and control systems. Artificial intelligence: Scientific works/Comp. M.Yu. Lednev. – M.: Academic project; Mir Foundation, p. 805 (2020)
Offline Deep Reinforcement Learning for Robotic Arm Control in the ManiSkill Environment
Huzhenyu Zhang¹ and Dmitry Yudin¹,²
¹ Moscow Institute of Physics and Technology, Moscow, Russia
² AIRI (Artificial Intelligence Research Institute), Moscow, Russia
[email protected], [email protected]
Abstract. Offline reinforcement learning (Offline RL) has been widely used in robot control tasks, while online reinforcement learning is often avoided due to its high cost of interaction with the environment. In this situation, Behavior Cloning (BC), an approach of Imitation Learning (IL), is often considered a suitable choice, since the agent can learn from an offline dataset. In this study, we propose an intuitive approach in which a Proximal Policy Optimization (PPO) loss is added as a correction term to the BC loss. The models are trained on a static dataset for four different robotic arm control tasks given by the ManiSkill environment. We present a comparison of the proposed approach with other existing Offline RL algorithms.
Keywords: offline reinforcement learning · robot manipulator control · ManiSkill simulator

1
Introduction
Reinforcement learning (RL) is a field of machine learning in which an agent interacts with the environment following a policy. This paradigm can be extremely costly in environments where online interaction is impractical because of the cost or the dangers of data collection. In the Offline RL setting, by contrast, the agent cannot interact with the environment; it can only learn from a static dataset collected by a behavior policy [1]. As an off-policy algorithm, PPO uses importance sampling to update the policy from data collected by the old policy, which raises the problem of out-of-distribution (OOD) actions. Trust Region Policy Optimization (TRPO) [2] was the first to guarantee monotonic improvement of performance, and Behavior Proximal Policy Optimization (BPPO) [3] can improve the behavior policy with a monotonic performance guarantee using an offline dataset. Simply put, monotonic performance guarantees make the new policy consistently better during the policy iteration process. Inspired by these works, we introduce a new reinforcement learning
method combining BC [4] and PPO to perform complex manipulator tasks in the ManiSkill [15] simulation environment. From the perspective of robot control, this is an end-to-end method, with the observed point cloud as input and the control signals of each joint as output. We conduct experiments in the ManiSkill simulation environment. In summary, our study makes the following main contributions:
– We propose a novel offline reinforcement learning method, which combines the loss functions of PPO and BC for training various deep neural networks: PointNet, a Transformer-based model, and multilayer perceptrons.
– We study the performance of the proposed approach for solving manipulation tasks in the modern ManiSkill simulator and compare it with other existing RL algorithms.
2
Related Work
Offline Reinforcement Learning. Offline RL, as an RL method that does not need to interact with the environment, has demonstrated powerful capabilities in the field of robot control in the past few years [6–8]. Since Offline RL is a technique for learning on a fixed dataset, many problems arise in the learning process, such as unseen state-action pairs and distribution shift. Different datasets have different characteristics, compositions, and coverage, which can make the performance of even the same algorithm very different [9]. The article [10] proposed a measure of dataset quality and found that the more diverse the types of states contained in the dataset, the better the dataset. Since RL is essentially a sequential decision-making problem, Transformer-based methods are increasingly combined with Offline RL [11,12]. Action generation using offline reinforcement learning based on memory transformers is discussed in [13].
Imitation Learning. Imitation Learning is a special kind of Offline RL, which aims to learn from demonstrations {τ1, τ2, ..., τm}. Each demonstration is a sequence of transitions (state-action pairs) generated by experts in the target task [14]. An RL trajectory is τi = ⟨s1, a1, r1, s2, a2, r2, ...⟩, while manipulator control can be regarded as a partially observable Markov decision process (POMDP) problem. The ManiSkill environment uses a point cloud or an RGB-D image as the observation of the manipulator instead of the state, so one POMDP trajectory can be expressed as τi = ⟨o1, a1, r1, o2, a2, r2, ...⟩. We filter out the reward part of the trajectory to obtain a dataset of observation-action pairs D = {(o1, a1), (o2, a2), (o3, a3), ...}; IL training then simply samples from this dataset and computes the MSE loss for predicting actions.
Manipulation Benchmarks. There are already many out-of-the-box manipulation simulation benchmarks, such as ManiSkill-Bench [15], VIMA-Bench [5], and others that are not limited to robotic arm control. VIMA-Bench provides
17 meta tabletop tasks and a large number of object shapes and textures. It also provides a massive imitation dataset with 650K successful trajectories and multimodal prompts for learning general robot manipulation [5]. ManiSkill-Bench releases a simulation environment for four tasks (OpenCabinetDoor, OpenCabinetDrawer, PushChair, and MoveBucket) and three different observation modes: manipulator state, RGB-D, and point cloud. ManiSkill also provides a large number of high-quality demonstrations (about 36,000 successful trajectories, about 1.5M point cloud/RGB-D frames in total) [15], together with several corresponding baseline algorithms. Such object manipulation tasks can also be solved by modular analytical approaches that recognize the pose of objects and build an optimal movement plan [16]. VIMA provides a framework that uses the cross-attention mechanism to learn from multimodal prompts. ManiSkill provides a learning-from-demonstrations (LfD) baseline (Fig. 1).
Fig. 1. The manipulator in the ManiSkill simulator performs the tasks OpenCabinetDrawer, OpenCabinetDoor, PushChair and MoveBucket (rows from top to bottom).
3
Method
We combine the PPO algorithm with the BC algorithm and compare the result with other reinforcement learning baselines. PPO is a policy-based reinforcement learning algorithm that optimizes the policy directly by following the gradient of an objective function (the expected future reward). The BC algorithm is similar to supervised learning: it directly learns the mapping from observation to action.
Policy Gradient. In reinforcement learning we have an agent, which at each timestep t selects an action a to execute in the environmental state s according to the policy π. The policy π is represented by a neural network with parameters θ, so the policy under these parameters is denoted πθ. For one trajectory τ = ⟨s1, a1, r1, s2, a2, r2, ...⟩, at each time step t, given the state st (for ManiSkill: point cloud, RGB-D, or robot state), the agent outputs an action at according to the policy πθ (for ManiSkill: the control signal of each joint) and obtains a reward rt from the environment (defined in the simulator). The probability of one particular episode occurring is

p_θ(τ) = p(s1) π_θ(a1 | s1) p(s2 | s1, a1) π_θ(a2 | s2) p(s3 | s2, a2) ··· = p(s1) Π_{t=1}^{T} π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t),   (1)
where p(s_{t+1} | s_t, a_t) is the environment dynamics. For an episode τ, the return obtained by the agent over the entire episode is denoted by R(τ). We want the expectation of the discounted cumulative return under the trajectory distribution to be as large as possible:

R̄_θ = Σ_τ R(τ) p_θ(τ) = E_{τ∼p_θ(τ)}[R(τ)],   R(τ) = Σ_t γ^{t−1} r(s_t, a_t).   (2)
For a policy π_θ with given parameters θ, maximizing the expected return R̄_θ is the optimization objective of Policy Gradient:

maximize_θ E_{τ∼p_θ(τ)}[R(τ)].   (3)
Suppose we now have N trajectories and the length of the n-th trajectory is T_n; then the gradient of the expected average return is

∇R̄_θ = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇ log π_θ(a_t^n | s_t^n).   (4)
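For illustration, the estimate (4) can be written as a loss whose gradient is the policy gradient; the PyTorch sketch below is a hypothetical fragment (the Gaussian policy interface and the trajectory format are assumptions, not the paper's code).

```python
# Minimal sketch of the policy-gradient estimate (4) for a Gaussian policy.
# 'trajectories' is assumed to be a list of (states, actions, ret) tuples,
# where ret = R(tau) is the return of the whole episode.
import torch

def policy_gradient_loss(policy, trajectories):
    losses = []
    for states, actions, ret in trajectories:
        mean, std = policy(states)                      # assumed policy interface
        dist = torch.distributions.Normal(mean, std)
        log_prob = dist.log_prob(actions).sum(dim=-1)   # log pi_theta(a_t | s_t)
        # Minimizing -R(tau) * sum_t log pi gives the ascent direction of (4).
        losses.append(-(ret * log_prob).sum())
    return torch.stack(losses).mean()
```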
Proximal Policy Optimization. A disadvantage of PG is that the parameters update slowly, because trajectories must be resampled every time the parameters are updated. PPO updates the policy with importance sampling [17] to reuse collected data, allowing the policy to be updated without resampling at every step. Since the PPO algorithm was proposed, many variants have appeared, such as TRPO [2], DPPO [18], batchPPO [19], etc. In this study we use PPO-Clip, whose optimization objective is

min( (π_θ(a | s) / π_θk(a | s)) A^{π_θk}(s, a), g(ε, A^{π_θk}(s, a)) ),   (5)

where

g(ε, A) = (1 + ε)A if A ≥ 0, and (1 − ε)A if A < 0,   (6)
where π_θk is the old policy that collected the data for the current update, ε is a small hyperparameter which roughly says how far the new policy is allowed to move away from the old one, A is the advantage estimator, and k is the current count of updates. Offline RL can be regarded as the extreme case of off-policy RL, in which the agent learns only from offline static datasets. Thus π_θk(a | s) can be regarded as the policy used for collecting the dataset; for fixed s and a, it is a constant.
Dataset. In point-cloud mode, the observation includes the state of the robot and its perception of the environment. The robot state is a 32-dimensional vector that describes the current state of the robot, including the pose, velocity and angular velocity of the robot's moving platform, the joint angles and joint velocities of all robot joints, etc. The perception of the environment consists of the RGB values of each point, the position of each point, and k task-relevant segmentation masks. There are 22 joints in the dual-arm robot and 13 in the single-arm robot in ManiSkill, so the actions are the continuous PID control signals of each joint (13- or 22-dimensional vectors) [15]. Rewards are task-specific dense rewards.
Feature Extraction from Point Cloud. In the dataset provided by ManiSkill-Bench, a point cloud containing several semantic masks is the observation of the manipulator, and the output of the whole architecture is the continuous control signals of each joint of the manipulator. The feature extraction network is composed of a PointNet for each mask and a Transformer encoder that produces the point cloud embedding.
PPO Architecture. The PPO algorithm contains two networks, the actor network and the critic network. The actor network outputs actions as a policy function, and the critic network outputs the value function. In Fig. 2, PN+Transformer serves as the point cloud feature extractor in point-cloud mode for a single task, and the obtained embedding is sent to the PPO policy network. For the task OpenCabinetDrawer, k = 3 (the robotic arm, the handles of the cabinet door, and the cabinet). Together with one input for the background without any segmentation mask and one for the entire point cloud, this gives k + 2 PointNets with the same architecture. A 256-dimensional point cloud embedding is then obtained after passing through the Transformer encoder. The actions in ManiSkill are continuous, so we choose a normal distribution as the policy function, and the policy network outputs the mean and variance of the actions by two separate fully connected networks. It is worth noting that during training we obtain actions by sampling from the normal distribution, but during evaluation the output of the policy is the mean of the trained normal distribution.
Behavior Cloning. For the filtered observation-action pair dataset D = {(o1, a1), (o2, a2), (o3, a3), ...}, the loss function of BC is the MSE of the predicted actions:

loss_BC = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} (a_t − â_t)².   (7)
The training loss function is then obtained by weighting the PPO loss and the BC loss:

loss = loss_BC + α · loss_PPO.   (8)
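A sketch of the combined objective (8) is given below. It is an illustrative PyTorch fragment rather than the authors' released implementation: the policy is assumed to output the mean and standard deviation of a diagonal Gaussian over joint controls, and the old log-probabilities and advantages are assumed to be precomputed from the dataset and the critic. The default α = 0.1 is the weight used in the comparison of Table 3.

```python
# Minimal sketch of the combined loss (8): loss = loss_BC + alpha * loss_PPO.
# Assumptions: 'policy' maps observation embeddings to the (mean, std) of a
# diagonal Gaussian; 'old_log_prob' and 'advantage' are precomputed.
import torch

def ppo_clip_loss(log_prob, old_log_prob, advantage, eps=0.2):
    ratio = torch.exp(log_prob - old_log_prob)           # pi_theta / pi_theta_k
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # the g(eps, A) term of Eq. (6)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

def combined_loss(policy, obs_emb, actions, old_log_prob, advantage, alpha=0.1):
    mean, std = policy(obs_emb)
    dist = torch.distributions.Normal(mean, std)
    log_prob = dist.log_prob(actions).sum(dim=-1)
    loss_bc = torch.nn.functional.mse_loss(mean, actions)        # Eq. (7)
    loss_ppo = ppo_clip_loss(log_prob, old_log_prob, advantage)  # Eqs. (5)-(6)
    return loss_bc + alpha * loss_ppo                            # Eq. (8)
```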
Fig. 2. Scheme of the developed approach: the robot state and the point cloud (with k task-relevant segmentation masks) are processed by per-mask PointNets and an encoder-only Transformer into a 256-dimensional embedding, which feeds the BC loss and the PPO actor-critic with a diagonal Gaussian action head.
4
Experiments
In the first experiment, we compare the feature extraction ability of PointNet with and without a Transformer. In the second experiment, we fix the PointNet+Transformer architecture, adjust the weight of the PPO correction term, and observe the task success rate in the ManiSkill environment. The training dataset consists of expert demonstrations given by ManiSkill-Bench. The mean success rates are calculated over 100 evaluation trajectories.
Experimental Environments in ManiSkill. OpenCabinetDrawer-v0 includes OpenCabinetDrawer_1045-v0, an environment that randomizes the target drawer (of cabinet No. 1045) every time the environment is reset. One cabinet can have multiple doors/drawers. OpenCabinetDrawer_1045_link_0-v0 limits the target drawer to the one corresponding to link 0.
So we calculate the average success rate of the algorithm for the single-object task in OpenCabinetDrawer_1045_link_0-v0, and for multiple objects in OpenCabinetDrawer-v0.

4.1
Experimental Results
As shown in Fig. 2, the dataset includes four parts: the robot state, the point cloud observation, the reward, and the continuous action. After the point cloud embedding is obtained through PointNet, it enters the BC loss calculation and the PPO policy update, respectively. In the experiment of Table 1, we verify the performance of different offline reinforcement learning methods with different point cloud feature extraction architectures, that is, PointNet with and without a Transformer encoder (PN and PN+Transformer). In the experiment of Table 2, we compare different weights of the PPO loss function and find that the smaller the weight of the PPO loss, the higher the success rate. In the experiment of Table 3, we fix the point cloud feature extraction network (PN+Transformer) and the PPO weight at 0.1 for comparison with other RL algorithms.

Table 1. Baselines (BC [4], BCQ [20], TD3+BC [21]) using point-cloud based network architectures on the training dataset. Each task is evaluated 300 times, and BC with PN+Transformer is averaged over five different runs. Environments are: OpenCabinetDrawer-v0, OpenCabinetDoor-v0, PushChair-v0 and MoveBucket-v0.

Architecture       | PN   | PN+Transformer
Algorithm          | BC   | BC          | BCQ  | TD3+BC
OpenCabinetDoor    | 0.15 | 0.28 ± 0.01 | 0.16 | 0.12
OpenCabinetDrawer  | 0.19 | 0.42 ± 0.02 | 0.22 | 0.22
PushChair          | 0.08 | 0.18 ± 0.02 | 0.14 | 0.11
MoveBucket         | 0.05 | 0.15 ± 0.01 | 0.06 | 0.04
Table 2. The average success rates over 100 evaluation trajectories with different PPO loss weights for the two feature extraction architectures. The average test success rates are calculated over the environments OpenCabinetDrawer-v0 and OpenCabinetDrawer_1045_link_0-v0.

Environment                       | Weight of loss_PPO | PN   | PN+Transformer
OpenCabinetDrawer_1045_link_0-v0  | α = 0.1            | 0.65 | 0.65
                                  | α = 1              | 0.32 | 0.3
OpenCabinetDrawer-v0              | α = 0.1            | 0.18 | 0.25
                                  | α = 1              | 0.07 | 0.16
Table 3. The average success rates over 100 evaluation trajectories compared with other RL algorithms using the PointNet+Transformer architecture. The average test success rates are calculated over environment OpenCabinetDrawer-v0 with α = 0.1. Algorithm
BC
BCQ TD3+BC PPO+BC
OpenCabinetDrawer-v0 0.33 0.17
4.2
0.42
0.25
Experiment Analysis
Table 1 shows that the Transformer architecture has a more vital ability to extract features from point cloud information. This is because the Transformer architecture can extract connections between point clouds. In the second experiment Table 2, we fixed the architecture of the neural network as PointNet + Transformer, adjusted the weight between the two loss functions and found that the larger the weight of the BC loss function, the higher the success rate. The reason for the analysis is that all datasets used are expert demonstrations. Using the BC algorithm can allow the agent to quickly learn the matching relationship between action and observation, but it is difficult for the PPO algorithm to improve policy based on this dataset. We compare our method with baselines in the last experiment Table 3. It outperforms BCQ but remains lower than BC and TD3+BC baseline.
5
Conclusion
In this study, we combine the PPO and BC algorithms and conduct experiments in the ManiSkill robotic arm environments, mainly taking the completion of the OpenCabinetDrawer task as the measure. Behavior cloning shows its undeniable power on such a deficient dataset, which is composed of only expert demonstrations. Adding the PPO correction term based on BC makes it outperform BCQ but remains lower than the BC and TD3+BC baseline. The policy iteration of the PPO requires a more extensive and diversified dataset. Otherwise, it is difficult for the PPO correction item to perform smooth policy iteration like the conventional PPO algorithm. In the future, we plan to evaluate our method on other robot control datasets. It is also advisable to study other reinforcement learning methods, including those based on transformer models. Acknowledgment. This work was partially supported by Russian Science Foundation, grant No. 20-71-10116, https://rscf.ru/en/project/20-71-10116/.
References 1. Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S.: D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020)
80
H. Zhang and D. Yudin
2. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897. PMLR (2015) 3. Zhuang, Z., Lei, K., Liu, J., Wang, D., Guo, Y.: Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312 (2023) 4. Pomerleau, D.A.: ALVINN: an autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems, vol. 1 (1988) 5. Jiang, Y., et al.: VIMA: general robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094 (2022) 6. Andrychowicz, M., et al.: Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39(1), 3–20 (2020) 7. Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37(4–5), 421–436 (2018) 8. Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., Levine, S.: Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103 (2018) 9. Schweighofer, K., et al.: Understanding the effects of dataset characteristics on offline reinforcement learning. arXiv preprint arXiv:2111.04714 (2021) 10. Belkhale, S., Cui, Y., Sadigh, D.: Data quality in imitation learning. arXiv preprint arXiv:2306.02437 (2023) 11. Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling. In: Advances in Neural Information Processing Systems, vol. 34, pp. 15084– 15097 (2021) 12. Reed, S., et al.: A generalist agent. arXiv preprint arXiv:2205.06175 (2022) 13. Bessonov, A., Staroverov, A., Zhang, H., Kovalev, A.K., Yudin, D., Panov, A.I.: Recurrent memory decision transformer (2023) 14. Zhao, T., et al.: Skill disentanglement for imitation learning from suboptimal demonstrations. arXiv preprint arXiv:2306.07919 (2023) 15. Mu, T., et al.: ManiSkill: generalizable manipulation skill benchmark with largescale demonstrations. arXiv preprint arXiv:2107.14483 (2021) 16. Pushkarev, D., et al.: Door opening strategy for mobile manipulator with constrained configuration. In: Ronzhin, A., Meshcheryakov, R., Xiantong, Z. (eds.) ICR 2022. LNCS, vol. 13719, pp. 130–142. Springer, Cham (2022). https://doi. org/10.1007/978-3-031-23609-9 12 17. Precup, D.: Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, p. 80 (2000) 18. Heess, N., et al.: Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286 (2017) 19. Hafner, D., Davidson, J., Vanhoucke, V.: TensorFlow agents: efficient batched reinforcement learning in TensorFlow. arXiv preprint arXiv:1709.02878 (2017) 20. Fujimoto, S., Meger, D., Precup, D.: Off-policy deep reinforcement learning without exploration. In: International Conference on Machine Learning, pp. 2052–2062. PMLR (2019) 21. Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 20132–20145 (2021)
Neuromorphic Computing and Deep Learning
Spiking Neural Network with Tetrapartite Synapse
Sergey V. Stasenko and Victor B. Kazantsev
Lobachevsky State University of Nizhny Novgorod, Nizhny Novgorod 603022, Russia
[email protected]
Abstract. In this paper we propose a model of a spiking neural network with a tetrapartite synapse. The proposed model explores the role of astrocytes and the extracellular matrix (ECM) in neural network dynamics. It includes a tetrapartite synapse, providing a realistic representation of synaptic transmission with astrocytic modulation and ECM activation. The simulations show that astrocytic modulation leads to neuron synchronization, while ECM activation causes desynchronization. These findings align with prior studies, suggesting that astrocytes enhance synaptic transmission and promote synchronization, while the ECM regulates neuronal activity. The model highlights the significance of non-neuronal cells and the ECM in understanding neural network dynamics, offering potential insights for neural disorders and therapeutic interventions. Future research can investigate the underlying mechanisms and therapeutic targets.
Keywords: computational neuroscience · spiking neural network · tetrapartite synapse · extracellular matrix of the brain · neuron-glial interaction
1
Introduction
Regulation of neuron dynamics is a complex process that involves a variety of cellular and molecular mechanisms. These mechanisms work together to ensure that neurons can adapt to changes in their environment and maintain proper function. Astrocytes are a type of glial cell found in the central nervous system that play a critical role in regulating neural activity. They provide support and maintenance to neurons, and have been shown to modulate synaptic transmission and plasticity, as well as influence neuronal excitability and firing patterns [1]. One of the ways that astrocytes regulate neural activity is through the uptake and release of neurotransmitters. Astrocytes express a variety of neurotransmitter receptors and transporters that allow them to monitor the levels of neurotransmitters in the extracellular space, and adjust their release accordingly. For example, astrocytes in the hippocampus have been shown to take up excess glutamate, a neurotransmitter that is involved in many aspects of brain function, and
release it in a regulated manner to prevent excitotoxicity and maintain proper synaptic transmission [2]. Astrocytes also play a role in modulating synaptic plasticity, the process by which neural connections are strengthened or weakened in response to activity. Through the release of signaling molecules such as ATP and D-serine, astrocytes can influence the activation of NMDA receptors, which are critical for synaptic plasticity [3]. In addition, astrocytes can directly influence neuronal excitability by releasing gliotransmitters, such as ATP, that can activate or inhibit ion channels in neurons. This can have important implications for neural circuit function and behavior [1]. Astrocytes are known to play an important role in the synchronization of neural networks. They can modulate synaptic transmission and regulate the activity of neural circuits by releasing gliotransmitters, such as ATP and D-serine, which can activate or inhibit ion channels in neurons. This can have important implications for the synchronization of neural activity and the generation of oscillatory patterns in the brain [4]. The extracellular matrix (ECM) plays an important role in regulating neural activity by providing structural support to neurons and modulating the activity of synapses. The ECM is a complex network of proteins and carbohydrates that surrounds cells and provides a physical scaffold for cell migration, adhesion, and communication. In the brain, the ECM is involved in many aspects of neural development, plasticity, and function [5]. One way that the ECM regulates neural activity is by modulating the formation and stability of synapses. The ECM contains a variety of proteins, such as thrombospondins and tenascins, which can bind to receptors on neurons and influence the formation and activity of synapses [5]. For example, the ECM protein tenascin has been shown to regulate the formation and plasticity of synapses in the hippocampus and cortex [6]. The ECM can also modulate neural activity by influencing the activity of ion channels and receptors on neurons. For example, the ECM protein hyaluronan has been shown to modulate the activity of ion channels involved in synaptic transmission and plasticity [7]. In addition, the ECM can bind to growth factors and other signaling molecules that can influence the activity of neurons and glia [5]. Astrocytes play an important role in regulating neuronal dynamics, and this phenomenon has been extensively studied using mathematical models based on experimental findings. One concept that has emerged from these studies is the “dressed neuron,” which describes how astrocyte-mediated changes in neural excitability can impact neuronal function [8,9]. Astrocytes have been suggested to act as frequency-selective “gatekeepers” and presynaptic regulators, with their gliotransmitters modulating presynaptic facilitation and depression [10,11]. The tripartite synapse model has been used to demonstrate how astrocytes participate in the coordination of neuronal signaling, particularly in spiketiming-dependent plasticity (STDP) and learning [12–19]. Biophysically detailed models and mean-field models have also been employed to study the astrocytic modulation of neuronal activity. These studies have revealed that functional gliotransmission is a complex phenomenon that depends on the nature of structural and functional coupling between astrocytic and synaptic elements [20–28].
Kazantsev et al. [29] proposed a mathematical model for the homeostatic regulation of neuronal activity by the extracellular matrix (ECM). The model utilized kinetic activation functions based on the Hodgkin-Huxley formalism [30] and showed that spikes in presynaptic neurons increase the average neuronal activity, the ECM concentration, and the synaptic weights. The model also revealed that the proteases cleaving the ECM increase when the threshold value of the average neuronal activity is reached, leading to homeostatic regulation and a decrease in neuronal activity. A newer model was proposed for a spiking neuron network interacting with an active extracellular field mimicking the brain ECM [31]. It utilized synaptic scaling caused by ECM activity, resulting in changes in synaptic weights. The simulations showed that weak ECM activity led to population bursting, while strong activity led to more regular burst dynamics and higher firing rates. Here we propose a model of a spiking neural network with a tetrapartite synapse, in which astrocytic modulation induces neuron synchronization, while ECM activation leads to desynchronization.
2
The Model
2.1
Neuron Model
The neurons of the neural network were modeled using the modified Izhikevich model [32]:

C_m dV_i/dt = 0.04V_i² + 5V_i + 140 − U_i + I_ext_i + I_syn_i + I_thr_i,
dU_i/dt = a(bV_i − U_i),   (1)
if V_i ≥ 30 mV, then V_i = c, U_i = U_i + d.

Here the dynamics of the membrane potential V_i is governed by the parameters a = 0.02, b = 0.5, c = −40 mV, d = 100, while the auxiliary variable U_i accounts for the activation and inactivation of the potassium and sodium membrane channels, respectively. In addition, there is an external current I_ext_i, whose values at the initial time point are randomly distributed within the range from 0 to I_ext^max = 41 mV/ms. When the membrane potential V_i reaches the threshold of 30 mV, an action potential (spike) is generated and the variables are reset. The term I_syn_i denotes the sum of the synaptic currents received from all N presynaptic neurons:

I_syn_i = Σ_{j=1}^{N} y_{i,j} w_{i,j},   (2)
Here I_syn_i is the total synaptic current of the postsynaptic neuron, with w_{i,j} denoting the weights of the glutamatergic and GABAergic synapses between the pre- and postsynaptic neurons, and N is the number of presynaptic neurons connected to the i-th neuron. Excitatory and inhibitory synapses have positive and negative weights, respectively. y_{i,j} is the output signal (synaptic neurotransmitter) of the presynaptic neuron contributing to I_syn_i. The synaptic weights are set randomly in the range from 20 to 30. A presynaptic spike causes a sharp increase in the postsynaptic current, followed by exponential decay; the spike changes the synaptic neurotransmitter concentration y_{i,j} according to the equation

dy_{i,j}/dt = −y_{i,j}/τ_y + b_y θ(t − t_sp_i),   (3)
where t_sp_i denotes the presynaptic spike times, τ_y is the relaxation time constant, and b_y is the amount of neurotransmitter released per spike. The parameters of Eq. (3) are τ_y = 4 ms, b_y = 1.
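For illustration, Eqs. (1)-(3) can be advanced with a simple Euler scheme; the Python sketch below is a minimal fragment in which the time step, C_m = 1, and the indexing convention (row i = postsynaptic, column j = presynaptic) are assumptions, and the astrocytic and ECM terms of the following subsections are omitted.

```python
# Minimal Euler-step sketch of the modified Izhikevich model (1) with the
# synaptic current (2) and neurotransmitter dynamics (3). Illustrative only:
# C_m is taken equal to 1, and the astrocyte/ECM terms are added elsewhere.
import numpy as np

N, dt = 300, 0.1                        # network size and time step (assumed)
a, b, c, d = 0.02, 0.5, -40.0, 100.0    # parameters of Eq. (1)
tau_y, b_y = 4.0, 1.0                   # parameters of Eq. (3)

V = np.full(N, c)                       # membrane potentials
U = b * V                               # recovery variables
y = np.zeros((N, N))                    # neurotransmitter y[i, j] per synapse
w = np.random.uniform(20, 30, (N, N))   # weight magnitudes (inhibitory ones would be negative)
I_ext = np.random.uniform(0, 41, N)     # external currents
I_thr = np.zeros(N)                     # threshold current, set by the ECM model (Eq. (7))

def step(V, U, y):
    I_syn = np.sum(y * w, axis=1)                        # Eq. (2)
    dV = 0.04 * V**2 + 5 * V + 140 - U + I_ext + I_syn + I_thr
    dU = a * (b * V - U)
    V, U = V + dt * dV, U + dt * dU
    fired = V >= 30.0                                    # spike condition of Eq. (1)
    V[fired], U[fired] = c, U[fired] + d                 # reset after a spike
    y[:] = y - dt * y / tau_y + b_y * fired[None, :]     # Eq. (3): decay + release on presynaptic spikes
    return V, U, y

for _ in range(1000):                                    # simulate 100 ms
    V, U, y = step(V, U, y)
```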
2.2
Astrocyte Model
A mean-field approach proposed in previous works [20,24] is used to describe the changes in gliotransmitter concentration:

dY_j/dt = −α_Y Y_j + β_Y / (1 + exp(−y_e + y_thr)),   (4)
Here e = 1, 2, 3, ... is the index of the excitatory neuron, Y represents the gliotransmitter concentration near the excitatory synapse, and α_Y indicates the clearance rate. The model parameter values are α_Y = 120 ms, β_Y = 0.5, y_thr = 3.5. In Eq. (4), the second term represents gliotransmitter production when the mean-field neurotransmitter concentration surpasses the threshold y_thr. Astrocytes affect the neurotransmitter release probability, causing synaptic potentiation or depression [33,34]. Our model includes potentiation of glutamatergic synapses, increasing the neurotransmitter release probability. Mathematically, this is represented as

I_syn_i = Σ_{j=1}^{M} y_{j,i} w_{j,i} (1 − γ_Y / (1 + exp(−Y_j + Y_thr))).   (5)
In our model, I_syn_i is the sum of all synaptic currents from the presynaptic neurons, w_{j,i} is the glutamatergic synapse weight, and the strength of the astrocytic influence is denoted by γ_Y. We set the threshold for the astrocytic influence at Y_thr = 2.
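A sketch of the astrocytic part, Eqs. (4)-(5), is given below as an illustrative continuation of the previous fragment; the integration step, the value of γ_Y, the interpretation of α_Y = 120 ms as a relaxation time, and the use of the mean synaptic neurotransmitter level as the drive y_e are all assumptions.

```python
# Minimal sketch of the gliotransmitter dynamics (4) and of the astrocytic
# factor that rescales glutamatergic synapses in Eq. (5). Illustrative only.
import numpy as np

alpha_Y = 1.0 / 120.0        # assumed: 120 ms interpreted as a relaxation time
beta_Y, y_thr = 0.5, 3.5
gamma_Y, Y_thr = 0.5, 2.0    # gamma_Y is an assumed value; Y_thr = 2 as in the text

def update_gliotransmitter(Y, y_mean, dt=0.1):
    """Euler step of Eq. (4); y_mean is the mean synaptic neurotransmitter level."""
    dY = -alpha_Y * Y + beta_Y / (1.0 + np.exp(-(y_mean - y_thr)))
    return Y + dt * dY

def astro_factor(Y):
    """Multiplicative factor (1 - gamma_Y / (1 + exp(-(Y - Y_thr)))) from Eq. (5)."""
    return 1.0 - gamma_Y / (1.0 + np.exp(-(Y - Y_thr)))

# Usage inside the network step of the previous sketch (assumed indexing,
# column j = presynaptic excitatory neuron):
# I_syn = np.sum(y * w * astro_factor(Y)[None, :], axis=1)
```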
2.3
Extracellular Matrix Model
We adopted the ECM dynamics approach proposed in prior works [29,31,35,36]. The model uses activity-dependent activation functions, akin to gating functions
in the Hodgkin–Huxley formalism [30]. For computational efficiency, we utilized the reduced ECM dynamics model from [35,36], with the key variables ECM and P described by ordinary differential equations:

dq_i/dt = −α_q q_i + β_q / (1 + exp(−V_i / k_q)),
dECM_i/dt = −(α_ECM + γ_P P) ECM_i + β_ECM H_ECM(q_i),   (6)
dP_i/dt = −α_P P_i + β_P H_P(q_i).

The parameters α_ECM,P determine the rate of spontaneous degradation of the ECM and protease concentrations, and the parameters β_ECM,P describe the rate of their activity-dependent production. We use sigmoid activation functions H_ECM,P [30,37] for the ECM and proteases. In addition, α_q is a rate constant, β_q is a scaling factor (0 < α_q < β_q), and k_q is a slope parameter (k_q < 1). It is also known from experimental work that the extracellular matrix of the brain can change the threshold for generating an action potential [6,38]:

I_thr_i = I_thr_i0 (1 − γ_ECM ECM),   (7)
where I_thr_i is the threshold current for generating an action potential and γ_ECM is the coefficient of the ECM influence on this threshold.
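The ECM equations (6) and the threshold modulation (7) can be sketched in the same way; in the fragment below all numerical rate constants, the slope and form of the sigmoid activation functions H_ECM and H_P, and the baseline threshold current are assumptions, since their values are not listed here.

```python
# Minimal sketch of the reduced ECM model (6) and the activity-dependent
# spike threshold (7). All numerical parameter values below are assumptions.
import numpy as np

alpha_q, beta_q, k_q = 0.01, 0.1, 0.5
alpha_ECM, beta_ECM, gamma_P = 0.001, 0.01, 0.1
alpha_P, beta_P = 0.01, 0.1
gamma_ECM, I_thr0 = 0.1, 10.0

def H(x, x_thr=1.0, k=0.1):
    """Sigmoid activation function (assumed form of H_ECM and H_P)."""
    return 1.0 / (1.0 + np.exp(-(x - x_thr) / k))

def ecm_step(q, ECM, P, V, dt=0.1):
    """One Euler step of the system (6), driven by the membrane potentials V."""
    dq = -alpha_q * q + beta_q / (1.0 + np.exp(-V / k_q))
    dECM = -(alpha_ECM + gamma_P * P) * ECM + beta_ECM * H(q)
    dP = -alpha_P * P + beta_P * H(q)
    return q + dt * dq, ECM + dt * dECM, P + dt * dP

def spike_threshold(ECM):
    """Eq. (7): the ECM concentration rescales the threshold current I_thr."""
    return I_thr0 * (1.0 - gamma_ECM * ECM)
```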
2.4
Spiking Neural Network
The model consists of 300 neurons with a 4:1 ratio of excitatory to inhibitory neurons. The neurons are connected randomly, with a 5% connection probability for glutamatergic synapses and a 20% probability for GABAergic synapses. In the model, astrocyte activity alters the probability of neurotransmitter release, effectively scaling the weights of the glutamatergic synapses, while extracellular matrix activity modifies the threshold for spike generation in excitatory neurons, maintaining homeostasis and preventing hyperexcitation.
2.5
Results
Let us consider the effect of the tetrapartite synapse on neuronal activity. To this end, we consider three cases:
– the activity of the spiking neural network without the influence of astrocytes and the extracellular matrix of the brain;
– the activity of the spiking neural network with the astrocytic influence on synaptic transmission;
– the activity of the spiking neural network with the astrocytic influence on synaptic transmission (potentiation) and the influence of the extracellular matrix of the brain on the excitability threshold of the neurons (increased threshold).
In the absence of any modulation, the neurons are in a tonic mode (upper Fig. 1). The evolution of the neuron’s membrane potential over time in the tonic
Fig. 1. Raster diagrams and population firing rates of neuronal activity in different cases: without modulation (top), with astrocytic modulation (middle), and with both ECM and astrocytic modulation (bottom).
Fig. 2. Evolution of the neuron’s membrane potential, Vi , over time in the tonic mode.
mode is shown in Fig. 2. Astrocytic potentiation of synaptic transmission leads to the emergence of neural network synchronization (middle Fig. 1). The evolution of the concentrations of the neurotransmitter, y, and the gliotransmitter, Y, over time in the case of astrocytic modulation of synaptic transmission is shown in Fig. 3. As can be seen from the population activity signal, stable burst dynamics of neuronal activity is formed (Fig. 1). Such activity can underlie both physiological and pathophysiological processes. To correct the state of neuronal hyperexcitation, the brain's extracellular matrix changes the threshold for spike generation in excitatory neurons, which leads to the disappearance of synchronization and the maintenance of homeostasis (bottom Fig. 1). The evolution of the concentrations of the extracellular matrix, ECM, and proteases, p, over time in the case of ECM modulation of the neuronal excitability threshold is shown in Fig. 4.
Fig. 3. Changes in the concentrations of the neurotransmitter, y, and the gliotransmitter, Y, over time in the case of astrocytic modulation of synaptic transmission.
Fig. 4. Changes in the concentration of extracellular matrix, ECM , and proteases, p, over time in the case of ECM modulation of the excitability threshold of neurons.
3
Conclusion
The proposed model is significant in understanding the role of astrocytes and extracellular matrix (ECM) in neural network dynamics. The tetrapartite synapse allows for a more realistic representation of synaptic transmission, including the involvement of astrocytes in synaptic scaling. The model’s simulation results suggest that astrocytic modulation can lead to neuron synchronization, while ECM activation can lead to desynchronization. This finding is consistent with previous studies indicating that astrocytes can enhance synaptic transmission and promote synchronization, while ECM can play a role in the homeostatic regulation of neuronal activity. Overall, the proposed model highlights the importance of considering the contribution of non-neuronal cells and the ECM in understanding neural network dynamics. The model’s results provide insight into how astrocytes and the ECM may influence neuronal synchronization, which could have implications for the study of neural disorders and the development of therapeutic interventions. Future studies could further investigate the underlying mechanisms of astrocytic modulation and ECM activation on neural network dynamics and explore the potential of these mechanisms as targets for therapeutic intervention. Acknowledgements. The study was supported by the Russian Science Foundation grant No. 22-71-00074.
SAMDIT: Systematic Study of Adding Memory to Divided Input in the Transformer to Process Long Documents Arij Al Adel(B) Moscow Institute of Physics and Technology, Dolgoprudny, Russia [email protected]
Abstract. Recently, processing long documents using a transformer has attracted the attention of the research community. This paper suggests a method to process long documents which combines segmentation, unified relative positional encoding for slot tokens, a masking technique, and additional memory slots related to segments. The main results show on-par performance with the baseline when using our method to process long documents, and superior performance compared to SLED. The memory content was studied and analyzed, revealing extensive interaction between memory slots and their related chunks.
Keywords: memory attention · transformer · summarization · long input
1 Introduction
Processing long documents has been handled by different approaches, using sparse attention methods and segmentation methods. To date, none of the proposed efficient transformers has outperformed its short-range counterpart on the same input length; their benefit is still the ability to process longer input. Preserving the main idea intact while compressing a source text is the gist of the summarization task. From the perspective of a chronological timeline, we can list the main attention-based models that handle long documents and at the same time have been applied to summarization tasks. The work [4] used a distinctive approach for summarization; one of its variants is DANCER PEGASUS. This work primarily aims to a) process long inputs and b) generate long outputs. The approach depends on special processing of the input and target rather than on attention: it divides the source and the target, relates each part of the source with a part of the target, processes each input part separately, and finally concatenates the resulting outputs. In other words, it divides the long input and long target into multiple smaller source-target pairs and then concatenates the resulting targets. PEGASUS [19] is a pre-training method designed for summarization tasks.
In 2020, BigBird attention [18] was proposed and used both in encoder-only transformers for MLM and QA tasks and in encoder-decoder transformers for generation tasks (QA and summarization). BigBird was a logical extension of ETC [1], except that it added random attention on top of block attention and global memory tokens. That work proved theoretically and practically that sparse attention gives comparable results. However, sparse attention does not come for free: obtaining better results requires additional layers, which adds further cost, so it is not necessarily cheaper than dense attention. The authors applied this attention to the summarization task by using BigBird on the encoder side of an encoder-decoder model, based on the assumption that the output length is small compared to the input length, and utilized PEGASUS [19], a pre-training method originally curated for the summarization pre-training objective. In 2022, several works appeared that process long documents and are applied to summarization tasks, such as LongT5 [6], which was built on top of T5. The main difference between T5 and LongT5 is that LongT5 uses two types of attention, local and transient global; it uses sparse sliding-window attention as in ETC [1]. The transient global attention is very similar to our proposed model; however, the memory tokens in LongT5 are generated on the fly by averaging separate chunks, and this generated memory is shared, i.e. the same for all chunks. In our model, each chunk has its own memory, which is updated during training. While LongT5 [6] takes the input as a whole and divides and processes the whole tokenized input inside the model, SLED [10] divides the tokenized input outside the model encoder into overlapping (imbricated) chunks. SLED [10] also uses the idea of fusion-in-decoder [11], which was originally used for the question answering task. SLED [10] is designed through software engineering effort rather than by changing the internal design of the model, which prevents it from benefiting from additional masking approaches or from using different positional encoding methods; SLED is essentially a wrapper around ready-to-use models such as BART [12] and T5, as mentioned in the paper. In this paper, we introduce our model, T5memModel, which combines several techniques: masking, segmentation of long inputs, global memory with unified positional encoding for memory tokens related to specific chunks, and fusion in the decoder. We initialize the proposed model from a pre-trained T5 model to avoid the cost of the pre-training stage. We trained the model on the summarization task using three datasets of different document lengths, from normal to long (SAMSum, CNN/DailyMail, and GovReport). The experiments showed competitive results with the T5 baseline and superior results compared to SLED as another model for processing long documents. Our paper includes an overview of positional encoding for long documents in transformers, an introduction to the proposed approach, the experimental setup, the data sets, and finally the results.1
1 For more information and details about the memory content and attention maps, and a small overview of positional encoding in the transformer, supplementary material is provided.
2 Positional Encoding for Long Documents
What about positional encoding in long documents? Were the types of positional information2 in the transformer helpful? To answer this question, we need to look into the positional encoding used by works that have already processed long text inputs. Many previous works have handled long inputs; here we introduce a list of them:
– Transformer-XL [3] proposed a novel positional encoding scheme for a language model (decoder-based transformer). The authors made the following modifications to the attention formulation used by [17]:

  A = E_q W^{(q)} W^{(k,e)T} E_k^T + U W^{(k,e)T} E_k^T + E_q W^{(q)} W^{(k,r)T} R + V W^{(k,r)T} R    (1)

  where:
  • E_q W^{(q)} W^{(k,e)T} E_k^T represents content-based addressing between the current segment and the previous segment; W^{(k,e)T} is the content-based key projection applied to the key (current and previous segment) embeddings.
  • U W^{(k,e)T} E_k^T governs a global content bias between the current segment and the previous segments. U ∈ R^d is a trainable parameter. The content bias is global because, according to the authors, the query vector U is the same for all query positions, which means the attentive bias towards different words should remain the same regardless of the query position. This term uses the same separate W^{(k,e)T} as the previous item.
  • E_q W^{(q)} W^{(k,r)T} R captures a content-dependent positional bias between the current and previous segments.
  • V W^{(k,r)T} R encodes a global positional bias. By the same logic, V ∈ R^d is a trainable parameter.

  Comparing with the formula of absolute position encoding used by [17], which is shown by Eq. (2),3

  A = E W^{(q)} W^{(k)T} E^T + P W^{(q)} W^{(k)T} E^T + E W^{(q)} W^{(k)T} P^T + P W^{(q)} W^{(k)T} P^T    (2)

  there were three major modifications:
  • First, they replaced the absolute positional embedding P used in the landmark attention paper [17] with R, a sinusoid encoding matrix without learnable parameters.
  • Second, they replaced the projected absolute position of the query with the parameters U and V, as explained above.
  • Third, they purposely separated the two weight matrices W^{(k,e)} and W^{(k,r)} to produce the content-based key vectors (W^{(k,e)T} E_k^T) and the location-based key vectors (W^{(k,r)T} R), respectively.
2 Look into the section "Positional encoding in transformers" in the supplementary material.
3 For more information, look into the supplementary material, subsection "Absolute position encoding".
Finally, the attention of Transformer-XL can be written as follows (a small code sketch of this decomposition is given after this list):

  A = (E_q W^{(q)} + U) W^{(k,e)T} E_k^T + (E_q W^{(q)} + V) W^{(k,r)T} R    (3)
– Compressive Transformers [14] used the same relative positional embedding scheme proposed by Transformer-XL.
– GMAT [7]: the positional embedding is represented in a new way, as a tuple (q, r) with r = p (mod 512) and q = p/512, both learnable; nothing else is mentioned about this positional embedding in this paper or later papers. The embedding is

  R = Q + A    (4)

  Each 0 ≤ r < 512 and 0 ≤ q < 64 has a distinct learnable vector: Q for q and A for r. It is essential to mention that the authors noted that the vectors for r could easily be replaced with any learnable vectors, such as those of BERT. Model positions go up to 2^15. Memory tokens have a fixed position; thus, positional embeddings are used only for the main sequence.
– BigBird [18]: using an encoder setup, this model takes BERT as a backbone for MLM tasks. Its main modification is the new design of the attention mechanism. Information about the position embeddings is mentioned in the paper only very briefly, stating that they are added to the input matrix, without any additional details. The BigBird model is also applied only on the encoder side in the encoder setup.
– ETC [1]: in this model, relative positions were used to handle structured input, as done in [16]. To scale to long inputs, local sparsity was used.
– Longformer [2]: for the encoder setup, the authors replaced the attention in RoBERTa with the Longformer attention mechanism and added extra position embeddings to support longer documents. They did not randomly initialize the new position embeddings; instead, they initialized them by copying the 512 position embeddings from RoBERTa multiple times to take advantage of RoBERTa's pre-trained weights. For the encoder-decoder setup, the corresponding model is LED.
– LongT5 [6]: this work used the same relative T5-style positional encoding [15] and used it as ETC [1] did, where the authors of ETC relied on [16] to represent hierarchical structures. Since it uses the same ETC attention ideas, it suffers from the same issues: it needs special data processing.
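To make the relative-position decomposition in Eqs. (1)-(3) concrete, the following minimal NumPy sketch computes Transformer-XL-style attention scores for a single head. The dimension names and the helper for building the sinusoidal matrix R are illustrative assumptions, not code from any of the cited papers, and the relative-shift indexing used in the actual model is omitted for brevity.

```python
import numpy as np

def sinusoid_matrix(num_positions, d):
    """Non-learnable sinusoidal encodings R for relative distances 0..num_positions-1."""
    pos = np.arange(num_positions)[:, None]
    dim = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))  # (num_positions, d)

def transformer_xl_scores(E_q, E_k, W_q, W_ke, W_kr, U, V):
    """Attention scores of Eq. (3): content term plus relative-position term.

    E_q: (L_q, d) query-token embeddings, E_k: (L_k, d) key-token embeddings,
    W_q, W_ke, W_kr: (d, d) projections, U, V: (d,) global biases.
    """
    R = sinusoid_matrix(E_k.shape[0], E_k.shape[1])
    content = (E_q @ W_q + U) @ (E_k @ W_ke).T   # (E_q W_q + U) W_ke^T E_k^T
    position = (E_q @ W_q + V) @ (R @ W_kr).T    # (E_q W_q + V) W_kr^T R^T
    return content + position                    # (L_q, L_k) score matrix

# toy usage
d, Lq, Lk = 8, 4, 6
rng = np.random.default_rng(0)
scores = transformer_xl_scores(rng.normal(size=(Lq, d)), rng.normal(size=(Lk, d)),
                               *(rng.normal(size=(d, d)) for _ in range(3)),
                               rng.normal(size=d), rng.normal(size=d))
print(scores.shape)  # (4, 6)
```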
3 The Proposed Approach
Although the approach proposed in this paper is multi-pronged, it is simple. We mimic humans reading long documents by chunking them into equal parts, which is why we process the documents one by one. First, we dynamically segment the input (one long document) into non-overlapping segments using a fixed block (segment) length, which we treat as a hyper-parameter. We bind a memory slot (one or more memory tokens; their number is a hyper-parameter) to each segment. We use a modified T5 relative position encoding to relate segments through the memory slot position encoding. We use masking so that each memory slot depends only on its related segment, while each segment relates at the same time to all memory slots of the whole segmented long input in the newly proposed T5MemAttention in the decoder; a sketch of this masking scheme is given below. For an example of the proposed attention, see the Appendix in the supplementary material.4
4 Supplementary material.
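The following sketch illustrates the segmentation-plus-memory masking described above: the input is split into fixed-length chunks, each chunk receives its own memory slot, and a boolean attention mask lets a memory token see only its own chunk while every chunk token can see all memory tokens. All names and the mask layout are our illustrative assumptions; the actual T5MemAttention implementation details are in the supplementary material.

```python
import numpy as np

def build_chunks_and_mask(num_tokens, chunk_len, mem_per_chunk):
    """Return token->chunk assignment and an attention mask over [memory tokens | input tokens]."""
    num_chunks = int(np.ceil(num_tokens / chunk_len))
    n_mem = num_chunks * mem_per_chunk
    size = n_mem + num_tokens
    mask = np.zeros((size, size), dtype=bool)  # True = attention allowed

    chunk_of_token = np.arange(num_tokens) // chunk_len
    chunk_of_mem = np.repeat(np.arange(num_chunks), mem_per_chunk)

    # Memory tokens attend only to their own chunk (and to themselves).
    for m in range(n_mem):
        mask[m, n_mem:][chunk_of_token == chunk_of_mem[m]] = True
        mask[m, m] = True

    # Input tokens attend to their own chunk and to *all* memory slots,
    # so chunks can communicate only through memory.
    for t in range(num_tokens):
        row = n_mem + t
        mask[row, :n_mem] = True
        mask[row, n_mem:][chunk_of_token == chunk_of_token[t]] = True
    return chunk_of_token, mask

tokens_per_doc, block, mem = 10, 4, 2
_, attn_mask = build_chunks_and_mask(tokens_per_doc, block, mem)
print(attn_mask.shape)  # (16, 16): 3 chunks * 2 memory tokens + 10 input tokens
```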
4 Datasets
– CNN/Daily Mail [8]: version 3.0.0, an English-language data set that can be used for both abstractive and extractive summarization. It has two fields: an article as a long text, and highlights that consist of a one- or two-sentence summary. The articles were written by journalists at CNN between April 2007 and April 2015 and at Daily Mail between June 2010 and April 2015.
– SAMSum [5]: this data set, written by linguists, is used for abstractive summarization. It contains about 16k messenger-like conversations with summaries.
– GovReport [9]: proposed for summarizing long documents, this dataset is challenging for several reasons; in particular, the summary itself is long and its content is scattered over a very long document (Table 1).
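As an illustration of how the length statistics in Table 1 can be obtained, the short script below tokenizes a dataset split with the pre-trained T5 tokenizer and reports mean, median, and maximum token counts. The Hugging Face dataset identifier and field names shown are assumptions for the CNN/DailyMail case; the numbers in Table 1 are taken from the paper, not from this script.

```python
import numpy as np
from datasets import load_dataset
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
data = load_dataset("cnn_dailymail", "3.0.0", split="train")  # assumed HF identifier

def length_stats(texts):
    lengths = [len(tokenizer(t).input_ids) for t in texts]
    return np.mean(lengths), np.median(lengths), np.max(lengths)

print("input  mean/median/max:", length_stats(data["article"]))
print("target mean/median/max:", length_stats(data["highlights"]))
```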
Table 1. Statistics for the used summarization data sets. The input length is measured in tokens using a pre-trained T5 tokenizer.

Data set       | Instances (Train / Validation / Test) | Tokenized input (Mean / Median / Max) | Tokenized target (Mean / Median / Max)
SAMSum         | 14732 / 818 / 819                     | 148 / 119 / 1153                      | 28 / 25 / 94
GovReport      | 17517 / 973 / 973                     | 10305 / 8571 / 324004                 | 637 / 658 / 2360
cnn dailymail  | 287113 / 13368 / 11490                | 985 / 898 / 5269                      | 75 / 70 / 3151
5 Experiments Setup
For fair comparison between all models, the setup was unified for all experiments on each data set separately. Since our model is based on T5, we compare it with SLED [10], which uses T5 as its backbone model and serves as the reference model for handling long documents. Starting with the model configurations: we initialize our models with the pre-trained T5 models [15] available in the HuggingFace Transformers library, version 4.24.0, to avoid the high cost of pre-training from scratch. The additional parameters for the added memory positional encoding were initialized and trained on the given data sets. We consider one model size, base, containing 222903552 parameters for the T5 model, 222904320 for the T5Mem model, and 247534848 parameters for the SLED model. For the other training hyperparameters, we fine-tune the models on each data set independently, using the AdamW [13] optimizer with a constant learning rate of 10^-5, a weight decay of 0.001, and a dropout rate of 0.1. We train the model over each data set with a batch size of 8, using 8 NVIDIA DGX A100 devices with 40 GB of memory, so that each device processes only one document at a time, simulating the human way of reading long documents one by one, for the experiments on the GovReport and CNN/Daily Mail data sets. The number of training epochs is mentioned for each experiment separately. For repeatability, we used a seed equal to 42 for all experiments, and during evaluation and testing we used beam search with 5 beams for generation. No penalty was used, and the maximum target output was defined for each data set during generation. We evaluate the models after every epoch on each data set. The pre-trained T5 tokenizer was used for all models, which makes the comparison more centered on the models and the data used, excluding the tokenization effect during training, evaluation, and testing (Table 2).

Table 2. Length of the model's input, output, and length of the generated text for evaluation and test. "no limit" means that the whole tokenized input was consumed by the models in all reported results.
Data set       | Maximum tokenized input length | Maximum tokenized target length | Maximum tokenized generated tokens length
SAMSum         | no limit                       | 94                              | 94
GovReport      | 3072                           | 384                             | 384
cnn dailymail  | 1024                           | 128                             | 128
6 Results and Discussion
– Results on the SAMSum Dataset: as can be seen in Table 3, T5mem gave better results than the baseline on all metrics except R1, and overcame SLED with block size 256 on all metrics.5 Increasing the memory capacity was expected to give better results; in practice, using a smaller memory capacity (8 memory tokens for every 256 tokens) was more efficient computationally and gave stronger ROUGE scores on the evaluation data set with block size 384 and better scores on the test data set with block size 256.6 More results about the memory content and memory attention can be found in the supplementary material.7 After analyzing the memory content, it turned out that the memory tokens in each memory slot store just one token in all positions for each chunk; thus, the reason for the ineffectiveness of increasing the memory size became fully understandable.
5 Experiments on the SAMSum data set were done using 8 Tesla P100 SXM2 GPUs.
6 Results of training the models and the baseline can be found on the wandb site.
7 Supplementary material.
Table 3. Comparing the ROUGE metrics for the proposed model with T5 as the baseline and SLED as another model for processing long documents on the SAMSum data set. 256 and 384 denote the block size; 32, 16, 8 denote the number of memory tokens in each slot.

Model                | Validation R1 / R2 / RL / RLsum   | Test R1 / R2 / RL / RLsum         | Loss train / validation / test
T5-base (baseline)   | 52.543 / 28.181 / 43.734 / 48.433 | 50.895 / 25.852 / 42.068 / 46.316 | 1.199 / 1.328 / 1.234
SLED 256             | 43.023 / 20.202 / 35.232 / 39.198 | 42.084 / 18.893 / 34.345 / 38.15  | 2.485 / 1.74 / 1.622
T5mem-base 256 32    | 52.092 / 27.855 / 43.46 / 47.994  | 50.394 / 25.904 / 42.032 / 45.97  | 1.201 / 1.332 / 1.241
T5mem-base 256 16    | 51.84 / 27.654 / 43.087 / 47.76   | 51.086 / 26.672 / 42.643 / 46.76  | 1.224 / 1.328 / 1.233
T5mem-base 256 8     | 52.084 / 27.815 / 43.664 / 48.047 | 51.305 / 26.346 / 42.251 / 46.456 | 1.203 / 1.331 / 1.236
T5mem-base 384 8     | 52.503 / 28.23 / 43.853 / 48.507  | 50.996 / 26.002 / 42.685 / 46.916 | 1.22 / 1.347 / 1.24
– Results on the CNN/DailyMail Dataset: the experiments on CNN/Daily Mail ran for 10 epochs, but the results in Table 4 are reported for epoch six, since the models start to overfit after epoch four.8

Table 4. Comparing the ROUGE metrics for the proposed model with T5 as the baseline and SLED as another model for processing long documents on the CNN/Daily Mail data set.

Model                  | Validation R1 / R2 / RL / RLsum   | Test R1 / R2 / RL / RLsum         | Loss train / validation / test
T5-base (baseline)     | 44.074 / 21.421 / 31.154 / 41.024 | 43.21 / 20.647 / 30.566 / 40.102  | 1.262 / 1.409 / 1.443
SLED 256               | 42.196 / 19.99 / 29.865 / 39.14   | – / – / – / –                     | 1.609 / – / –
T5mem-base 8mem 384b   | 43.67 / 21.023 / 30.759 / 40.603  | 42.905 / 20.483 / 30.381 / 39.871 | 1.255 / 1.394 / 1.422
T5mem-base 32mem 256b  | 43.547 / 26.971 / 30.76 / 40.532  | 42.777 / 20.332 / 30.226 / 39.723 | 1.295 / 1.442 / 1.455
As we can see from the table, our model overcomes SLED as a model for processing long documents, but neither our model nor SLED was able to overcome the baseline T5 under the same conditions. Several observations can be made here, since both SLED and our model rely on fusing separate encoder representations in the decoder. We can infer that using memory slots is important for communication between the blocks, since in the SLED configuration there is no prefix used for the summarization task.
– Results on GovReport: the results are displayed in Table 5. Since a powerful server was used,9 and for a fair comparison, the same long input length was used for all experiments on the GovReport data set:10 the tokenized input length is 3072 and the tokenized target length is 384. As we see, T5 still overcomes both SLED and our model, while our model overcomes SLED on all metrics. This emphasizes that using the whole length is better than chunking it, although it requires more resources (time and computational cost). GovReport is a very challenging data set for two reasons: the input is very long, and the output is long too.
8 More information about the results can be found on the wandb site.
9 NVIDIA DGX A100 320 GB.
10 Full details of the experiment training and hyperparameters can be found on the wandb site.

Table 5. Comparing the ROUGE metrics for the proposed model with T5 as the baseline and SLED as another model for processing long documents on the GovReport data set.
Model                 | Validation R1 / R2 / RL / RLsum   | Test R1 / R2 / RL / RLsum         | Loss train / validation / test
T5-base (baseline)    | 52.796 / 22.655 / 27.96 / 48.866  | 53.35 / 23.328 / 28.614 / 49.578  | 1.773 / 1.751 / 1.808
SLED1024 256          | 38.59 / 14.579 / 25.357 / 33.801  | 38.82 / 15.058 / 25.785 / 34.254  | 2.216 / 1.963 / 1.959
SLED3072 256          | 36.02 / 11.545 / 21.463 / 31.852  | 36.305 / 11.809 / 21.704 / 32.175 | 2.381 / 2.044 / 2.104
T5mem-base 8mem 256b  | 49.727 / 20.241 / 26.442 / 45.668 | 50.058 / 20.833 / 26.838 / 46.115 | 1.825 / 1.785 / 1.843
Consistent with the results on the previous data sets, using a smaller memory slot size with a longer block length gave better results. This is understandable since a memory slot effectively stores just one token across all of its memory tokens; using more memory tokens will not help unless these tokens store different compressed representations of each chunk.
7 Conclusion
Despite the state-of-the-art results of the transformer on different NLP tasks, processing long documents is still one of the transformer's main weaknesses, and processing long text inputs is currently a hot research area. This paper proposes a multi-sided transformer framework to process long inputs. Our main results show that chunking the long input into non-overlapping chunks during encoding allows these chunks to communicate through the related memory slots. Chunking the long input into equal chunks dynamically allows the computation time of the model to grow linearly with the number of chunks instead of quadratically; this scales well with the input length and gives competitive results on the summarization task. The proposed model outperforms SLED in all experiments. Depending on the dataset used, the proposed model outperforms the baseline or works on par with it. Analyzing the memory content reveals that memory tokens with the same positional encoding store identical tokens for each related chunk. Analyzing the attention maps showed active interaction between memory tokens and their related chunks, and active interaction between chunk tokens and most of the memory tokens in all layers; notably, this interaction was present from the first layers, where the memory was still empty. It is therefore worth investigating memory applications in a more creative way than we did here: we could use separate attention for each memory slot and its related chunk, we could form an initial memory representation of the chunk first and then apply attention to the related chunk, and we could give the memory tokens in each slot different positional encodings. This requires experiments with different design variants and investigating the best layer at which to use the memory slots. Research in this direction is very extensive, and many variants can be investigated.
References 1. Ainslie, J., et al.: Encoding long and structured inputs in transformers. In: EMNLP (2020) 2. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. ArXiv abs/2004.05150 (2020) 3. Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., Salakhutdinov, R.: Transformer-XL: attentive language models beyond a fixed-length context. ArXiv abs/1901.02860 (2019) 4. Gidiotis, A., Tsoumakas, G.: A divide-and-conquer approach to the summarization of long documents. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 3029–3040 (2020) 5. Gliwa, B., Mochol, I., Biesek, M., Wawer, A.: SAMSum corpus: a human-annotated dialogue dataset for abstractive summarization. In: Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, pp. 70–79. Association for Computational Linguistics (2019). https://www.aclweb.org/anthology/ D19-5409 6. Guo, M., et al.: LongT5: efficient text-to-text transformer for long sequences. In: NAACL-HLT (2022) 7. Gupta, A., Berant, J.: GMAT: global memory augmentation for transformers. ArXiv abs/2006.03274 (2020) 8. Hermann, K.M., et al.: Teaching machines to read and comprehend. In: Advances in Neural Information Processing Systems (NIPS) (2015). http://arxiv.org/abs/ 1506.03340 9. Huang, L.R., Cao, S., Parulian, N.N., Ji, H., Wang, L.: Efficient attentions for long document summarization. In: North American Chapter of the Association for Computational Linguistics (2021) 10. Ivgi, M., Shaham, U., Berant, J.: Efficient long-text understanding with short-text models. ArXiv arXiv:2208.00748 (2022) 11. Izacard, G., Grave, E.: Leveraging passage retrieval with generative models for open domain question answering. In: Conference of the European Chapter of the Association for Computational Linguistics (2020) 12. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Annual Meeting of the Association for Computational Linguistics (2019) 13. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017) 14. Rae, J.W., Potapenko, A., Jayakumar, S.M., Lillicrap, T.P.: Compressive transformers for long-range sequence modelling. ArXiv arXiv:1911.05507 (2020) 15. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv arXiv:1910.10683 (2020) 16. Shaw, P., Massey, P., Chen, A., Piccinno, F., Altun, Y.: Generating logical forms from graph representations of text and entities. ArXiv arXiv:1905.08407 (2019) 17. Vaswani, A., et al.: Attention is all you need. In: NIPS (2017) 18. Zaheer, M., et al.: Big bird: transformers for longer sequences. ArXiv arXiv:2007.14062 (2020) 19. Zhang, J., Zhao, Y., Saleh, M., Liu, P.J.: PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. ArXiv arXiv:1912.08777 (2020)
Towards Solving Classification Tasks Using Spiking Neurons with Fixed Weights Alexander G. Sboev1,2(B) , Alexey V. Serenko1 , Dmitry E. Kunitsyn1,2 , Roman B. Rybka1,2 , and Vadim V. Putrolaynen3 1 National Research Centre “Kurchatov Institute”, Moscow, Russia
[email protected]
2 National Research Nuclear University MEPhI, Moscow, Russia 3 Institute of Physics and Technology, Petrozavodsk State University, Petrozavodsk, Russia
Abstract. The paper is devoted to an unexplored problem of using layers of spiking neurons with fixed (non-trainable) weights within a network for solving classification tasks. A layer of spiking neurons with non-trainable weights, either fixed on the basis of logistic functions or drawn from a uniform random distribution, is shown to be an efficient extractor of meaningful features for subsequent processing by a linear classifier. The output spiking rates of the proposed layer allow predicting classes by logistic regression with F1-macro scores of 94%, 96% and 97% for the classification tasks of handwritten digits, Wisconsin breast cancer and Fisher's Iris, respectively. Therefore, the proposed layer could serve as a feature extraction layer, facilitating the development of compact and efficient SNN models for solving classification tasks. Keywords: spiking neural networks · logistic maps · reservoir computing
1 Introduction
Spiking neural networks, where information is represented in sequences of spikes [1], are promising for machine learning applications due to the possibility of their implementation in neuromorphic hardware with ultra-low power consumption [2, 3]. Several approaches exist for creating a spiking neural network that solves a classification task [4]. The synaptic weights of the network can be obtained by converting an externally trained artificial network of Rectified Linear Unit (ReLU) neurons [5], but this would not allow one to benefit from the advantages of hardware implementation during the training stage. Training the spiking network directly with error backpropagation [6] has proved to achieve high accuracy with deep networks [7], but calculating the error and delivering it to every synapse to govern its weight change would complicate the circuitry of a neurochip and might become a bottleneck in large networks. At the same time, spiking networks can be trained on the basis of local synaptic plasticity [8], implementable in digital [9] or analog [10–12] devices, to solve a number of classification tasks [13–19]. However, a method for obtaining spiking neural networks that would be implementable in hardware and at the same time robustly efficient on various classification tasks still remains an unresolved problem [4, 20].
One promising approach for applying spiking neural networks to machine learning tasks is reservoir networks, where a large number of neurons with fixed (non-trainable) synaptic weights perform some random non-linear transformation of the input data, making it linearly separable. The synaptic weights being fixed helps reduce the number of trainable parameters of the network. However, the usage of a feedforward layer of such non-trainable spiking neurons has not yet been analyzed in detail. The aim of this paper is thus to study the performance of a non-trainable layer of spiking neurons in transforming the input data so as to make it linearly separable. Decoding the output spike counts into classes is performed by logistic regression. The contributions of this paper are the following:
• Layers of spiking neurons with fixed weights proved to be an efficient extractor of meaningful features for their subsequent processing by a linear classifier.
• Two methods for setting the synaptic weights are compared: logistic functions [21] and random weights, described in Sect. 4. Classification accuracy is shown to be similar with both methods.
• Different input normalization and preprocessing approaches are tested (described in Sect. 3), and numerical experiments show that different classification tasks require different preprocessing in order to achieve a tradeoff between classification performance and the simplicity of preprocessing.
• A classifier based on the proposed non-trainable layer is evaluated on several benchmarks from different domains: images of handwritten digits from the scikit-learn library and real-valued vectors of Wisconsin Breast Cancer and Fisher's Iris (described in Sect. 2). The accuracies obtained are on par with existing approaches for training spiking neural networks.
2 Datasets
In order to evaluate the performance of the proposed spiking layer, we use it together with a decoder and apply it to three benchmark classification tasks. The Optical Recognition of Handwritten Digits data set [22] (further referred to as Digits), as available in the scikit-learn library [23], contains a total of 1797 samples, about 180 samples per each of its 10 classes. Each sample is an 8×8 matrix of integer pixel intensities in the range 0 to 16. The Breast Cancer Wisconsin (Diagnostic) data set [24] (further called Breast cancer) consists of 569 samples of two classes: malignant (212 samples) and benign (357 samples). Each sample is a vector of 30 rational values that describe various characteristics of the cell nuclei present in the digitized image of a fine needle aspirate (FNA) of breast mass. The Fisher's Iris data set [25] (further called Iris) has 150 samples of 3 classes (Iris Setosa, Iris Virginica, and Iris Versicolor), 50 samples per class. Each sample is a 4-dimensional real-valued vector of the sepal length and width and the petal length and width of a flower.
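For reference, all three benchmarks are available directly in scikit-learn, so the evaluation data can be loaded in a few lines; this loading snippet is our own illustration and is not taken from the paper.

```python
from sklearn.datasets import load_digits, load_breast_cancer, load_iris

datasets = {
    "Digits": load_digits(),                 # 1797 samples, 8x8 images, 10 classes
    "Breast cancer": load_breast_cancer(),   # 569 samples, 30 features, 2 classes
    "Iris": load_iris(),                     # 150 samples, 4 features, 3 classes
}
for name, ds in datasets.items():
    print(name, ds.data.shape, len(set(ds.target)))
```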
3 Input Preprocessing
Before presenting the input samples to the spiking neurons, different methods of normalization and preprocessing are applied. Aiming towards a prospective fully spiking processing pipeline, where as little computing demand as possible should be laid upon conventional computers outside of the spiking network, we prioritize the simplest preprocessing methods. If our proposed layer together with its decoder is unable to achieve satisfactory classification accuracy (i.e. comparable to that reported in the literature for other spiking-network-based approaches) without preprocessing, we try combinations of the following preprocessing methods:
1. Sigmoid function (further called Sigmoid): each component x_i of a vector x of the dataset is replaced with 1 / (1 + e^{-a \cdot x_i}), where a > 0 is a tunable parameter, found as described in the section "Experiments";
2. Normalizing to zero mean and unit variance (further called StandardScaler): x_i \leftarrow (x_i - u) / s, where u is the mean of the i-th component over all samples of the training set and s is its standard deviation over the training samples;
3. Gaussian receptive fields (further referred to as GRF): each component of the input vector is expanded into M values based on the distance of the component value x_i to the center of the corresponding receptive field \mu_j:
   x_{ij} \leftarrow e^{-\left(\frac{x_i - \mu_j}{\sigma}\right)^2}, \quad j = 1 \ldots M,
   where \mu_j = \min x_i + \frac{j-1}{M} \cdot (\max x_i - \min x_i), so that the receptive fields cover the entire range of the values of x_i in the training set, and \sigma = \frac{1}{2} \cdot \frac{1}{M-2}, as commonly used for spiking neural networks [26–28]. In order to unify the ranges of all components i = 1...K, StandardScaler is applied before GRF.
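A compact sketch of the three preprocessing options, written against the formulas above. The function names and the choice to return a flat feature vector for GRF are our assumptions; a and M are the tunable hyperparameters mentioned in the text.

```python
import numpy as np

def sigmoid_preproc(X, a=1.0):
    """Component-wise sigmoid: x -> 1 / (1 + exp(-a * x))."""
    return 1.0 / (1.0 + np.exp(-a * X))

def standard_scaler(X_train, X):
    """Zero mean, unit variance, with statistics taken from the training set."""
    u = X_train.mean(axis=0)
    s = X_train.std(axis=0) + 1e-12
    return (X - u) / s

def gaussian_receptive_fields(X_train, X, M=6):
    """Expand each component into M Gaussian receptive-field activations."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)       # per-component range
    j = np.arange(1, M + 1)
    mu = lo[:, None] + (j - 1) / M * (hi - lo)[:, None]      # centers, shape (K, M)
    sigma = 0.5 / (M - 2)
    expanded = np.exp(-((X[:, :, None] - mu[None, :, :]) / sigma) ** 2)
    return expanded.reshape(X.shape[0], -1)                  # flatten to K*M features
```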
4 Spiking Neuron Layer Setup
The proposed layer (depicted in Fig. 1) consists of N Leaky Integrate-and-Fire neurons (see Sect. 5). Each of the input vector components is assigned a Poisson generator, and during the time T of presenting an input vector x, the i-th generator emits spikes with the mean rate R \cdot x_i. All generators are connected to all neurons via synapses with weights that are fixed throughout processing the entire dataset. Two ways of setting the weights are compared. When setting the weights on the basis of logistic functions [21], the efficacy w_{ij} of the synapse connecting the i-th input generator to the j-th neuron is defined by the following recurrence relation:
   w_{i1} = A \cdot \sin\left(\frac{i \cdot \pi}{K \cdot B}\right),
   w_{i,j+1} = 1 - r \cdot w_{ij}^2,
where K is the number of inputs and A, B, and r are adjustable parameters.
Fig. 1. The scheme of employing the spiking layer in a classification task
For comparison, we also consider random weights drawn from a uniform distribution from wlow to whigh . With the Digits dataset, two ways of assigning the input vector component indices i to the image pixels are considered: the pixel matrix of the image is flattened into a vector either by concatenating its rows or by traversing it with the pattern [21, Fig. 1 (d)] that proved most efficient for the weights based on logistic functions, rescaling that pattern for the 8 × 8 image. The latter way is called image transformation in Tables 1 and 2. The output of the proposed layer is defined as a vector of N integer values: the numbers of spikes emitted by each of the N neurons during the time T of presenting an input vector.
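The two weight-initialization schemes can be written down in a few lines; the sketch below follows the recurrence above for the logistic-function weights and a uniform distribution for the random weights. The seeding and array layout are our illustrative choices, and the example parameter values are taken from Table 1.

```python
import numpy as np

def logistic_map_weights(K, N, A, B, r):
    """Fixed weights w[i, j]: w[i, 0] = A*sin(i*pi/(K*B)), w[i, j+1] = 1 - r*w[i, j]**2."""
    w = np.empty((K, N))
    i = np.arange(1, K + 1)
    w[:, 0] = A * np.sin(i * np.pi / (K * B))
    for j in range(N - 1):
        w[:, j + 1] = 1.0 - r * w[:, j] ** 2
    return w

def random_uniform_weights(K, N, w_low, w_high, seed=None):
    """Fixed weights drawn from U(w_low, w_high), generated independently per fold."""
    rng = np.random.default_rng(seed)
    return rng.uniform(w_low, w_high, size=(K, N))

W_logistic = logistic_map_weights(K=64, N=217, A=0.3, B=5.9, r=1.99)   # Digits settings
W_random = random_uniform_weights(K=30, N=249, w_low=0.14, w_high=4.08, seed=0)  # Breast cancer settings
```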
5 Neuron Model
The dynamics of a spiking neuron j = 1...N in the proposed layer is governed by the Leaky Integrate-and-Fire model [29], one of the most computationally simple among existing neuron models [30, 31], in which the state of a neuron is described by its membrane potential V_j(t):

   C_m \frac{dV_j}{dt} = \frac{V_{rest} - V_j(t)}{\tau_m} + I_{syn}^j(t),

where C_m = 250 pF, V_{rest} = -70 mV, \tau_m = 5 ms. As soon as V_j(t) \geq V_{th}, the neuron fires a spike, after which V_j(t) is instantaneously reset to V_{rest} and is clamped to it during the refractory period \tau_{ref} = 2 ms. I_{syn}^j is the incoming current from the synapses, to which an input spike emitted by the i-th input generator at time t_{sp}^i adds an exponential pulse:

   I_{syn}^j(t) = \sum_i \sum_{t_{sp}^i} w_{ij} \frac{q_{syn}}{\tau_{syn}} e^{-\frac{t - t_{sp}^i}{\tau_{syn}}} \theta\left(t - t_{sp}^i\right),

where q_{syn} = 5 fC, \tau_{syn} = 5 ms, \theta(t) is the Heaviside step function, and w_{ij} is the weight of the synapse connecting the i-th input generator to the current neuron. The neuron constants are set in accordance with our prior work [32], except for the membrane capacity C_m and the threshold V_{th}, which are adjusted separately for each classification task.
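To illustrate the neuron model, here is a minimal forward-Euler simulation of a single LIF neuron driven by Poisson input spikes through exponential synaptic currents, integrating the equations above as printed. The time step, the example threshold, weights and input rates are illustrative assumptions (loosely based on the Breast cancer row of Table 1).

```python
import numpy as np

def simulate_lif(weights, rates_hz, T_ms=346.0, dt=0.1,
                 C_m=250.0, tau_m=5.0, tau_syn=5.0, q_syn=5.0,
                 V_rest=-70.0, V_th=-56.75, tau_ref=2.0, seed=0):
    """Single LIF neuron with exponential synaptic currents; returns its spike count.

    Quantities follow the units in the text (pF, ms, mV, fC); rates in Hz;
    `weights` is one column w[:, j] of the fixed weight matrix.
    """
    rng = np.random.default_rng(seed)
    steps = int(T_ms / dt)
    V, I, refractory = V_rest, 0.0, 0.0
    spike_count = 0
    for _ in range(steps):
        # Poisson input: generator i fires in this step with probability rate_i * dt / 1000.
        fired = rng.random(len(rates_hz)) < rates_hz * dt * 1e-3
        I += weights[fired].sum() * q_syn / tau_syn   # pulse onset for each arriving spike
        I -= I * dt / tau_syn                         # exponential decay of the current
        if refractory > 0:
            refractory -= dt
            V = V_rest                                # clamped during the refractory period
        else:
            V += ((V_rest - V) / tau_m + I) * dt / C_m
            if V >= V_th:
                spike_count += 1
                V, refractory = V_rest, tau_ref
    return spike_count

w = np.random.default_rng(1).uniform(0.14, 4.08, size=30)   # one neuron's input weights
x = np.random.default_rng(2).random(30)                      # a (preprocessed) input sample
print(simulate_lif(w, rates_hz=210.0 * x))                   # output spike count during T
```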
6 Experiments
In order to evaluate the feasibility of the proposed non-trainable spiking layer as the first layer within a prospective spiking neural network, it is applied to classification tasks together with a decoder, which fulfills the role of a substitute for the prospective subsequent spiking layer or layers. We consider the feature extraction layer efficient if it makes the data linearly separable, and therefore we use a linear classifier as the decoder, for which we choose Logistic Regression (from the scikit-learn library, with all parameters left at their default values except the maximum number of iterations). Classification performance is assessed using 5-fold cross-validation: all vectors of the dataset are split into five non-overlapping subsets, each subset containing approximately equal numbers of vectors from each class (stratified 5-fold splitting with shuffling). In different folds, a different subset is considered the testing set, and the remaining subsets form the training set. For each fold, the classification of its training and testing sets involves the following (a code sketch of this pipeline is given below):
1. The training and testing sets are preprocessed, with the mean and variance (if StandardScaler is used) or minimum and maximum (if GRF are used) of each component calculated on the training set.
2. The weights of the spiking layer are set using one of the two methods under consideration; if the weights are random, they are generated independently for each fold. The training and testing sets are encoded by spike rates and presented to the spiking neurons, and the output spike counts are recorded.
3. The Logistic Regression decoder is trained on the output spike counts of the training set and then predicts the classes for the training and testing sets.
4. The F1-macro metric is calculated for the training and testing sets.
The overall classification performance is characterized by the mean, minimum and maximum values of F1-macro on the testing sets over all folds. The preprocessing parameters (the sigmoid function parameter a or the number of receptive fields M), the spiking layer parameters (the number of neurons N, the neuron threshold V_th, the input spike rate coefficient R, the input sample duration T), and the weight parameters (r, A, and B for weights based on logistic functions; w_low and w_high for random weights) have been adjusted separately for each combination of dataset, preprocessing, and weight setting. Parameter adjustment has been performed using hyperopt [33], with the objective to maximize being the F1-macro score on the training set averaged over all folds. The optimal parameters found are presented in Table 1.
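The evaluation protocol above can be sketched as follows with scikit-learn; `spiking_layer_counts` is a placeholder for the simulation of the fixed-weight spiking layer (e.g., built from the weight and neuron sketches above), and the `max_iter` value is an assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def spiking_layer_counts(X, weights):
    """Placeholder for the fixed-weight spiking layer: returns per-neuron spike counts."""
    # A real run would rate-encode X and simulate the LIF neurons for the duration T.
    return np.maximum(X @ weights, 0.0)

data = load_breast_cancer()
X, y = data.data, data.target
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    rng = np.random.default_rng()                      # random weights drawn anew per fold
    W = rng.uniform(0.14, 4.08, size=(X.shape[1], 249))
    train_counts = spiking_layer_counts(X[train_idx], W)
    test_counts = spiking_layer_counts(X[test_idx], W)
    clf = LogisticRegression(max_iter=5000).fit(train_counts, y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(test_counts), average="macro"))
print(np.mean(scores), np.min(scores), np.max(scores))
```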
Table 1. Input preprocessing and spiking layer parameters used for each dataset and weight setting

№ | Dataset       | Preprocessing                                | N   | Weight parameters            | V_th, mV | R, Hz | T, ms
1 | Digits        | GRF with M = 6                               | 217 | r = 1.99, A = 0.3, B = 5.9   | -66.32   | 18510 | 278
2 | Digits        | Sigmoid with a = 2.99, image transformation  | 217 | r = 1.99, A = 0.3, B = 5.9   | -66.32   | 18510 | 278
3 | Breast cancer | –                                            | 249 | wlow = 0.14, whigh = 4.08    | -56.75   | 210   | 346
4 | Breast cancer | –                                            | 586 | r = 1.58, A = 0.3, B = 5.9   | -47.43   | 496   | 146
5 | Iris          | –                                            | 230 | r = 1.08, A = 0.3, B = 5.9   | -68.95   | 5407  | 1903
6 | Iris          | –                                            | 11  | wlow = 0.45, whigh = 82.3    | -61.96   | 10669 | 5000
7 Results
On the Fisher's Iris and Wisconsin Breast Cancer datasets, accuracy comparable to existing results of spiking neural networks can be achieved without any preprocessing (see rows 3–6 in Table 2), and adding the GRF preprocessing does not lead to a notable improvement in accuracy (not included in the table). On the Digits data set, however, the spiking layer model with weights based on logistic functions requires preprocessing with the sigmoid and transforming the image into a vector using the transformation pattern (see row 2 in Table 2); otherwise the decoder fails to achieve satisfactory accuracy (not included in the table). The layer with random weights, in its turn, requires Gaussian receptive fields (see row 1 in Table 2), but cannot work with the sigmoid preprocessing or without preprocessing (not included in the table). Overall, on all three datasets, setting the weights on the basis of logistic functions leads to accuracy similar to setting them randomly.
Table 2. The mean, minimum and maximum F1-macro on the testing set of 5 cross-validation folds, achieved for each dataset and for each method of setting weights, along with accuracies of existing spiking neural network learning methods from the literature

№ | Dataset       | Preprocessing                  | Weights                          | F1-macro Mean | Min  | Max
1 | Digits        | GRF                            | random                           | 0.94 | 0.93 | 0.95
2 | Digits        | Sigmoid, image transformation  | logistic function                | 0.89 | 0.87 | 0.91
3 | Breast cancer | –                              | random                           | 0.96 | 0.92 | 0.98
4 | Breast cancer | –                              | logistic functions               | 0.87 | 0.83 | 0.91
5 | Iris          | –                              | logistic functions               | 0.97 | 0.93 | 0.99
6 | Iris          | –                              | random                           | 0.93 | 0.90 | 0.96
7 | Digits        | GRF                            | learned by local plasticity [34] | 0.85 | 0.83 | 0.88
8 | Breast cancer | GRF                            | learned by local plasticity [17] | 0.90 | 0.88 | 0.92
9 | Iris          | GRF                            | learned by local plasticity [17] | 0.97 | 0.95 | 1.00
8 Conclusions
A layer of spiking neurons with random weights proved to be an efficient extractor of meaningful features for their subsequent processing by a linear classifier. Accuracies comparable to those of other existing spiking neural networks have been achieved on the tasks of classifying handwritten digits, the Wisconsin breast cancer dataset, and the non-linear task of Fisher's Iris. Thus, the proposed approach could allow constructing compact and stable models with a low number of trainable parameters that would be more suitable for implementation on a neuromorphic chip with hardware limitations.
Acknowledgements. This work has been supported by the Russian Science Foundation grant No. 23-11-00260 and has been carried out using computing resources of the federal collective usage center Complex for Simulation and Data Processing for Megascience Facilities at NRC "Kurchatov Institute", http://ckp.nrcki.ru/.
References 1. Paugam-Moisy, H., Bohte, S.M.: Computing with spiking neuron networks. Handbook of natural computing. 1, 1–47 (2012) 2. Rajendran, B., et al.: Low-power neuromorphic hardware for signal processing applications: a review of architectural and system-level design approaches. IEEE Signal Processing Magazine 36(6), 97–110 (2019)
3. Furber, S.: Large-scale neuromorphic computing systems // Journal of neural engineering 13(5), 051001 (2016) 4. Taherkhani, A., et al.: A review of learning in biologically plausible spiking neural networks. Neural Networks 122, 253–272 (2020) 5. Diehl, P.U., et al.: Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In: 2016 IEEE International Conference on Rebooting Computing (ICRC), 1–8. IEEE (2016) 6. Lee, J.H., Delbruck, T., Pfeiffer, M.: Training deep spiking neural networks using backpropagation. Frontiers in neuroscience 10, 508 (2016) 7. Tavanaei, A., et al.: Deep learning in spiking neural networks. Neural networks 111, 47–63 (2019) 8. Khacef, L., et al.: Spike-based local synaptic plasticity: A survey of computational models and neuromorphic circuits. arXiv preprint arXiv:2209.15536 (2022) 9. Davies, M., et al.: Loihi: a neuromorphic manycore processor with on-chip learning. Ieee Micro 38(1), 82–99 (2018) 10. Saïghi, S., et al.: Plasticity in memristive devices for spiking neural networks. Frontiers in neuroscience 9, 51 (2015) 11. Serrano-Gotarredona, T., et al.: STDP and STDP variations with memristors for spiking neuromorphic learning systems. Frontiers in neuroscience 7, 2 (2013) 12. Shvetsov, B.S., et al.: Parylene-based memristive crossbar structures with multilevel resistive switching for neuromorphic computing. Nanotechnology 33(25), 255201 (2022) 13. Diehl, P.U., Cook, M.: Unsupervised learning of digit recognition using spike-timingdependent plasticity. Frontiers in computational neuroscience 9, 99 (2015) 14. Querlioz, D., et al.: Immunity to device variations in a spiking neural network with memristive nanodevices. IEEE transactions on nanotechnology 12(3), 288–295 (2013) 15. Demin, V., Nekhaev, D.: Recurrent spiking neural network learning based on a competitive maximization of neuronal activity. Frontiers in neuroinformatics 12, 79 (2018) 16. Demin, V.A., et al.: Necessary conditions for STDP-based pattern recognition learning in a memristive spiking neural network. Neural Networks 134, 64–75 (2021) 17. Sboev, A., et al.: Solving a classification task by spiking neural network with STDP based on rate and temporal input encoding. Math. Metho. Appl. Sci. 43(13), 7802–7814 (2020) 18. Masquelier, T., Guyonneau, R., Thorpe, S.J.: Spike timing dependent plasticity finds the start of repeating patterns in continuous spike trains. PloS one. 3(1), e1377 (2008) 19. Sboev, A., Vlasov, D., Rybka, R., et al.: Modeling the dynamics of spiking networks with memristor-based STDP to solve classification tasks. Mathematics 9, 3237 (2021) 20. Yamazaki, K., et al.: Spiking neural networks and their applications: a review. Brain Sci. 12, 863–2022 (2022) 21. Velichko, A.: Neural network for low-memory IoT devices and MNIST image recognition using kernels based on logistic map. Electronics 9(9), 1432 (2020) 22. Alpaydin, E., Kaynak, C.: Optical recognition of handwritten digits data set. UCI Machine Learning Repository (1998) 23. https://scikit-learn.ru/example/the-digit-dataset 24. Street, W.N., Wolberg, W.H., Mangasarian, O.L.: Nuclear feature extraction for breast tumor diagnosis. Biomedi. Ima. Proc. Biomedi. Visualiz. SPIE 1905, 861–870 (1993) 25. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of eugenics 7(2), 179–188 (1936) 26. Gütig, R., et al.: Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci. 23(9), 3697–3714 (2003) 27. 
Yu, Q., et al.: A brain-inspired spiking neural network model with temporal encoding and learning. Neurocomputing 138, 3–13 (2014)
28. Wang, X., et al.: Mobile robots’ modular navigation controller using spiking neural networks. Neurocomputing 134, 230–238 (2014) 29. Burkitt, A.N.: A review of the integrate-and-fire neuron model: II. Inhomogeneous synaptic input and network properties. Biological cybernetics 95, 97–112 (2006) 30. Izhikevich, E.M.: Dynamical systems in neuroscience. MIT Press (2007) 31. Kudryashov, N.A., Rybka, R.B., Sboev, A.G.: Analytical properties of the perturbed FitzHugh–Nagumo model. Applied Mathematics Letters 76, 142–147 (2018) 32. Sboev, A., et al.: A spiking neural network with fixed synaptic weights based on logistic maps for a classification task. In: The 6th International Workshop on Deep Learning in Computational Physics, p. 10 (2022) 33. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: International conference on machine learning, pp. 115–123. PMLR (2013) 34. Sboev, A., et al.: Ensembling SNNs with STDP Learning on Base of Rate Stabilization for Image Classification. Brain-Inspired Cognitive Architectures for Artificial Intelligence: BICA* AI 2020: Proceedings of the 11th Annual Meeting of the BICA Society 11, pp. 446– 452. Springer International Publishing (2021)
A Spiking Neuron Synaptic Plasticity Model Optimized for Unsupervised Learning
Mikhail Kiselev1(B), Alexander Ivanitsky1, Dmitry Ivanov2, and Denis Larionov3
1 Chuvash State University, Moskovsky Prospect 15, Cheboxary 428018, Russia
[email protected]
2 Moscow State University, Leninskie Gory, Moscow 119991, Russia
3 Cifrum, Kholodilny Pereulok 3, Moscow 115191, Russia
[email protected]
Abstract. Learning in spiking neural networks is implemented through synaptic plasticity. The diversity of learning regimes suggests that different forms of synaptic plasticity may be most efficient for, for example, unsupervised and supervised learning. In the present paper, we formulate the specific requirements imposed on plasticity rules by unsupervised learning problems and construct a novel plasticity model satisfying these requirements. This plasticity model serves as the main logical component of the novel unsupervised learning algorithm called SCoBUL (Spike Correlation Based Unsupervised Learning). Keywords: Spike Timing Dependent Plasticity · Unsupervised Learning · Winner-takes-all Network
1 Introduction
Learning algorithms of traditional neural networks are based on the fact that these networks can be represented as smooth multi-dimensional functions: their output values depend smoothly on the input values as well as on their synaptic weights. This makes it possible to use gradient descent methods for their training, treated as optimization of their synaptic weights. The error backpropagation algorithm based on gradient descent is a well-known platform for the majority of approaches to learning in traditional neural networks. In contrast, spiking neural networks (SNN) are discrete systems by their nature. The "all-or-nothing" behavior of spiking neurons makes the direct application of gradient methods impossible. A more adequate approach to SNN learning is based on reproducing the synaptic plasticity principles observed in living neuronal ensembles, principles that utilize the basic concepts of SNN: asynchronous operation of neurons and the spiking nature of information exchange between them. These principles include the locality principle, stipulating that rules for synaptic weight adjustment can include only parameters of the activity of the pre- and post-synaptic neurons. A particular case of this general principle is the well-known Hebbian plasticity rule, in accordance with which
the synapses that frequently receive spikes a short time before postsynaptic spike generation are potentiated, while the synapses with uncorrelated pre-/post-synaptic activity are suppressed or left unmodified. The locality principle is very general and does not fix the exact relationship between weight dynamics and pre-/post-synaptic activity characteristics. For this reason, a plethora of very different SNN synaptic plasticity rules have been proposed, which makes the situation with SNN learning strikingly different from the uniform approach to learning in traditional neural networks. The majority of these rules can be considered generalizations of spike timing dependent plasticity (STDP), the synaptic plasticity rule experimentally observed in living neurons [1]. The pure "classic" STDP can hardly be used as a basis for implementing learning in SNN, especially in recurrent SNN, due to its inherent instability: in accordance with STDP, a potentiated synapse automatically gets more chances to be potentiated further, causing "runaway" network behavior. This made researchers invent STDP generalizations [2–4] leading to more balanced network dynamics. Further neurophysiological studies also showed that synaptic plasticity in different kinds of neurons in different brain regions is described by various plasticity models, sometimes deviating greatly from the classic STDP [5]. This fact enables us to think that different plasticity models are adequate for different tasks. In fact, even in the realm of traditional neural networks, network architectures and synaptic weight tuning algorithms are different for, say, supervised and unsupervised learning. While layered feed-forward networks are commonly used for supervised learning, for unsupervised learning flat networks with (implicit) lateral inhibition, like the Kohonen SOM [6], have proved their efficiency. This thesis determined the motivation for this study: we would like to find SNN plasticity rules satisfying the locality principle and optimized for the solution of one class of learning problems, namely, unsupervised learning. In the next Section, we formalize SNN unsupervised learning as a problem of finding spike frequency correlations. Further, we describe the novel synaptic plasticity model and the unsupervised learning algorithm SCoBUL based on it, and show that it fits the specific requirements imposed by this representation of the unsupervised learning problem. In Sect. 4, proof-of-concept experimental evidence of SCoBUL efficiency using an emulated DVS camera signal is presented.
2 Unsupervised Learning in the Spiking Domain
The problem of unsupervised learning in the most general case can be formulated as a search for certain features in a given dataset which distinguish it from a dataset with identical statistical parameters (such as mean or standard deviation) but where each value is generated by the respective random number generator independently of the other values. Nothing can be learnt from data where each individual value is produced by an independently working random number generator. The presence of some hidden structure – patterns which can be recognized by unsupervised learning algorithms – is indicated, in the general case, by an increased (or decreased) probability that certain values appear in certain places in the dataset compared to the situation when all data are completely
random. These probability deviations can be expressed in terms of correlations (or anti-correlations) between certain variables included in the dataset or calculated as derivative variables. All this remains valid for data represented as spike sequences, but in this case the data values are extremely simple – they are in fact Boolean (spike / no spike). Thus, we can say that the unsupervised learning problem for SNN can be formulated, without loss of generality, as a problem of detecting correlations between spike frequencies in input spike trains. Detection of anti-correlations is also covered by this approach due to the use of inhibitory neurons capable of implementing the logical operation NOT. It follows from the discussion above that, in the case of SNN, the problem of unsupervised learning can be represented as a problem of detecting spike frequency correlations for spikes emitted by input nodes or neurons of the network. Let us note that the time scales of these correlations may be different – from exact coincidence of firing times to concurrent elevation of mean spike frequencies measured over long time intervals. It should be noted that such an understanding of unsupervised learning is very natural for spiking neurons. Indeed, the most basic operation characteristic of all spiking neuron models is the detection of coinciding arrival of spikes at their synapses. Only when several synapses with sufficiently great weights receive spikes within a more or less narrow time window does the neuron fire, indicating this fact. In order to define the solved problem formally, we consider the following simplified but still quite general input signal model. We assume that the input signal is generated by N + 1 Poissonian processes. One of them works always and plays the role of background noise. Let us denote its intensity as p0. The other processes, numbered by the index i, are switched on randomly with the probability Pi and operate during the time interval ti. During this time interval a certain set of input nodes (we will call it a cluster) Ci emit spikes with the probability p0 + pi. This elevated activity of the cluster's nodes will be called a pattern. Evidently, the activity of the input nodes inside every cluster i is correlated, and the statistical significance of this correlation is determined by pi and by the number of activations ni of this cluster in the whole observed input signal. The goal of unsupervised learning is to teach the SNN to react specifically to these patterns in the input spike stream. Namely, due to the appropriate synaptic plasticity rules, a recognizing neuron should appear in the network for each cluster. This neuron should fire when and only when the respective cluster is active. Thus, our problem is parametrized by the value of p0 and by the N tuples (Pi, ti, Ci, pi) describing the clusters.
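To make this input model concrete, the following minimal sketch (our own illustration, not the authors' code) generates such a signal: every input node carries background Poissonian noise of rate p0, and each cluster Ci, when randomly activated, adds an extra rate pi to its nodes for ti time steps; all numeric values here are illustrative assumptions.

```python
import numpy as np

def generate_input(num_nodes=1200, num_steps=10_000, p0=0.03, num_clusters=5,
                   P=0.001, p=0.15, t=200, rng=np.random.default_rng(0)):
    """Boolean spike matrix (num_steps x num_nodes) for the cluster/noise model."""
    clusters = [rng.choice(num_nodes, size=40, replace=False)   # nodes C_i of cluster i
                for _ in range(num_clusters)]
    rate = np.full((num_steps, num_nodes), p0)                  # background noise p0
    for nodes in clusters:
        starts = np.flatnonzero(rng.random(num_steps) < P)      # random activations
        for s in starts:                                        # pattern: elevated rate p0 + p
            rate[s:s + t, nodes] += p
    return rng.random((num_steps, num_nodes)) < rate            # spike / no spike

spikes = generate_input()
print(spikes.shape, spikes.mean())   # overall fraction of (time step, node) pairs with a spike
```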
3 The Algorithm SCoBUL - Network, Neuron, Synaptic Plasticity
In this work, we describe a novel SNN unsupervised learning algorithm approaching the unsupervised learning problem from the perspective of spike frequency correlations. The algorithm is called SCoBUL (Spike Correlation Based Unsupervised Learning). An application of SNN to any problem related to learning includes three major logical components: the network architecture, the neuron model and the plasticity rule. The novelty of the present work and of the algorithm SCoBUL belongs mainly to the third component, while the network structure and the neuron model used here are quite common. Similar to the majority of studies devoted to unsupervised learning, we utilize the so-called winner-takes-all (WTA) network architecture [7]. It is a one-layer SNN where
every neuron is connected to a certain subset of input nodes (possibly, to all of them) through excitatory links and has strong lateral inhibitory projections to the other neurons. This structure can be considered a spiking analogue of Kohonen's self-organizing map [6], a very efficient architecture of traditional neural networks used in unsupervised learning tasks. The general idea of WTA is the following. Every neuron, due to the appropriately selected plasticity model, tries to detect sets of input nodes with coinciding activity periods. At the same time, a neuron that has successfully learnt to recognize such a group of correlated input nodes prevents recognition of the same group by the other neurons, blocking their activity by the inhibitory spikes it emits during activation of this group. Many extensions of this simple architecture have been proposed (for example, 2-layer WTA networks [8]) but, as was said above, our main direction is enhancement of the synaptic plasticity model. Only one important novelty related to network structure is introduced in this work: the network structure is variable – neurons may die and be born again (or migrate, if you like). A neuron may die if it is constantly inhibited and cannot fire for a long time. In this case, it is destroyed and re-created by the same procedure which was used to construct the original neuron population at the beginning of the simulation. Due to this feature, a neuron inhibited by its more successful neighbors, which have managed to recognize the most significant correlations in the input signal, gets a chance to be resurrected with a new combination of synaptic weights, which could help it to recognize some still "unoccupied", weakly correlated input node set. The neuron model utilized is also very simple, probably the simplest spiking neuron model used in research and applications – the leaky integrate-and-fire (LIF) neuron [9]. Its simplicity makes it efficiently implementable on modern digital (TrueNorth [10], Loihi [11]) and even analog (BrainScaleS [12], NeuroGrid [13]) neurochips. We also use the simplest synapse model – the current-based delta-synapse. When such a synapse receives a spike, it immediately increases (or decreases, if it is inhibitory) the neuron's membrane potential by a value equal to its weight. The SCoBUL synaptic plasticity model can be called a generalization of STDP, but it is modified in several directions. Below, we consider them and discuss how they help solve the unsupervised learning problem formulated in the previous section. Similar to the classic STDP, weight modifications in SCoBUL depend on the relative timing of pre- and post-synaptic spikes, and this dependence includes a temporal parameter τP determining the length of the time interval inside which pairs of spikes are considered interrelated and can change the synaptic weight. Describing the SCoBUL plasticity model below, we use the notion of a plasticity period. A plasticity period is a time interval of length 2τP centered at the moment of a postsynaptic spike, but only if this postsynaptic spike is emitted τP or more after the center of the previous plasticity period. It is important to note also that inhibitory connections are not plastic in this model.

3.1 Synaptic Resource
The classic form of STDP has an additive character: in accordance with STDP, the synaptic weight is increased or decreased by a certain value depending on the relative position of the pre- and post-synaptic spikes on the time axis.
If this rule is applied without any restrictions or corrections, it can easily lead to senselessly large positive (or
negative) weights due to STDP's inherent positive feedback. To prevent this, the values of the synaptic weights are artificially bounded by a certain value from above and by zero from below. This solves the problem of unlimited synaptic weights but causes another one – catastrophic forgetting. Indeed, let us imagine that the network was being trained to recognize something for a long time. As a result, the majority of synaptic weights became either saturated (equal to the maximum possible value) or suppressed (equal to 0). However, the presentation of even a few wrong training examples, or examples containing other patterns, or simply noise, is sufficient to destroy the learnt weight configuration, and nothing can prevent it. The network will forget everything it has learnt. In order to fight this problem, it was proposed in several earlier works [14] to apply additive plasticity rules to the so-called synaptic resource instead of the synaptic weight. The value of the synaptic resource W depends monotonically on the synaptic weight w in accordance with the formula
$$ w = w_{\min} + \frac{(w_{\max} - w_{\min})\,\max(W, 0)}{w_{\max} - w_{\min} + \max(W, 0)}. \qquad (1) $$
In this model, the weight values lie inside the range [w_min, w_max): while W runs from −∞ to +∞, w runs from w_min to w_max. When W is either negative or highly positive, synaptic plasticity does not affect the synapse's strength. Instead, it affects its stability – how many times the synapse should be potentiated or depressed to move it out of the saturated state. Thus, to destroy the trained network state, it is necessary to present a number of "bad" examples close to the number of "good" examples used to train it. It should be noted that this feature was found to be useful not only for unsupervised learning – we use it in all our SNN studies. Let us add that in the present research w_min is set equal to 0 everywhere.

3.2 Unconditional Synapse Depression
When a synapse receives a spike, its synaptic resource is decreased by the constant value d−, but this decrease can happen at most once inside any time window of length 2τP. We will see later why this simple rule is useful, when the other features of the SCoBUL model are discussed.

3.3 Constant Symmetric STDP
In our model, all presynaptic spikes arriving inside a plasticity period strengthen the synapse. However, a synapse can be potentiated at most once inside one plasticity period – by the spike coming first. It should be stressed that the relative order of pre- and postsynaptic spikes is not important. When a presynaptic spike comes just after a postsynaptic spike, it potentiates the synapse as well. Thus, this rule can be called symmetric STDP. Besides that, the value of the synaptic resource increment is the constant D+; it does not depend on the exact time difference between pre- and post-synaptic spikes.
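A minimal sketch of rules 3.1–3.3 for a single synapse is given below; it only shows the bookkeeping, and all the constants (d_minus, D_plus, tau_p) as well as the way a plasticity period is identified by an integer period_id are our own illustrative assumptions, not the authors' implementation.

```python
def weight_from_resource(W, w_min=0.0, w_max=1.0):
    """Rule 3.1: map the unbounded synaptic resource W to a weight in [w_min, w_max)."""
    r = max(W, 0.0)
    return w_min + (w_max - w_min) * r / (w_max - w_min + r)

class Synapse:
    def __init__(self, W0, d_minus=0.1, D_plus=0.2, tau_p=20.0):
        self.W = W0
        self.d_minus, self.D_plus, self.tau_p = d_minus, D_plus, tau_p
        self.last_depression = float("-inf")   # time of the last unconditional depression
        self.last_period = None                # plasticity period that last potentiated this synapse

    def on_presynaptic_spike(self, t, period_id=None):
        # Rule 3.2: unconditional depression by d_minus, at most once inside any
        # time window of length 2*tau_p (enforced by spacing depressions >= 2*tau_p apart).
        if t - self.last_depression >= 2 * self.tau_p:
            self.W -= self.d_minus
            self.last_depression = t
        # Rule 3.3: constant symmetric STDP - only the first presynaptic spike inside
        # the current plasticity period adds the constant D_plus, regardless of whether
        # it arrives before or after the postsynaptic spike.
        if period_id is not None and period_id != self.last_period:
            self.W += self.D_plus
            self.last_period = period_id

s = Synapse(W0=0.5)
s.on_presynaptic_spike(t=100.0, period_id=7)
print(s.W, weight_from_resource(s.W))
```

One convenient property of this split is that the costlier mapping in weight_from_resource need not be evaluated on every spike, in line with the hardware-friendly scenario discussed in Sect. 5.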
3.4 Suppression of Strong Inactive Synapses
This rule is a conceptually new addition to the classic STDP. It states that if a synapse with positive resource has not received a spike during the current plasticity period, it is depressed at the period's end by the constant D−.

3.5 Constant Total Synaptic Resource
The last important logical component of SCoBUL is the constancy of the neuron's total synaptic resource. Every time some synapse is potentiated or depressed, the resources of all other synapses are modified in the opposite direction and by the same value, calculated from the condition that the total synaptic resource of the neuron should remain the same.

Now, having described all logical components of the SCoBUL plasticity model, let us analyze and explain them from the point of view of the unsupervised learning problem formulated at the end of the previous section. Let us begin with rule 3.2. In conjunction with rule 3.5, it gives the following very useful effect. The classic STDP has many drawbacks, and one of them is the uselessness of silent neurons. Indeed, in the classic STDP model, the process of weight modification is bound to firing. The neurons which do not fire are not plastic. Therefore, if some neuron is silent because it is constantly inhibited by other neurons, it will stay in this state forever and will be just a useless burden, consuming computational resources but producing nothing. In our model, the combination of rules 3.2 and 3.5 gives the following effect. Activity of certain sets of input nodes makes some neurons fire. Inhibition from these active neurons forces the synapses of silent, inhibited neurons connected to the active input nodes to redistribute their synaptic resource to the other synapses, connected to less active input node groups. Even if these weak input node groups could not make any neuron fire in the initial network configuration, after this resource redistribution some silent neurons may accumulate in the respective synapses an amount of synaptic resource sufficient to fire. Thus, this process of "squeezing" synaptic resource out of active synapses into less active synapses helps the network recognize all correlations in the input spike streams – not only the most significant ones. The fact that the symmetric variant of STDP is more suitable for unsupervised learning than its classic asymmetric form is obvious. Indeed, the activity period of a correlated input node group may be long. Because of possible random delays of spikes inside this activity time window, some of them may appear earlier, some later. When a neuron learns to recognize this group, its synapses connected to these nodes are strengthened. Therefore, it begins to fire earlier when this input node group gets activated. But this means that, under asymmetric STDP, more of its synapses connected to these nodes would experience depression instead of facilitation, and further recognition improvement would become impossible. It is evident that the symmetric STDP does not face this difficulty. Finally, rule 3.4 solves another hard problem of unsupervised learning. It would be desirable that one neuron recognize one cluster, and that one cluster be recognized by one neuron. The situation when two neurons recognize the same cluster is cured by the introduction of stronger mutual inhibition – if the lateral inhibition is sufficiently strong, the state when several neurons recognize the same pattern is evidently unstable. The problem of recognition of several clusters by one neuron is much harder. Strong lateral
inhibition cannot help here – on the contrary, it can make this undesirable state more probable. The other rules considered above cannot help either. Rule 3.4 was designed specifically to fight this unpleasant scenario. Indeed, if a neuron recognizes clusters A and B, it means that its synapses connected to A and B are strong. Assume that A is active. The neuron fires and, therefore, according to rule 3.3, the connections leading to A are potentiated. However, the synapses connected to B have not received spikes during the respective plasticity period and are suppressed by rule 3.4. Thus, rule 3.4 makes the state when one neuron recognizes several independent correlated input node groups unstable. This discussion demonstrates how rules 3.1–3.5 help address various aspects and complications of the general problem of unsupervised learning, namely:
• to recognize strong and weak correlations by the same network (to prevent situations when there are no recognizing neurons for some clusters);
• to make recognizing neurons sufficiently specific (to prevent situations when some clusters are recognized by several neurons);
• to make recognition unambiguous (to prevent situations when several clusters are recognized by the same neuron).
In the next section, we consider the experimental confirmation of these theses on artificially generated data imitating a signal from a DVS camera.
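Before turning to the experiments, here is a minimal sketch of rules 3.4 and 3.5 applied to the resource vector of one neuron at the end of a plasticity period. The constant D_minus is an illustrative value, and applying the compensating redistribution of rule 3.5 once per period (rather than after every individual change) is our simplification; the Conclusion notes that such periodic application is admissible.

```python
import numpy as np

def end_of_plasticity_period(W, received_spike, D_minus=0.05):
    """Apply rules 3.4 and 3.5 to the synaptic resource vector W of one neuron.

    W              : resources of the plastic excitatory synapses
    received_spike : boolean mask, True for synapses that got a spike in this period
    """
    W = W.copy()
    total_before = W.sum()
    # Rule 3.4: strong (positive-resource) synapses that stayed silent during the
    # plasticity period are depressed by the constant D_minus.
    silent_strong = (~received_spike) & (W > 0)
    W[silent_strong] -= D_minus
    # Rule 3.5: keep the neuron's total synaptic resource constant by spreading the
    # opposite change evenly over the remaining synapses.
    others = ~silent_strong
    if others.any():
        W[others] += (total_before - W.sum()) / others.sum()
    return W

W = np.array([1.5, 0.8, -0.2, 0.4])
got_spike = np.array([True, False, False, True])
W_new = end_of_plasticity_period(W, got_spike)
print(W_new, np.isclose(W_new.sum(), W.sum()))   # total synaptic resource is preserved
```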
4 Experimental Comparison of STDP and SCoBUL Plasticity Rules on an Imitated DVS Signal
In order to evaluate the benefits of the SCoBUL plasticity compared to the standard STDP, we decided to select a task close to a real application of SNN, namely processing of the spiking video signal sent by a DVS camera. For this purpose, a software emulator of a DVS camera was created. To simplify and speed up the simulation experiments, we emulated a small camera field of view of 20 × 20 pixels. Three spike streams (channels) correspond to each pixel. The spike frequency in the first channel is proportional to the pixel brightness. The other two channels send a spike every time the pixel brightness increases or decreases by a certain value. Thus, the whole input signal includes 1200 spiking channels. The thresholds used to convert brightness and brightness changes into spikes are selected so that the mean spike frequency in all channels is close to 30 Hz. In these tests, we selected a very simple picture – a light spot moving in the field of view of this imaginary DVS camera in various directions and with various speeds. At every moment, the coordinates and speed of the light spot (i.e. the point it occupies in the 4-dimensional phase space) are known. The task is to determine them from the current activity of the WTA network neurons. More precisely, the procedure is the following. The whole emulation takes 3,000,000 time steps (we assume that 1 time step = 1 ms). The time necessary for the light spot to cross the DVS field of view is several hundred milliseconds. During the first 2000 s the network is trained. The next 600 s are used to determine the center of the receptive field of every neuron in the phase space. The last 400 s are divided into 40-ms intervals. For each interval, the real central position of the light spot in the phase space and the predicted value of this position are determined. The predicted position is the weighted mean of the neuron receptive field
centers, where the weight is the number of spikes emitted by the given neuron in this time interval. The mean squared distance between the real and predicted light spot positions over these time intervals serves as a measure of unsupervised learning success. A small value of this distance would be evidence that the network's neurons have learnt to recognize specific positions of the light spot in the phase space. The Euclidean metric in the phase space was scaled so that the standard deviation of each light spot coordinate over the whole simulation period would be the same for all coordinates. We performed this test with networks using the standard STDP and the SCoBUL plasticity. In order to make this competition fair, we used in both cases the same network parameter optimization procedure based on a genetic algorithm. The parameter variation ranges were the same or equivalent for both plasticity models; we also made sure that the optimum parameter values found were not close to the boundaries of the search space. To diminish the probability of an accidentally bad or good result, we took the criterion values averaged over 3 tests with the same hyperparameters but with different sets of initial synaptic weight values. The population size was 300, the mutation probability per individual equaled 0.5, and the elitism level was 0.1. The genetic algorithm terminated when a new generation did not show criterion improvement. The optimized (minimized) parameter was the mean squared distance between the real and predicted light spot positions divided by the mean squared distance between the spot positions and the centroid of all spot positions during the entire emulation. It is called the "Normalized mean squared distance" in Fig. 1, which shows the results obtained by the genetic algorithm. We see that 6 generations were required for the SCoBUL networks to reach criterion value stabilization, while in the case of STDP stabilization was reached earlier – after 4 generations.
[Figure 1: plot of the normalized mean squared distance (y-axis, from 0.5 to 0.85) versus the generation number of the genetic algorithm (x-axis, 1 to 6) for the STDP and SCoBUL plasticity rules.]
Fig. 1. The course of minimization of the light spot position determination error over sequential generations of the genetic algorithm for the STDP and SCoBUL synaptic plasticity rules.
We also see that the SCoBUL networks provide much more accurate light spot position/speed determination than the STDP networks. While these results should be considered preliminary and should be verified on other unsupervised learning tasks, the superiority of SCoBUL over the classic STDP in this case is beyond doubt.
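As a rough illustration of the evaluation criterion used above (our own sketch, not the authors' code), the predicted phase-space position can be computed as the spike-count-weighted mean of the neurons' receptive-field centers and then compared to the true positions; the array shapes and random test data are assumptions made only for the example.

```python
import numpy as np

def normalized_mse(rf_centers, spike_counts, true_positions):
    """rf_centers     : (n_neurons, 4) receptive-field centers in the phase space
    spike_counts   : (n_intervals, n_neurons) spikes per neuron in each 40-ms interval
    true_positions : (n_intervals, 4) real light-spot position/speed per interval"""
    w = spike_counts / np.maximum(spike_counts.sum(axis=1, keepdims=True), 1)
    predicted = w @ rf_centers                                  # weighted mean of RF centers
    err = np.mean(np.sum((predicted - true_positions) ** 2, axis=1))
    # normalization: spread of the true positions around their overall centroid
    base = np.mean(np.sum((true_positions - true_positions.mean(axis=0)) ** 2, axis=1))
    return err / base

rng = np.random.default_rng(1)
centers = rng.normal(size=(50, 4))
counts = rng.poisson(2.0, size=(100, 50))
truth = rng.normal(size=(100, 4))
print(normalized_mse(centers, counts, truth))   # close to 1 for an untrained (random) network
```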
5 Conclusion
In this paper, the problem of unsupervised learning in SNN was re-formulated as a problem of finding spike frequency correlations in input spike streams. Using this approach, and remaining inside the boundaries of the locality principle for synaptic plasticity rules, we propose a modification of the classic STDP model which optimizes it for unsupervised learning. Since our research project is oriented primarily toward the application of SNN to the processing of sensory data represented in spiking form and, as its most important particular case, to the processing of DVS-generated signals, we used an artificially generated "DVS signal" as a benchmark to compare the standard STDP-based WTA network and the SCoBUL network. It was found that the SCoBUL model gives significantly better results. This result can be evaluated as promising and as opening the way to further improvement of our model, although more exact and complete measurements of its properties and possible limitations are obviously needed. The other goal of this research is the creation of a hardware-friendly version of an STDP-like synaptic plasticity model. In SCoBUL, this goal is achieved due to the special scenario of application of rules 3.1–3.5. Some of these rules (3.2–3.4) are bound to pre- and post-synaptic spikes and therefore are applied very often. However, these rules are very simple – they have the form of addition or subtraction of certain constant values. Rules 3.1 and 3.5 (the synaptic resource renormalization and the calculation of the synaptic weight from the synaptic resource) include much more expensive operations like multiplication and division. However, it is admissible to apply them periodically with a sufficiently long period (say, once per second). Thus, in general, SCoBUL may have a more economical and/or faster implementation than the standard STDP, depending on the concrete processor architecture used.

Acknowledgments. The present work is a part of the research project in the field of SNN carried out by Chuvash State University in cooperation with the private company Cifrum and Kaspersky. Preliminary computations resulting in the creation of the SCoBUL algorithm were performed on the computers belonging to Kiselev, Cifrum and Kaspersky. Cifrum's GPU cluster was used for running the optimization procedure reported in Section 4. The SNN simulation package ArNI-X (author and owner – Kiselev) was used in this study.
References 1. Bi, G.Q., Poo, M.M.: Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. 18(24), 10464–10472 (1998) 2. Chen, J.Y., et al.: Heterosynaptic plasticity prevents runaway synaptic dynamics. J. Neurosci. 33(40), 15915–15929 (2013)
3. Zenke, F., Hennequin, G., Gerstner, W.: Synaptic plasticity in neural networks needs homeostasis with a fast rate detector. PLoS Comput. Biol. 9(11), e1003330 (2013)
4. Turrigiano, G.G.: The self-tuning neuron: synaptic scaling of excitatory synapses. Cell 135(3), 422–435 (2008)
5. Abbott, L.F., Nelson, S.B.: Synaptic plasticity: taming the beast. Nat. Neurosci. 3, 1178–1183 (2000)
6. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982). https://doi.org/10.1007/bf00337288
7. Maass, W.: On the computational power of winner-take-all. Neural Comput. 12(11), 2519–2535 (2000)
8. Kiselev, M., Lavrentyev, A.: A Preprocessing Layer in Spiking Neural Networks – Structure, Parameters, Performance Criteria. In: Proceedings of IJCNN-2019, Budapest, paper N-19450 (2019)
9. Gerstner, W., Kistler, W.: Spiking Neuron Models. Single Neurons, Populations, Plasticity. Cambridge University Press (2002)
10. Merolla, P.A., Arthur, J.V., Alvarez-Icaza, R.: A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6187), 668–673 (2014)
11. Davies, M., et al.: Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018)
12. https://brainscales.kip.uni-heidelberg.de/
13. Benjamin, B.V., et al.: Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proc. IEEE 102(5), 699–716 (2014). https://doi.org/10.1109/JPROC.2014.2313565
14. Kiselev, M.: Rate Coding vs. Temporal Coding – Is Optimum Between? In: Proceedings of IJCNN-2016, pp. 1355–1359 (2016)
Centre-Lateral Threshold Filtering as a Method for Neuromorphic Data Coding Viacheslav E. Antsiperov(B)
and Elena R. Pavlyukova
Kotelnikov Institute of Radioengineering and Electronics RAS, Moscow, Russian Federation [email protected]
Abstract. A new approach to the synthesis of methods for centre-lateral threshold non-linear filtering (coding) of images distorted by Poisson noise is considered. The approach is motivated by well-known mechanisms of human perception, in particular by the universal mechanism of lateral inhibition implemented by retinal interneurons. Accordingly, it is assumed that the synthesis is focused on a special representation of images by samples of counts with a controlled size (sampling representation). Based on the specifics of the sampling representation, the generative image model is concretized to a parametric probabilistic model represented as a system of receptive fields. This model allows a simple procedure for estimating the count probability density, which, in turn, is taken as the basis for optimal coding. Optimal coding is understood here in the traditional statistical sense as a procedure for compressing the registered random data, taking into account the available a priori information about the parameters of the sampling representation. Based on the Poisson statistics of the count numbers for the centre/surround of the receptive fields, an analytical form of the compression procedure is obtained. A simple approximation of the synthesized procedure, in the form of threshold discrimination of sharp intensity changes in the region of a receptive field, is proposed. It is demonstrated that in such an implementation the image coding (filtering) procedure turns out to be structurally similar to the actively developing noise-suppression algorithms based on shrinkage methods.

Keywords: Neural Image Coding · Neuromorphic Systems · Non-linear Filters · Sampling Representations · Receptive Fields · Shrinkage
1 Introduction
At present, machine learning (ML) methods focused on neural networks have achieved significant results. Their triumphant spread is perceived by the general public as a revolution, as the beginning of the era of artificial intelligence (AI) [1]. The most impressive solutions based on the ML platform have been achieved in natural language processing, automatic text translation, handwriting recognition, face detection in photos, etc. It should be noted that today the level of AI in these applications is quite comparable with the intelligence of an average person. As for the professional community, while it shares the high appraisal of the results obtained, its assessments of the methods by which they
are achieved are not so unambiguous. This is primarily due to the lack of a clear understanding of why the changes in the methods and approaches to AI, which occurred at the turn of the millennium and were often found empirically, led to the amazing effectiveness of these technologies. Recently, it has been mentioned quite often that the noted achievements can be associated with deeper and more adequate (conceptual) modeling in AI of some biological (human / higher mammal) brain mechanisms. Indeed, with the progress of modern ML technologies, every new stage of development added modelling of more and more new functions and elements – for example, the hierarchical architecture of the brain, deep reinforcement learning, the interactions of short- and long-term memory, a recurrent forward–backward connectivity of multiple interacting brain regions, etc. (see [2]). It should be noted that AI borrows from neuroscience not only structural models or functional blocks, but also computational models, including the processing of sensory signals, the coding of the related information and its neural representation. The focus of early AI systems on the "black box" approach often led to a poor understanding of the problems associated with incorrect or inefficient calculations. The recent trend towards an adequate understanding and application of neuroscience knowledge in the field of AI development has actually led to a number of breakthrough technologies. A striking example of such a breakthrough is the development of the neocognitron concept, proposed by Fukushima [3] and inspired, in turn, by the studies of Hubel and Wiesel [4] of the visual system periphery. The Fukushima neocognitron is based on a set of connected components called S-cells and C-cells, which model the simple and complex cells (neurons) of Hubel and Wiesel. S-cells and C-cells differ mainly in the types of their receptive fields (RF), which are usually understood as the set of a neuron's input synapses providing connections with receptors or other neurons in the network. S-cells are in the first layer of the model and make up the RFs of the C-cells in the second layer. The general idea is to implement the concept of "from simple to complex" and turn it into a computational model, e.g. for visual pattern recognition. However, as was later discovered, the concept of receptive fields is much more general than a particular mechanism for pre-processing images at the visual system input [5]. Perhaps this is why numerous RF models are now widely used in modern neural networks to map the response properties of neurons. This is especially true for convolutional neural networks (ConvNets), which have become the de facto standard for image processing and computer vision. A good modern overview of computational RF models and their operational characteristics can be found in [6]. This work is also devoted to the modeling of receptive fields as the main components of image coding systems that simulate the neuro-coding of retinal photoreceptor data in the periphery of the visual system. Note that in the traditional information-theoretic interpretation, the coding problem is understood as the problem of maximum data compression under given restrictions on distortions (losses). From the neurophysiological point of view, this problem implies matching the data compression (the output of the retinal neurons – RGCs) with the capacity of the transmitting channel – the optic nerve.
Since, as is known, optimal coding significantly depends on the statistics of the data (in terms of information theory – the source model), a necessary element of coding is the estimation of the corresponding probability distributions, which gives the coding problem the
character of a machine learning task. In a neurobiological context, this (unsupervised) learning is assumed to be learning from the same image data, i.e. it implies a kind of bootstrap approach. The two above interpretations of coding are closely related to a third one – the interpretation of coding as a sparse representation of data on the basis of their characteristic features. The latter is a general filtering problem, which implies non-linear, adaptive filtering focused on the local topology of the data. In the context of visual perception, this corresponds to the detection of sharp changes in intensity against a background of smooth illumination, i.e. enhancing image contrast. Thus, in view of the foregoing, it seems that data coding based on adequate modeling in the context of neurophysiological representations, including the RF representation, may well lead to the best results achieved in all of the above areas – information (rate/distortion) theory, machine learning practice, and non-linear image filtering. In this regard, the paper discusses a new approach to the problem of image coding based on centre-lateral adaptive RF filtering / neuromorphic data coding. It is based on the previously developed special representation of images by samples of counts (sampling representations) [7], simulating the retina's receptor data (photocounts). Since for sampling representations a complete statistical description is given by the product of the probability density functions of independent counts [8], the proposed approach is based, in essence, on the classical problem of probability density estimation [9]. In this paper, we restrict ourselves to the class of parametric estimation procedures [9], which imply a parametric family of probability distributions. Namely, it is proposed to use a family of distributions in the form of a mixture of components [10], treated as a model of a system of receptive fields (RF). Accordingly, the set of mixture weight estimates calculated from the sampling representation (input) is considered as the image coding sequence (output). In this regard, the synthesis of optimal coding is understood in the paper as the statistical procedure of maximum a posteriori (MAP) estimation [11]. The present research is a development of the previous investigations [7, 8].
2 Image Representation by a Sample of Counts and Its Encoding with a System of Receptive Field-Like Components
A detailed discussion and justification of the input data model (image) on the periphery of the visual system, the retina, is presented in an earlier paper [7] (see also [8]). Below, in order to recall the main results and fix the notation, only the general characteristics of the proposed model are given, without supplying additional argumentation and motivation. The main feature of the proposed image model (the ideal image [7]), in comparison with the classical (digital, bitmap) representation, is that the image is considered as a set of random events – counts (often called photocounts). The statistical description of such a representation is given on the basis of the concept of an ideal imaging device [8]. This device registers the radiation corresponding to the image, whose physical (not digitized) intensity we denote as I(x). The result of registering the intensity I(x) is a set of counts X = {xi}, where xi, i = 1, ..., N are the coordinates of the count registration events – random points within the photosensitive 2D surface of the ideal device. Note that the number of recorded counts N is itself a random variable, whose statistics is given by the Poisson
distribution with the mean parameter $\bar{N}$:
$$ \bar{N} = \int \alpha I(x)\, dx, \qquad (1) $$
where the coefficient α = η(hν)^{-1} depends on the average energy hν of the registered photons (h is Planck's constant, ν is the characteristic radiation frequency) and on the dimensionless coefficient η, the quantum efficiency of the photosensitive material of the imaging device. The set of N random counts {xi} can also be described by the probability distribution of the random points {xi} of some inhomogeneous point Poisson process (PPP) with intensity function λ(x) = αI(x). Since the number of samples N is a random variable, this description is inconvenient for practical use (especially for large N). Therefore, we proposed an image representation by a set of random points, as in the original Poisson process, but with a fixed (controlled) total number of points Ns ≪ N̄. We have also demonstrated [7, 8] that the statistics of a sample of counts Xs = {xj} of fixed (non-random) size Ns can be specified by a single distribution density of the form:
$$ \rho\big(X_s = \{x_j\} \mid I(x)\big) = \prod_{j=1}^{N_s} \rho\big(x_j \mid I(x)\big), \qquad \rho\big(x_j \mid I(x)\big) = I(x_j) \Big/ \int I(x)\, dx. \qquad (2) $$
According to the model described above, it was proposed to call the representation Xs = {xj}, j = 1, ..., Ns a sampling representation. It was specifically emphasized in [7, 8] that the sampling representation (2) most adequately simulates the data of the receptors (retinal rods/cones) in the outer layer of the retina (during preliminary image formation). It should be noted, however, that the spikes of the neurons of the retina's inner layer (RGCs) sent to the brain are not directly the data given by the photoreceptors but are formed with the help of many intermediate neurons (interneurons) of the middle and inner layers. As a result, output neurons aggregate counts from dozens and sometimes thousands of receptors located in small areas of the retina, called receptive fields (RFs). The systematic study of the RF structure and of the neural transformation (coding) of the input data from the receptors into a sequence of output retinal data was first undertaken in the fundamental works of Hubel and Wiesel [12]. A good contemporary presentation of the structure and functions of RFs can be found in the book [13]. The variety of functions and sizes of the various RFs is determined by the types of the associated ganglion cells (the output neurons of the retina, RGCs). There are about 20 types of RFs, but further, for simplicity, only the family of midget ganglion cells, encoding the spatial distribution of the intensity I(x) over the retina surface, is considered. Typical responses of midget cells to illumination / dimming of the corresponding RF are determined by their centre-antagonistic structure. The ON-cell, for example, is excited by stimulation of the centre of the RF and inhibited by stimulation of the concentric surround. On the contrary, the OFF-cell is excited upon stimulation of the RF surround and inhibited upon stimulation of its centre [12]. The presence of the two types of midget cells is due, in particular, to the peculiarities of neuronal coding of positive/negative intensity changes within the RF (the ON-cell is excited when the stimulation of the centre exceeds the average intensity over the field, the OFF-cell in the opposite case).
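To illustrate the sampling representation (2), the sketch below draws a fixed number Ns of count coordinates with probability proportional to the image intensity; discretizing the coordinates to pixels and the particular test intensity are our own simplifications, not part of the model in [7, 8].

```python
import numpy as np

def sampling_representation(intensity, n_counts, rng=np.random.default_rng(0)):
    """Draw n_counts pixel coordinates x_j with probability I(x_j) / sum(I), cf. Eq. (2)."""
    p = intensity.ravel().astype(float)
    p /= p.sum()
    flat_idx = rng.choice(p.size, size=n_counts, p=p)       # i.i.d. counts
    rows, cols = np.unravel_index(flat_idx, intensity.shape)
    return np.stack([rows, cols], axis=1)                    # (n_counts, 2) sample X_s

image = np.add.outer(np.hanning(64), np.hanning(64))         # smooth test intensity
X_s = sampling_representation(image, n_counts=5000)
print(X_s.shape)                                             # (5000, 2)
```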
As for the spatial distribution of RFs, it was found that neighboring ON- and OFF-cells have significantly overlapping fields, while the RFs of cells of the same type practically do not overlap. At the same time, non-overlapping RFs of both types closely adjoin each other, forming a kind of mosaic that tightly covers the entire retina [14]. Thus, for the mathematical formalization of the RF structure, it suffices to allow intersections of ON-fields only with neighboring OFF-fields and to prohibit intersections with ON-fields of the same type, allowing, however, their boundaries to touch each other. The same is true if one swaps the ON and OFF field types. In view of the noted symmetry, we will consider an RF system for only one type of fields, for example the ON type. Based on the previous brief review, we formalize the model for the count probability density ρ(x_j | I(x)) (2) as a parametric family of densities P = {ρ(x; θ) | θ ∈ Θ}, whose members are mixtures of K pairs of components {C_k(x), S_k(x)}, k = 1, ..., K (see [7, 8]):
$$ \rho(x; \theta) = \sum_{k=1}^{K} \big[ w_k C_k(x) + v_k S_k(x) \big], \qquad (3) $$
where θ = {w_k, v_k} are the positive mixture weights, which make up the set of model parameters, and the mixture components C_k(x) and S_k(x) are a compact centre and an antagonistic surround of the k-th RF. More precisely, the components C_k(x) and S_k(x) are positive probability distribution densities having compact carriers Ω_k^c = {x | C_k(x) > 0} and Ω_k^s = {x | S_k(x) > 0}, which together constitute the common carrier Ω_k of the k-th RF:
$$ \Omega_k = \Omega_k^c \cup \Omega_k^s: \qquad \int_{\Omega_k^c} C_k(x)\, dx = \int_{\Omega_k^s} S_k(x)\, dx = 1. \qquad (4) $$
The context of the introduced parameters θ ∈ Θ becomes clear if we recall the equivalence of the density ρ(x_j | I(x)) and the normalized version of the intensity I(x), fixed by description (2). Essentially, (3) defines the decomposition of the intensity form I(x) in terms of the set of local basis functions {C_k(x), S_k(x)}, as is done, for example, in wavelet or any other multi-resolution analysis [15]. Relations (4) in this context fix the normalization of the basis. The analogy can be continued further if we assume that the carriers of the centre and of the antagonistic surround of the k-th RF have no common intersection, Ω_k^c ∩ Ω_k^s = ∅. Then, to the normalization conditions (4), one can also add orthogonality-type conditions:
$$ \int_{\Omega_k^c} S_k(x)\, dx = \int_{\Omega_k^s} C_k(x)\, dx = 0. \qquad (5) $$
So, since any multi-resolution decomposition implies some encoding I(x) → θ ∈ Θ, the representation (3) can also be considered as an image RF-encoding ρ(x; θ) → θ. Moreover, we can obtain an explicit expression for the encoding results ρ(x; θ) → {w_k, v_k} if we remember that the set of RF carriers {Ω_k} forms a partition of the retina, i.e. they tightly cover the retina surface without intersections. With this in mind, we can move from the local orthonormality properties (4, 5) to their global counterparts, in which the integration of the components C_k(x), S_k(x) is carried out over the carriers Ω_l^c, Ω_l^s with arbitrary l (the
unity in (4) should be replaced by δ_lk – the Kronecker delta). This allows, just as in multi-resolution analysis, the parameters θ = {w_k, v_k} to be expressed in terms of the corresponding integrals of ρ(x; θ) over the appropriate carriers:
$$ w_k = \int_{\Omega_k^c} \rho(x; \theta)\, dx, \qquad v_k = \int_{\Omega_k^s} \rho(x; \theta)\, dx; \qquad (6) $$
thus, according to transformation (6), we can interpret the parameters w_k, v_k as the probabilities that a given count hits the centre Ω_k^c or the surround Ω_k^s of the k-th RF. As noted in [8], the neuro-physiological meaning of the RF transformation (6) becomes much deeper if we move from the parameters w_k, v_k to their linear combinations π_k = w_k + v_k and δ_k = w_k − σ v_k. To do this, we first extend the integration regions in (6) to the full surface by introducing the characteristic functions of the carriers, χ_k^c(x) = 1 iff x ∈ Ω_k^c and χ_k^s(x) = 1 iff x ∈ Ω_k^s, and then linearly combine the corresponding integrals to rewrite transformation (6) for the new parameters π_k, δ_k:
$$ \pi_k = \int \chi_k(x)\, \rho(x; \theta)\, dx, \qquad \delta_k = \int h_k(x)\, \rho(x; \theta)\, dx, \qquad (7) $$
where χ_k(x) = χ_k^c(x) + χ_k^s(x) is the characteristic function of the k-th RF carrier Ω_k = Ω_k^c ∪ Ω_k^s, and h_k(x) = χ_k^c(x) − σ χ_k^s(x) denotes the so-called center-on-surround-off (COSO) filter, having a constant positive center region embedded within a constant negative surround region [16]. This filter, due to its lower computational complexity, was proposed as an alternative to the Laplacian of a Gaussian (LoG) filter used by Marr and Hildreth [17] in their image edge detection algorithm. Both filters calculate the local Laplacian of the intensity / density ρ(x; θ) and, based on its zero-crossings, estimate the positions of the corresponding edges under the assumption that the second derivatives along and across the edge boundary are zero (as at an inflection point). The fact that, with an appropriate choice of the parameter σ, the coefficient δ_k is also a multiple of the Laplacian easily follows from (7). Indeed, assuming ρ(x; θ) is a smooth function on the carrier Ω_k, let us expand it around the carrier's center μ_k up to quadratic terms in the coordinate increments Δx = x − μ_k. Then, integrating over the (symmetric) Ω_k, we get:
$$ \delta_k = \int h_k(x)\, \rho(x; \theta)\, dx = \int_{\Omega_k^c} \rho(x; \theta)\, dx - \sigma \int_{\Omega_k^s} \rho(x; \theta)\, dx \approx \big(S_k^c - \sigma S_k^s\big)\, \rho(\mu_k; \theta) + \tfrac{1}{4}\Big(\langle \Delta x^2 \rangle_k^c - \sigma \langle \Delta x^2 \rangle_k^s\Big)\, \hat{L}\rho(\mu_k; \theta), \qquad (8) $$
where S_k^c and S_k^s denote the centre / surround areas of the k-th RF, ⟨Δx²⟩_k^c and ⟨Δx²⟩_k^s are the average squares of the increments Δx over Ω_k^c and Ω_k^s, and L̂ is the Laplace operator. It is easy to see that by choosing σ = S_k^c / S_k^s we get δ_k ∼ L̂ρ(μ_k; θ). So, Marr–Hildreth's edge detection procedure L̂ρ(μ_k; θ) = 0 becomes equivalent to detecting δ_k ≈ 0. Accordingly, if we consider that visual perception is highly sensitive to local contrasts, in particular to edges, then it becomes clear that neuromorphic RF-coding assumes transformation (7) rather than (6).
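As a numerical illustration of the COSO filter h_k and of relation (8) (our own sketch, with an arbitrary square RF geometry that is not taken from the paper), the centre-minus-weighted-surround sum over a smooth density behaves like a local Laplacian when σ is set to the centre-to-surround area ratio:

```python
import numpy as np

def coso_response(density, r0, c0, half_c=2, half_s=6):
    """delta_k of Eqs. (7)-(8) for one square RF centred at pixel (r0, c0):
    the centre is a (2*half_c+1)^2 block, the surround is the ring out to half_s."""
    field = density[r0 - half_s:r0 + half_s + 1, c0 - half_s:c0 + half_s + 1]
    centre = density[r0 - half_c:r0 + half_c + 1, c0 - half_c:c0 + half_c + 1]
    S_c = centre.size
    S_s = field.size - S_c
    sigma = S_c / S_s                      # the choice that makes delta_k ~ Laplacian
    return centre.sum() - sigma * (field.sum() - centre.sum())

# Smooth test density: a Gaussian bump. delta_k is clearly negative at the bump's
# peak (where the Laplacian is negative) and close to zero on the flat background.
y, x = np.mgrid[0:64, 0:64]
rho = np.exp(-((x - 32.0) ** 2 + (y - 32.0) ** 2) / 50.0)
print(coso_response(rho, 32, 32), coso_response(rho, 10, 10))
```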
An important note should be made regarding the above neurobiological interpretation of RF-coding. Since the zero crossings of δ_k contain the essential information, the main attention should be paid to this feature. However, RFs containing zero-crossings can be characterized in a dual way: as fields with small |δ_k|, or as pairs of adjacent RFs with large |δ_k| values of different signs. The second case seems preferable, because in the first case small values can be associated not only with edges but also with noise. Returning to the issue of encoding images with the parameters θ = {w_k, v_k}, we note that transformations (6) or (7) are not yet coding procedures themselves, since the probability density ρ(x; θ) under the integrals itself depends on the parameters. However, since we know the sample of counts Xs = {x_j} representing ρ(x; θ), we can use the standard technique presented, for example, in [18] for its evaluation. Namely, taking into account the law of large numbers and considering (6) as the averages of the characteristic functions χ_k^c(x) and χ_k^s(x) over the retina surface, we can estimate {w_k, v_k} by replacing transformation (6) with the sample (empirical) means:
$$ w_k \approx \frac{1}{N_s}\sum_{j=1}^{N_s} \chi_k^c(x_j) = \frac{n_k^c}{N_s}, \qquad v_k \approx \frac{1}{N_s}\sum_{j=1}^{N_s} \chi_k^s(x_j) = \frac{n_k^s}{N_s}, \qquad (9) $$
where n_k^c and n_k^s are the numbers of counts from Xs falling in the centre and the surround of the corresponding RF. It is easy to show that, within the framework of the approximation made, the parameters w_k, v_k (9) are indeed a probability distribution, in full accordance with the above-mentioned interpretation: they are all non-negative and satisfy the normalization condition
$$ \sum_{k=1}^{K} (w_k + v_k) = \frac{1}{N_s}\sum_{k=1}^{K} n_k = 1, \qquad (10) $$
where n_k = n_k^c + n_k^s is the total number of counts in the sample Xs = {x_j} falling into the region of the k-th RF. Note that the approximate values of the parameters (9) do not depend on the form of the components C_k(x) and S_k(x), but only on the form of their carriers Ω_k^c and Ω_k^s (more precisely, on the number of counts within their boundaries). Hence, it follows that for an approximate estimate of the probability density ρ(x; θ) only the numbers n_k^c and n_k^s of counts in the centers/surrounds of the receptive fields are needed. In other words, the sampling representation Xs = {x_j} of the image can be reduced (compressed, encoded) by the RFs to an "occupation number" representation Y_Ns = {n_k^c, n_k^s}, which in this context is a sufficient statistic of the sampling representation.
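A minimal sketch of this reduction to occupation numbers and of the estimates (9) is given below; the square RF mosaic and all sizes are our own toy assumptions used only to make the example runnable.

```python
import numpy as np

def occupation_numbers(counts, center_label, surround_label, n_rf):
    """counts: (Ns, 2) integer pixel coordinates of the sampling representation X_s.
    center_label / surround_label: 2D maps giving the index (0..n_rf-1) of the RF
    whose centre / surround a pixel belongs to, or -1 if none."""
    r, c = counts[:, 0], counts[:, 1]
    cen = center_label[r, c]
    sur = surround_label[r, c]
    n_c = np.bincount(cen[cen >= 0], minlength=n_rf)     # n_k^c
    n_s = np.bincount(sur[sur >= 0], minlength=n_rf)     # n_k^s
    Ns = len(counts)
    return n_c, n_s, n_c / Ns, n_s / Ns                  # occupation numbers and w_k, v_k

# Toy mosaic: four 32x32 square RFs tiling a 64x64 image; the inner 16x16 block of
# each RF is its centre, the rest of the square is its surround.
center_label = -np.ones((64, 64), dtype=int)
surround_label = -np.ones((64, 64), dtype=int)
for k, (r0, c0) in enumerate([(0, 0), (0, 32), (32, 0), (32, 32)]):
    surround_label[r0:r0 + 32, c0:c0 + 32] = k
    center_label[r0 + 8:r0 + 24, c0 + 8:c0 + 24] = k
    surround_label[r0 + 8:r0 + 24, c0 + 8:c0 + 24] = -1  # centre pixels are not surround

rng = np.random.default_rng(0)
X_s = np.stack([rng.integers(0, 64, 5000), rng.integers(0, 64, 5000)], axis=1)
print(occupation_numbers(X_s, center_label, surround_label, n_rf=4)[2:])   # (w_k, v_k)
```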
3 Centre-Lateral Threshold Filtering Based on the Partition of the Sampling Representation by a System of Receptive Fields
Based on the representation of the image by the count numbers Y_Ns = {n_k^c, n_k^s} registered in the centres / surrounds of the receptive fields, let us consider in more detail how these data can be used to encode the sampling representation Xs in order to transfer it to
subsequent higher levels for processing / analysis. Let us assume for simplicity that the RF system is homogeneous – all its fields are identical in structure and functions. Let a typical field on the retina surface have a round shape of area σ, in the middle of which is the centre C of area σ_c < σ, embedded within a concentric surround S of area σ_s = σ − σ_c. It is assumed that a typical RF has a set of simple functions: it can determine the total number of counts n belonging to its carrier and the number of counts n_c belonging to the centre, and it can calculate any of their linear combinations αn + βn_c (in particular, find the difference n_s = n − n_c). Due to the random nature of the counts, the numbers n, n_c and n_s are also random, although, in view of n = n_c + n_s, they are jointly dependent. It is easy to show that for the sampling representation Xs = {x_j} all of n, n_c and n_s are Poisson random variables. At this point, attention should be paid to the fact that, since the centre C and the surround S of the RF are assumed to be non-overlapping regions, the pair n_c, n_s is statistically independent. Accordingly, their probability distributions have the form
$$ P_c(n_c \mid \lambda) = \frac{(\sigma_c \lambda)^{n_c}}{n_c!}\, e^{-\sigma_c \lambda}, \qquad P_s(n_s \mid \mu) = \frac{(\sigma_s \mu)^{n_s}}{n_s!}\, e^{-\sigma_s \mu}, \qquad (11) $$
where λ and μ are the average count intensities at the centre C and at the surround S:
$$ \lambda = \frac{\alpha}{\sigma_c}\int_{C} I(x)\, dx, \qquad \mu = \frac{\alpha}{\sigma_s}\int_{S} I(x)\, dx. \qquad (12) $$
It is well known that the parameters λ and μ of the distributions (11) determine the count means, equal to σ_cλ and σ_sμ respectively. Vice versa, since n_c and n_s are estimates of these means, the simple linear combinations n_c/σ_c and n_s/σ_s are estimates of the count intensities λ and μ. In other words, in the model considered, an RF can form (unbiased) estimates of the intensities λ and μ (and of any of their linear combinations). Multiplying the probabilities (11), we obtain the joint distribution of the independent n_c and n_s. However, it is more convenient in the context of the current discussion to move from them to the random (now dependent) n_c and n = n_c + n_s, whose joint distribution also follows from (11):
$$ P(n_c, n \mid \lambda, \mu) = C_n^{n_c}\, p^{n_c} q^{\,n-n_c} \times P(n \mid \nu), \qquad (13) $$
where C_n^{n_c} = n!/(n_c!(n − n_c)!) is the binomial coefficient, P(n | ν) is a Poisson distribution analogous to (11), and the following notations are introduced:
$$ \nu = \frac{\sigma_c}{\sigma}\lambda + \frac{\sigma_s}{\sigma}\mu, \qquad p = \frac{\sigma_c \lambda}{\sigma \nu}, \qquad q = \frac{\sigma_s \mu}{\sigma \nu}, \qquad p + q = 1. \qquad (14) $$
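The factorization (13)–(14) is just the standard splitting property of independent Poisson variables, and it can be checked numerically; the parameter values in the sketch below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_c, sigma_s, lam, mu = 1.0, 3.0, 5.0, 2.0
sigma = sigma_c + sigma_s
nu = (sigma_c * lam + sigma_s * mu) / sigma       # Eq. (14)
p = sigma_c * lam / (sigma * nu)                  # success probability of the binomial factor

n_c = rng.poisson(sigma_c * lam, 100_000)         # Eq. (11): independent Poisson counts
n_s = rng.poisson(sigma_s * mu, 100_000)
n = n_c + n_s

mask = n == 10                                    # condition on a fixed total count n
print(n_c[mask].mean(), 10 * p)                   # conditional mean of n_c vs binomial n*p
print(n.mean(), sigma * nu)                       # the total count n is Poisson(sigma*nu)
```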
For a complete statistical description of the problem, it is necessary to choose an a priori joint distribution of the intensities λ and μ. We choose it as a mixture of two components:
$$ \rho(\lambda, \mu) = \omega\, \delta(\lambda - \mu)\, \wp(\mu) + (1 - \omega)\, \wp(\lambda)\, \wp(\mu), \qquad (15) $$
where the weights ω and 1 − ω can be interpreted as the probability of the null hypothesis H_0 that λ and μ are the same (more generally, dependent) and, accordingly, as the probability of the alternative hypothesis H̄_0 that λ and μ are independent. It follows from this
interpretation that ℘(λ) / ℘(μ) is the a priori unconditional probability distribution of the count intensity λ / μ. Based on the generative model (13, 15), standard statistical methods can be used to obtain the a posteriori distribution of λ and μ given the observed data Δ_+ = n_c − nσ_c/σ and n. Omitting intermediate transformations and approximations, here we present only the final result (details can be found in [19]):
$$ \rho(\lambda, \mu \mid \Delta_+, n) = \frac{P(\Delta_+, n \mid \lambda, \mu)\,\rho(\lambda, \mu)}{P(\Delta_+, n)} \approx \frac{\sigma_c \sigma_s \kappa^2}{\frac{1-\omega}{\omega}\Lambda_n^+(\Delta_+) + 1}\; \frac{\exp\!\big\{-\tfrac{(\Delta_+ - \zeta)^2}{2npq}\big\}}{\sqrt{2\pi npq}} \times P(n \mid \nu) \times \Big[\delta(\lambda - \mu)\,\wp(\mu) + \frac{1-\omega}{\omega}\,\wp(\lambda)\,\wp(\mu)\Big], \qquad (16) $$
where the likelihood ratio Λ_n^+(Δ_+) of the hypotheses H_0 and H̄_0 for the given Δ_+ and n is
$$ \Lambda_n^+(\Delta_+) = \frac{P(\Delta_+, n \mid H_0)}{P(\Delta_+, n \mid \bar{H}_0)} \approx \kappa\sigma \sqrt{\frac{\sigma_c \sigma_s}{2\pi n \sigma^2}}\, \exp\!\Big(-\frac{\Delta_+^2}{2 n \sigma_c \sigma_s / \sigma^2}\Big), \qquad (17) $$
and κ is the characteristic scale of the a priori average count intensity λ (or μ), which corresponds to the rough approximation ℘(λ) ≈ κ^{-1}. Using (16), the first moments (a posteriori means) of λ and μ can be found, which can be interpreted as their optimal a posteriori (MAP) estimates λ(Δ_+, n) and μ(Δ_+, n). Again omitting intermediate transformations and approximations (see [19]), we present the final result:
$$ \lambda(\Delta_+, n) = \frac{n}{\sigma} + \frac{1}{\frac{\omega}{1-\omega}\Lambda_n^+(\Delta_+) + 1}\, \frac{\Delta_+}{\sigma_c}. \qquad (18) $$
A similar result can also be obtained for μ(Δ_+, n), which, however, also follows from the identity σ_cλ + σ_sμ = n. In the tradition of classical coding theory, the estimate λ(Δ_+, n) (18) can be considered an optimal encoding of λ (and of μ via μ = n/σ_s − σ_cλ/σ_s), effectively combining the registered data Δ_+, n with the a priori model (15). In view of the relatively complex dependence of Λ_n^+(Δ_+) (17) on n and Δ_+, the implementation of the encoding procedure (18) also looks somewhat complicated. However, if we roughly approximate Λ_n^+(Δ_+) (17) by zero for large Δ_+ and by infinity for small Δ_+, we get a quite simple encoding algorithm:
$$ \lambda(\Delta_+, n) = \begin{cases} \dfrac{n}{\sigma}, & \Delta_+ < D\sqrt{n},\\[2mm] \dfrac{n}{\sigma} + \dfrac{\Delta_+}{\sigma_c} = \dfrac{n_c}{\sigma_c}, & \Delta_+ > D\sqrt{n}, \end{cases} \qquad (19) $$
suggesting that λ should be encoded by the average RF intensity n/σ at small Δ_+ < D√n and by the average intensity at the centre, n_c/σ_c, at large values. The value D in (19), defining the boundary Δ_+^* between small and large Δ_+, can be found from the condition (ω/(1−ω)) Λ_n^+(Δ_+^*) = 1 (see (17)):
$$ \big(\Delta_+^*\big)^2 = 2\,\frac{\sigma_c \sigma_s}{\sigma^2}\, n\, \ln\!\Big[\frac{\omega}{1-\omega}\,\kappa\sigma\, \sqrt{\frac{\sigma_c \sigma_s}{2\pi n \sigma^2}}\Big], \qquad (20) $$
whence, if we put n ∼ κσ under the sign of the logarithm, we obtain approximately
$$ D \approx \sqrt{\frac{\sigma_c \sigma_s}{\sigma^2}\Big[\, 2\ln\frac{\omega}{1-\omega} + \ln\frac{\sigma_c \sigma_s}{2\pi \sigma^2} + \ln(\kappa\sigma) \Big]}. \qquad (21) $$
The coding procedure (19) can be considered a ridge-type regression of the data [20], if we treat n/σ as a predictor of the estimate of the centre count intensity λ and δ_+ = Δ_+/σ_c = n_c/σ_c − n/σ as the residual of this assessment. The only difference of (19) from the LASSO procedure ("least absolute shrinkage and selection operator") proposed in [20] is the dependence of the threshold D√n on the total number of counts n in the RF. The latter is related to the correlation of the Poisson noise with the image values (intensities I(x)). In this regard, procedure (19) can be referred to as a non-linear centre-lateral threshold filter.
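A minimal sketch of the threshold encoding (19), with D computed from (21), is given below; it works on per-RF counts, the comparison is applied to |Δ_+| (our reading of the symmetric condition (20)), and all default parameter values (ω, κ, the areas) are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def threshold_coefficient(sigma, sigma_c, omega=0.5, kappa=25.0):
    """D from Eq. (21); the result is clipped at zero if the bracket goes negative."""
    sigma_s = sigma - sigma_c
    a = sigma_c * sigma_s / sigma ** 2
    bracket = (2 * np.log(omega / (1 - omega))
               + np.log(a / (2 * np.pi))
               + np.log(kappa * sigma))
    return np.sqrt(max(a * bracket, 0.0))

def centre_lateral_threshold_filter(n, n_c, sigma, sigma_c, D):
    """Eq. (19): encode the centre intensity lambda of each receptive field."""
    delta_plus = n_c - n * sigma_c / sigma            # Delta_+ = n_c - n*sigma_c/sigma
    lam = np.where(np.abs(delta_plus) < D * np.sqrt(n),
                   n / sigma,                          # small contrast: average over the whole RF
                   n_c / sigma_c)                      # large contrast: average over the centre
    return lam

n = np.array([100, 100, 100])
n_c = np.array([25, 60, 10])                           # here sigma_c/sigma = 0.25
D = threshold_coefficient(sigma=4.0, sigma_c=1.0)
print(D, centre_lateral_threshold_filter(n, n_c, sigma=4.0, sigma_c=1.0, D=D))
```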
4 Conclusions
The approach proposed in the paper, based on modeling well-known mechanisms of human perception, in particular the universal mechanism of lateral inhibition, turned out to be very promising. It opens up new possibilities for the synthesis of real non-linear image filtering algorithms aimed at neuromorphic data compression. The special representation of images (sampling representations) developed for these purposes made it possible, on the one hand, to avoid problems associated with the size of representations and, on the other hand, to open new possibilities for modeling a number of mechanisms of biological (human) intelligence. Note that the approach put forward can also be applied to the wider field of machine learning methods. It turned out that the approach has numerous, non-trivial connections with such areas of machine learning as anisotropic diffusion methods, wavelet shrinkage and variational methods, which have proved to be the best tools in the field of residual neural networks [21].

Funding. The present research was carried out at the expense of budgetary financing within the framework of the State Order at the Kotelnikov Institute of Radio-Engineering and Electronics of the Russian Academy of Sciences (State Contract "RELDIS").
References
1. Sejnowski, T.J.: The unreasonable effectiveness of deep learning in artificial intelligence. In: Proceedings of the National Academy of Sciences, vol. 117(48), pp. 30033–30038 (2020). https://doi.org/10.1073/pnas.1907373117
2. Hassabis, D., Kumaran, D., Summerfield, C., Botvinick, M.: Neuroscience-inspired artificial intelligence. Neuron 95(2), 245–258 (2017). https://doi.org/10.1016/j.neuron.2017.06.011
3. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4), 193–202 (1980). https://doi.org/10.1007/bf00344251
4. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurones in the cat's striate cortex. J. Physiol. 148(3), 574–591 (1959). https://doi.org/10.1113/jphysiol.1959.sp006308
5. Henley, C.: Foundations of Neuroscience, Open Pressbooks, East Lansing (2021) 6. Strisciuglio, N., Petkov, N.: Brain-inspired algorithms for processing of visual data. In: Brain-Inspired Computing: 4th International Workshop, BrainComp 2019, Cetraro, Italy, pp. 105–115. Springer-Verlag, Berlin, Heidelberg (2019). https://doi.org/10.1007/978-3-03082427-3_8 7. Antsiperov, V.E., Pavlyukova, E.R.: Neuromorphic image coding based on the partition of samples of counts by the system of receptive fields. In: XXIV International Scientific and Technical Conference “NEURO-INFORMATICS-2022”: collection of scientific contributions, pp. 14–24. MIPT, Moscow (2022). ISBN 978-5-7417-0823-1 (in Russian) 8. Antsiperov, V., Kershner, V.: Retinotopic image encoding by samples of counts. In: De Marsico, M., Sanniti di Baja, G., Fred, A. (eds.) Pattern Recognition Applications and Methods. In: ICPRAM 2021–2022. Lecture Notes in Computer Science, vol. 13822. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-24538-1_3 9. Scott, D.W.: Multivariate density estimation: theory, practice, and visualization, 2nd edn. Wiley, Hoboken, New Jersey (2015) 10. McLachlan, G.J., Lee, S.X., Rathnayake, S.I.: Finite mixture models. In: Ann. Rev. Stat. Appl., vol. 6(1), pp. 355–378 (2019). https://doi.org/10.1146/annurev-statistics-031017-100325 11. Murphy, K.P.: Machine learning: a probabilistic perspective. MIT Press, Cambridge, Massachusetts (2012) 12. Hubel, D.H., Wiesel, T.N.: Brain and Visual Perception: The Story of a 25-year Collaboration. Oxford University Press, New York (2004) 13. Schiller, P.H., Tehovnik, E.J.: Vision and the Visual System. Oxford University Press, Oxford (2015). https://doi.org/10.1093/acprof:oso/9780199936533.001.0001 14. Gauthier, J.L., Field, G.D., et al.: Receptive fields in primate retina are coordinated to sample visual space more uniformly. In: PLoS Biol, vol. 7(4), p. e1000063 (2009). https://doi.org/ 10.1371/journal.pbio.1000063 15. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. In: IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11(7), pp. 674–693 (1989). https://doi.org/10.1109/34.192463 16. Allebach, J., Wong, P.W.: Edge-directed interpolation. In: Proceeding of the 3rd IEEE International Conference on Image Processing, vol. 2, pp. 707–710 (1996). https://doi.org/10.1109/ icip.1996.560768 17. Marr, D., Hildreth, E.: Theory of edge detection. In: Proceedings of the Royal Society B: Biological Sciences, vol. 207(1167), pp. 187–217 (1980). https://doi.org/10.1098/rspb.1980. 0020 18. Donoho, D.L., Johnstone, J.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). https://doi.org/10.1093/biomet/81.3.425 19. Antsiperov, V.: New centre/surround retinex-like method for low-count image reconstruction. In: Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2023), SCITEPRESS, Lda, pp. 517–528 (2023). https://doi.org/10.5220/ 0011792800003411 20. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Statist. Soci. Series B (Methodological) 58(1), 267–288 (1996). https://doi.org/10.1111/j.2517-6161.1996.tb0 2080.x 21. Alt, T., Weickert, J., Peter P.: Translating Diffusion, Wavelets, and Regularization into Residual Networks. In: arXiv:2002.02753.2020 (2020). https://doi.org/10.48550/arxiv.2002. 02753
Neural Networks and Cognitive Sciences
Permanent Sharp Switches in Brain Waves During Spoken Word Recognition
Victor Vvedensky1(B), Vitaly Verkhlyutov2, and Konstantin Gurtovoy1
1 RNC Kurchatov Institute, Moscow, Russia
[email protected] 2 Institute of Higher Nervous Activity and Neurophysiology of RAS, Moscow, Russia
Abstract. We measured magnetic signals of the human brain during recognition of spoken words and consistently observed clear jumps in the rate of change of the signal amplitude. The curves can always be represented as a concatenation of piecewise linear time segments with abrupt changes of slope. The duration of these time segments is highly variable, though it follows a common dependence for all segments of the whole record. The brain signals during execution of a cognitive task resemble a relay system that switches continually while performing computations. We intend to reveal links between remote areas of the brain which synchronize the moments of abrupt switching. Keywords: Background Cortical Activity · Brain Waves · Sharp Kinks · Word Recognition
1 Introduction
Electric oscillations occur continually in the human brain, and the underlying neural processes can be monitored externally by electro- and magnetoencephalography (EEG and MEG). We studied perception of words using EEG [1] and observed that the brain oscillations change while the subject performs spoken word recognition. Jumps of the period of oscillations were seen both at the beginning and at the end of recognition. These sharp changes were clearly visible for subjects with a fairly regular alpha rhythm in the "eyes closed" state. The period of alpha waves was determined with high precision, in large part because the tips of the waves were most often sharp. We replicated this EEG experiment, recording the brain waves with an Elekta Neuromag multichannel magnetometer. The full description of this MEG experiment will be given elsewhere; here we discuss one specific aspect of these magnetic measurements which, we believe, deserves special attention.
2 Methods
Seven volunteers took part in the study. All subjects signed an informed consent. The study was approved by the local ethics committee of the Institute of Higher Nervous Activity and Neurophysiology of the Russian Academy of Sciences and was conducted following the ethical principles regarding human experimentation (Helsinki Declaration).
The measurements were made in the Center for Neurocognitive Research (MEG Center) at the Moscow City University of Psychology and Education. Magnetic brain responses were recorded with a sampling rate of 1000 Hz using a helmet-shaped whole-head magnetometer (Elekta Neuromag MEG system, 306 channels). Native hardware filters with a band pass of 0.1–330 Hz were used, and no additional filtering was applied to avoid distortion of the signal shape. General data processing was done using the Brainstorm software [2]. Stimuli and analyzed data (MEG, MRI) are available on the ZENODO resource [3]. During the measurement the participant was sitting in the magnetically shielded room. The eyes were closed to avoid blocking of the alpha rhythm. The stimuli were prerecorded words uttered by a human and delivered via earphones. The subjects were instructed to recognize the word they heard and to confirm recognition by a keystroke. Playback of the next word started 1 s after the button was pressed. In a single experiment every subject heard 40 words of the same sound duration, and this measurement lasted for about 80 s. Eight different Russian adjectives were presented in this sequence, 5 times each in random order. Three different sets of 8 adjectives with different word durations (680, 830, 910 ms) were presented to each subject. The words used in the described experiment are shown in Fig. 1: Russian adjectives meaning curly, twisted, winding, knitted, crimpy and braided. These words, of 680 ms duration, are semantically close. We analyzed offline the time segments of the recorded signals corresponding to the word recognition process. In this paper we describe the behavior of the cortical source generating the maximum signal over the head during the whole experiment. Let us call it the leader source. It is located under the sensor indicated in Fig. 1 by the arrow, which shows the direction of the electric current producing the measured signal. A dozen other sources of comparable strength are simultaneously active in other areas of the brain; they generate signals with a different time course. These sources are less frequently active and generate smaller currents, though their basic behavior is similar to that of the leader source, which we describe below. This behavior is also common to all our subjects.
3 Results: Sharp Changes of the Signal Time Course
Figure 1 shows magnetic signals recorded during 9 recognitions of spoken words. They were chosen from 40 recognition events during a single experiment and represent the whole set reliably. The signal course is individual for each event, and recognition times are scattered over a considerable range, 650 to 1250 ms. In many cases, the type of oscillation is far from a regular rhythm. This irregularity has been reported in many studies of brain waves [4, 5]. Traditionally the background electrical activity of the brain is presented as a combination of different frequency components or rhythms (alpha, beta, gamma, delta). However, we cannot decompose our data into clearly distinguishable frequency bands on the basis of the duration of cycles. For this subject, in the sensor indicated in Fig. 1, the period of oscillations (measured as the time between successive apex points of the waves) was highly variable during the process of word recognition. If the numerous small peaks intermixed between the larger ones are not neglected, the distribution of the
Fig. 1. MEG records of one subject during spoken word recognition. The subject listened to the words with closed eyes. Blue arrow indicates the sensor on the helmet where these most stable high amplitude signals were measured. Cyan strips show sound duration of the corresponding words and the record is cut off when the subject pressed the button confirming word recognition. The grid of vertical lines has a period of 100 ms. The record below at larger scale displays signal representation as a broken line with straight segments. Sharp changes in the slope of the line segments are clearly visible. The points are readings of the acquisition system.
oscillation periods is smeared over the considerable range shown in Fig. 2. The corresponding frequencies smoothly cover the 6 to 50 Hz range. This does not contradict the traditional division into habitual frequency ranges of brain oscillations, since power spectra take into account the amplitude of each cycle. We consider only the duration of the waves.
Fig. 2. Number of cycles in each 5 ms bin as a function of the oscillation period for the data presented in Fig. 1. During the whole measurement 471 cycles were recorded. The solid line shows the best-fit Gaussian distribution.
Close examination shows that the tops of the waves have sharp kinks, and such kinks are present in the entire record, not only at the tips of the waves. Any of our curves can be represented as a broken line composed of straight line segments, as shown in Fig. 1. Ascending slopes are red, descending slopes blue, and intermediary slopes green. Some kinks happen on the ascending or descending phase of oscillation. The process we observe is far from harmonic, but rather resembles abrupt switching of modes in a system of electronic relays. The representation of an EEG signal as a concatenation of linear segments for engineering purposes is called piecewise linear approximation and was used in the design of brain-computer interfaces for robot control [6]. We believe that this decomposition is not just an engineering trick, but reflects the operation of the brain's computing system. Any recorded signal curve consists of segments with a definite slope, which remains stable until a sharp change of the slope happens. Durations of these time segments are shown in Fig. 3-1 – they are highly scattered and usually shorter than the half-period of the dominant alpha waves. This means that many single oscillation cycles contain several sharp kinks, as we see in Fig. 1. However, we often see cycles with only one kink on the top of the wave. These oscillations are clearly triangular, sometimes skewed. Figure 3-2 shows how the time course of the signal changes at every kink. We see that all these changes follow a common dependence – usually an ascent turns into a descent of comparable rate and vice versa. The less numerous points in the two other quadrants indicate cases when only the rate changes.
Fig. 3. 1) Duration of each of the 1164 segments as a function of the linear slope of the signal amplitude variation with time. The slope value is given in relative units. 2) Slope change at the end of each segment as a function of the slope value before the kink.
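The piecewise-linear view described above can be illustrated with a simple greedy segmentation of a sampled signal; the tolerance value and the way a kink is declared in the sketch below are assumptions for illustration, not the procedure used by the authors.

```python
import numpy as np

def piecewise_linear_segments(x, fs=1000.0, tol=0.02):
    """Greedy split of a 1-D signal into straight segments.

    A new segment is started whenever a linear fit of the current segment
    no longer describes the samples within `tol` (same units as x).
    Returns a list of (start_index, end_index, slope) tuples."""
    segments, start, n = [], 0, len(x)
    for i in range(2, n):
        t = np.arange(start, i + 1) / fs
        slope, intercept = np.polyfit(t, x[start:i + 1], 1)
        residual = np.max(np.abs(x[start:i + 1] - (slope * t + intercept)))
        if residual > tol:                    # slope change: close the segment
            segments.append((start, i - 1, slope))
            start = i - 1
    if n - start >= 2:                        # remaining tail
        t_last = np.arange(start, n) / fs
        segments.append((start, n - 1, np.polyfit(t_last, x[start:], 1)[0]))
    return segments

# Example: a triangular wave sampled at 1000 Hz with kinks every 100 ms
t = np.arange(0, 1, 0.001)
signal = np.abs((t * 10) % 2 - 1)
print(len(piecewise_linear_segments(signal, fs=1000.0, tol=0.02)))
```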
4 Discussion
The observed splitting of the continuous signal into linear segments is well in line with the representation of rhythmic brain activity as a sequence of stereotyped episodes [7, 8]. The sequence of these episodes is believed to have an inherent piecewise stationary structure with abrupt transitions from one time segment to another. Our measurements extend the scale of this splitting to shorter time intervals. We observe that clearly visible switches occur not only during the transition from one episode (which could be several oscillations long) to another, but also inside each episode and even each cycle. This provides additional information on what the monitored brain area does during a particular episode. This is especially useful when the experiment is designed so that one can at least suggest what the subject is doing during the observed switches. The decision making is just an abrupt termination of a certain process in the brain. We consistently see simultaneous switches in different cortical locations and look forward to identifying areas which tightly cooperate in decision making. Our subjects recognized spoken words, though we believe that any other decision-making experiment can be analyzed in terms of the cortical distribution of simultaneous switches.
This study was supported by the Russian Science Foundation, grant no. 23-78-00011. The authors are grateful to Chernyshev B.V. and Prokofyev A.O. from the Center for Neurocognitive Research (MEG Center) at the Moscow City University of Psychology and Education and Martynova O.V. from the Institute of Higher Nervous Activity and Neurophysiology for their help in the experiment.
References
1. Vvedensky, V., Filatov, I., Gurtovoy, K., Sokolov, M.: Alpha rhythm dynamics during spoken word recognition. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research VI. NEUROINFORMATICS 2022. Studies in Computational Intelligence, vol. 1064, pp. 65–70. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_7
2. Tadel, F., Baillet, S., Mosher, J.C., Pantazis, D., Leahy, R.M.: Brainstorm: A User-Friendly Application for MEG/EEG Analysis. Computat. Intelli. Neurosci. 2011 (2011). ID 879716. https://doi.org/10.1155/2011/879716
3. Verkhlyutov, V.: MEG data during the presentation of Gabor patterns and word sets. https://zenodo.org/record/7458233. https://doi.org/10.5281/zenodo.7458233
4. Jones, S.R.: When brain rhythms aren't 'rhythmic': implication for their mechanisms and meaning. Curr. Opin. Neurobiol. 40, 72–80 (2016). https://doi.org/10.1016/j.conb.2016.06.010
5. Cole, S.R., Voytek, B.: Brain oscillations and the importance of waveform shape. Trends Cogn. Sci. 21(2), 137–149 (2017). https://doi.org/10.1016/j.tics.2016.12.008
6. Zhang, H.L., Lee, S., Li, X., He, J.: EEG self-adjusting data analysis based on optimized sampling for robot control. Electronics 9(6), 925 (2020). https://doi.org/10.3390/electronics9060925
7. Neymotin, S.A., et al.: Taxonomy of neural oscillation events in primate auditory cortex. eNeuro 9(4), ENEURO.0281-21.2022 (2022). https://doi.org/10.1523/ENEURO.0281-21.2022
8. Kaplan, A.Y., Fingelkurts, A.A., Fingelkurts, A.A., Borisov, S.V., Darkhovsky, B.S.: Nonstationary nature of the brain activity as revealed by EEG/MEG: methodological, practical and conceptual challenges. Signal Process. 85, 2190–2212 (2005). https://doi.org/10.1016/j.sigpro.2005.07.010
Cognitive Neuro-Fuzzy Control Systems
Lev A. Stankevich(B)
Peter the Great St. Petersburg Polytechnic University, Saint-Petersburg, Russia
[email protected]
Abstract. The work is devoted to the problems of developing cognitive neuro-fuzzy control systems. Systems are considered in which the cognitive functions of predicting and classifying the states of the environment of the control object are implemented. It is shown that existing classifiers can provide an accuracy of 60–80% for 4 classes of states. A new type of classifier based on a neuro-fuzzy network is proposed, which showed a state classification accuracy of at least 80%. Examples of using the classifier in cognitive control systems of robots are given. Keywords: cognitive functions · cognitive control · neuro-fuzzy networks · classification · prediction · robotic devices
1 Introduction
Currently, there is a wide variety of assistive devices that help people do their job better and even replace them in certain types of work. Among them are robotic devices (RD), such as robotic manipulators, mobile robots, bionic prostheses, wheelchairs or exoskeletons [1]. Such devices can be directly controlled by a human operator using commands to perform separate simple actions. Supervisory control is also possible, when the operator only generates a task that the device performs autonomously. In any case, a person or a control system must perceive and evaluate the state of the environment, which is used for subsequent decision-making. A distinction can be made between static and dynamic states. In the first case, the environment does not change during the control period, and in the second case, it can change. In the case of dynamic states of the environment, control can be implemented using the cognitive functions of classifying and predicting states. These functions are formed by learning and make it possible to realize the perception of information with the possibility of predicting the development of events. This makes it possible to form behaviors taking such predictions into account and to implement scenario actions of the device. Such systems can be called cognitive control systems [2, 3]. The implementation of cognitive functions is supposed to be based on cognions – real-time learning components that accumulate knowledge in an associative form and make decisions based on it by associative inference [3]. The cognions are built using neural networks or neurological means. Some variants of cognions were also implemented on spiking neuromorphic networks [4]. According to the author, cognions
on neuro-fuzzy networks are the most suitable for control purposes. Different variants of such networks are used in solving different problems of RD control [5, 6]. The research of the author and his colleagues in this direction showed that the best results were obtained by cognions based on network and cluster models [3]. Cognitive control systems make it possible to implement complex behavior of RD when working on the instructions of an operator in dynamic environments. A special option is the contactless control of devices, which can provide a better quality of life for people with disabilities [1, 7]. Direct or supervisory control of devices that help immobilized people can be implemented through brain-computer interfaces (BCIs). In this case, non-invasive BCIs based on electroencephalography (EEG) are most often used [8]. Such BCI-EEGs make it possible to detect some imaginary human commands and to control, for example, wheelchairs [9, 10]. The purpose of this work is to develop a cognitive control system of RD based on cognions built on neuro-fuzzy networks. Further, the paper considers the principles of cognitive control of RD in accordance with the state of the environment. It is shown that the cognitive functions of classification and prediction can be implemented with cognions based on a cluster neuro-fuzzy model. An example of the use of such cognions to control a robot when avoiding a dynamic obstacle is given. Another example is related to the implementation of the BCI-EEG, which performs the classification of spatio-temporal patterns of imaginary movement commands during contactless control of the robot.
2 Cognitive Control of Robotic Devices
When controlling the RD directly, the operator observes the environment himself or perceives it through sensors, i.e., registers the environment and classifies its state. In accordance with this, the brain generates control commands that are implemented in a contact way, for example, through remote controls, or contactlessly, for example, through a BCI-EEG. In the case of supervisory control, the system, which received a task from the operator in contact or contactless mode, perceives the environment with the help of technical vision, classifies it, and generates control signals so that the device performs purposeful actions according to the scenario specified by the operator. The state of a dynamic environment changes in time, and in order to control the RD, it is necessary not only to classify the current states, but also to predict their possible changes in time. The dynamic state of the environment is determined by a spatio-temporal pattern (STP) that integrates spatial patterns (SP) generated by sensor signals in a certain time interval at several points in the environment. In this case, the most effective is cognitive control of the device, based on the implementation of the cognitive functions of classifying and predicting changes in the state of its environment, as well as the formation of commands taking into account the results of these functions. Figure 1 shows that cognitive control using environment states can be implemented by several modules. The main one is the module that implements the cognitive function of STP classification. This module receives information from the module that converts sensor signals into a sequence of SPs. The result of the work of the classification module, in turn, is used by the module that implements the mapping of the state of the
environment corresponding to the current situation. Typically, the environment state stores situations from previous times so that the cognitive predictive function can decide how the environment state will change over time.
Fig. 1. Scheme of cognitive control (block diagram: sensors → converting signals into a set of SP → cognitive function of classification → mapping the state of the environment → cognitive function of prediction → transformation of STP to control commands → robotic devices; the classification and prediction modules have learning inputs).
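Read as software, the module structure of Fig. 1 is a simple processing pipeline; the sketch below only mirrors that structure (class and method names are illustrative assumptions, not part of the described system).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

@dataclass
class CognitiveController:
    """Mirrors Fig. 1: sensors -> SP conversion -> STP classification ->
    state mapping -> prediction -> command generation."""
    to_sp: Callable[[Sequence[float], float], tuple]   # sensor samples + time -> SP
    classify: Callable[[list], int]                    # STP (list of SPs) -> state class
    predict: Callable[[List[int]], int]                # state history -> expected next state
    command_of: Callable[[int], str]                   # expected state -> control command
    window: int = 10                                   # number of SPs in one STP
    sp_buffer: List[tuple] = field(default_factory=list)
    state_history: List[int] = field(default_factory=list)

    def step(self, sensor_samples: Sequence[float], t: float) -> str:
        self.sp_buffer.append(self.to_sp(sensor_samples, t))
        stp = self.sp_buffer[-self.window:]            # spatio-temporal pattern
        state = self.classify(stp)                     # cognitive function: classification
        self.state_history.append(state)
        expected = self.predict(self.state_history)    # cognitive function: prediction
        return self.command_of(expected)               # command for the robotic device
```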
3 Neuro-Fuzzy Implementation of Cognitive Functions
In this paper, it is proposed to realize the cognitive functions of classification and prediction using cognions based on the neuro-fuzzy network model described in [4]. The model is based on the use of a fuzzy logic approach, designed to work with inaccurately defined data, and a neural network approach for automatically tuning functions by examples. Combining these approaches allows us to automatically create a network of rules based on function examples. Note that the cognion on the cluster model is not initially intended for classification of time sequences [4]. This shortcoming is eliminated in the proposed version of the classifier by preliminarily converting the set of input signals into SP states bound to time. This is how the STP of the state is formed, which is fed to the inputs of the classifier. The advantage of the cluster model is that it allows increasing the efficiency of STP classification using a neuro-fuzzy network and clustering. The scheme of the STP classifier with one cognion is shown in Fig. 2. The main functions of the classifier components are: (1) converting continuous input signals into sets of SP tied to time; (2) associative transformation of STP; (3) clustering solutions after training; (4) evaluation of membership to the class of STP; (5) formation of the class index of STP.
Fig. 2. Scheme of the state classifier (block diagram: sensor signals x1(t), x2(t) → converting into sets of SP (x1, x2, t1), …, (x1, x2, tn) → associative transformation of STP → clustering solutions → evaluation of membership to the class of STP → forming the class index of STP → class index; with a learning input).
The main functions of the classifier components are described below.
The conversion of the input signals is performed for a selected time interval, in which the STP of the environment state should be determined. This interval is divided into n segments, in which SPs are formed, tied to time. The simplest way is to average the recorded values in the segment for each of the input signals. The resulting sets of SPs for all segments of the selected time interval form the STP of the state supplied to the input of the associative transformer. The transformation of each STP is aimed at evaluating the conformity of the current STP to certain environmental states. It is produced by an activator on a neuro-fuzzy network, which performs a non-linear transformation of multiple inputs into a single output. This transformation, formed by training, has the general form

Y = \operatorname{sign}(w_k)\,\bigvee_{i=1}^{m}\Bigl(w_i \bigwedge_{j=1}^{n}\mu_{i,j}(X_j)\Bigr),
where n is the number of inputs; m is the number of examples; w_i is the weight coefficient for the i-th term; and k is the number of the term whose value was decisive in the union operation. More specifically, each multidimensional term corresponding to example i of the m examples is represented by a set of membership degrees {μ_{i,j}(X_j), j = 1,…,n}, which is processed by a fuzzy-logical intersection operation (the minimum value of the set). The resulting value μ_i(X) is corrected by the weight (multiplied by the value w_i generated when setting up the mapping). The corrected values y_i = w_i μ_i(X), i = 1,…,m, are processed by a fuzzy-logical union operation (the maximum value of the set is determined), which results in the value Y. Such mapping is implemented by the Trans(x_1,…,x_n) procedure, which makes it possible to determine the value of the function by sequentially performing operations in accordance with the given formula. One can use the standard triangular membership function, but this can lead to gaps between examples and, as a result, to a significant deterioration of the results. To solve this problem, one can use the approach where neighboring terms are connected in such a way that the lower and upper boundaries of a term depend on the centers of the neighboring terms. Then the membership function can be represented as follows:

\mu_{ij}(X) =
\begin{cases}
w_k + (w_i - w_k)\,\dfrac{c_{kj} - X_j}{c_{kj} - c_{ij}}, & \text{for } c_{ij} < X_j < c_{kj},\\
w_l + (w_i - w_l)\,\dfrac{c_{lj} - X_j}{c_{lj} - c_{ij}}, & \text{for } c_{lj} < X_j < c_{ij},\\
0, & \text{else},
\end{cases}
where k is the number of the upper boundary term and l is the number of the lower boundary term. The model is trained by the normalization procedure. Having some training set {(X_i, Y_i)}, i = 1,…,m, we initialize the term centers with X_i and the term weights with Y_i. After that, the terms need to be merged, which requires the normalization procedure. Learning in such a model is carried out by setting the weight coefficients of the terms and the parameters of the membership functions that determine their boundaries and centers. Normalization is performed in two stages. First, for each of the examples, a multidimensional term is formed, the vertex of which is located at the point where the example is specified. The term boundaries are specified in the form of multidimensional rectangles in the space
of the system inputs. At the first stage, the boundaries of all rectangles coincide with the boundaries of the input parameter setting area. The second stage is to normalize the term boundaries in such a way that all examples are processed correctly. During initialization, the values l_ij, c_ij, h_ij and w_i are reset to zero, with a known number of input variables n and examples m. Next, the tuning parameters are initialized from the examples.
After setting all the examples, the correction of the term boundaries is carried out, the purpose of which is to take into account the mutual restrictions imposed by the terms on each other. Correction of the term boundaries is implemented by performing the following procedures:
- calculating the function of processing input information, Trans(x_1,…,x_n);
- calculating Ind(x_1,…,x_n) – the number of the example that gave the maximum output in the fuzzy-logical union operation when implementing the procedure Trans(x_1,…,x_n) for each example from the set {x_i → y_i; i = 1,…, m};
- correcting the parameters of the membership functions when entering each example.
The following is the pseudocode of the normalization algorithm.
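In outline, the loop can be sketched as follows (a minimal Python sketch; the membership shape and the concrete boundary-correction rule used here — shrinking the rectangle of a term that wrongly "wins" an example — are assumptions made for illustration and need not coincide with the original algorithm).

```python
import numpy as np

def trans(x, centers, weights, lo, hi):
    """Fuzzy inference: min over inputs, weighted, max over example terms."""
    m, _ = centers.shape
    mu = np.zeros(m)
    for i in range(m):
        inside = (x > lo[i]) & (x < hi[i])
        span = np.where(x >= centers[i], hi[i] - centers[i], centers[i] - lo[i])
        deg = np.where(inside, 1.0 - np.abs(x - centers[i]) / np.maximum(span, 1e-12), 0.0)
        mu[i] = weights[i] * deg.min()          # fuzzy intersection + weighting
    k = int(np.argmax(mu))                      # fuzzy union
    return np.sign(weights[k]) * mu[k], k       # value and Ind(x_1, ..., x_n)

def normalize(X, Y, shrink=0.9, max_iter=100):
    """Boundary correction: while an example is won by the wrong term,
    shrink that term's rectangle towards its own centre."""
    m, n = X.shape
    centers, weights = X.copy(), Y.astype(float).copy()
    lo = np.full((m, n), X.min() - 1.0)         # initial rectangles cover the whole input area
    hi = np.full((m, n), X.max() + 1.0)
    for _ in range(max_iter):
        changed = False
        for i in range(m):
            _, k = trans(X[i], centers, weights, lo, hi)
            if k != i:                          # wrong winner -> correct its boundaries
                lo[k] = centers[k] - shrink * (centers[k] - lo[k])
                hi[k] = centers[k] + shrink * (hi[k] - centers[k])
                changed = True
        if not changed:
            break
    return centers, weights, lo, hi
```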
As a result, the associative mapping of the required function is specified, i.e., a set of parameters of the membership functions of the terms corresponding to the examples and the weight coefficients l_ij, h_ij, c_ij, w_i for i = 1,…,m and j = 1,…,n is obtained. Typically, the training set is quite redundant, and in order to optimize memory and resources, it is necessary to reduce the number of examples stored in the model. This requires clustering the examples. For this, an agglomerative clustering procedure is used, which ends when the distance between the resulting clusters no longer falls below some given value. A set of SPs is supplied to the inputs of the cognion with each SP bound to its time t_i, and the current value of the output parameter y_i is formed at the output of the cognion. Next, an assessment is made of the proximity of this value to the given value
in the cluster of examples by which the cognion was trained. If this estimate is higher than the specified value, then the STP class index is formed.
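A sketch of this example-compression step with scikit-learn's agglomerative clustering is shown below; the distance threshold, the linkage and the use of cluster centroids as prototypes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def compress_examples(X, y, distance_threshold=0.3):
    """Merge stored examples whose mutual distance stays below the threshold,
    keeping one prototype (centroid input, mean output) per cluster."""
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit(X)
    prototypes, outputs = [], []
    for lbl in np.unique(clustering.labels_):
        mask = clustering.labels_ == lbl
        prototypes.append(X[mask].mean(axis=0))
        outputs.append(y[mask].mean())
    return np.asarray(prototypes), np.asarray(outputs)

# Example: compress a redundant training set of spatial patterns
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.05, size=(20, 2)) for c in (0.0, 0.5, 1.0)])
y = np.repeat([0.0, 0.5, 1.0], 20)
Xc, yc = compress_examples(X, y)
print(Xc.shape)   # far fewer prototypes than original examples
```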
4 Mobile Robot Control Experiments
To determine the possibility of controlling the movement of a mobile robot with the considered control system, two experiments were carried out. In both experiments, a control system based on the states of the environment was used, built on cognions that implement the functions of predicting and classifying states in accordance with the scheme shown in Fig. 1. In this case, only two information channels are used at the input of the system, and 4 cognions, which are built according to the scheme shown in Fig. 2. Each cognion is trained to evaluate the degree of recognition of one of the 4 STP environment states. In the process of work, the STP of the state with the maximum recognition degree is found, and for this state the index of the control command class is formed. These commands are further used in the robot's control unit to implement its movement by one step in the selected direction.
In the first experiment, on controlling the movement of a robot in a dynamic environment, a situation was considered when the robot moves along a given trajectory and a person appears to the right of it, moving across. It is required to avoid a collision with the person, using information about the change in their position during the movement. To make decisions about the possible movements of the robot, 4 states of the environment were defined: (1) when moving in the given direction, a collision is expected → Stop command; (2) no collision is expected when moving in the given direction → Forward command; (3) to pass in front of the obstacle, turn left → Left command; (4) to get around the obstacle from behind, turn right → Right command. With a known speed of the robot movement vr in the given direction, the STP of each state was determined by two parameters obtained at each moment of time from the sensor system: the direction angle α to the obstacle with respect to the given direction and the speed of the obstacle vo. The cognions were tuned to recognize each STP by reinforcement learning [11]. To obtain examples corresponding to each of the states, the movements of the robot and the obstacle were simulated with a random change of the recorded obstacle parameters in time. The resulting sets of examples, in which there were no collisions, were further used to train the cognions to recognize the corresponding STPs. Figure 3 shows solutions for controlling the robot in order to avoid a collision with a walking person. Correct decisions were achieved in 84 cases out of 100 control attempts when the speed of the person's movement changed up to 1.2 m/s.
In the second experiment, on contactless control of a mobile robot using a BCI-EEG, a variant of direct control by STPs of brain activity states recorded via EEG signals over a certain period of time from two channels was considered. EEG signals were recorded in real time using the Neuroplay-8 EEG device from Neurobotics with 8 electrodes (http://www.neuroplay/). EEG signals were recorded in the frequency band 0.53 Hz – 30 Hz. When analyzing EEG recordings, artifacts such as eye movements, slow and fast waves, and high-amplitude signals were excluded. Artifact-free EEG signals recorded from the sensorimotor area of the cerebral cortex were used for classification: two channels were selected for each class of movements, in which the most informative signals were observed.
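The decision rule stated above — select the environment state whose cognion reports the highest recognition degree and issue the corresponding command — can be sketched as follows; the cognion functions and the confidence floor below are placeholders, not values from the experiment.

```python
import math

COMMANDS = {0: "Stop", 1: "Forward", 2: "Left", 3: "Right"}

def choose_command(stp, cognions, min_degree=0.5):
    """stp: spatio-temporal pattern built from (alpha, v_o) samples over time;
    cognions: one recognition function per environment state, each returning
    a degree in [0, 1]."""
    degrees = [cognion(stp) for cognion in cognions]
    best = max(range(len(degrees)), key=degrees.__getitem__)
    if degrees[best] < min_degree:          # no state recognised confidently
        return None
    return COMMANDS[best]

# Dummy cognions: each scores how close the mean obstacle heading is to a
# preferred angle (purely illustrative, not the trained neuro-fuzzy models).
def make_dummy_cognion(preferred_angle_deg):
    def degree(stp):
        mean_alpha = sum(alpha for alpha, v_o in stp) / len(stp)
        return math.exp(-abs(mean_alpha - preferred_angle_deg) / 45.0)
    return degree

cognions = [make_dummy_cognion(a) for a in (0.0, 90.0, 45.0, 135.0)]
print(choose_command([(50.0, 1.0), (48.0, 1.1), (46.0, 1.2)], cognions))
```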
Fig. 3. Avoiding collision with a dynamic obstacle (schematic: the robot moves with speed Vr while a person crosses with speed Vo at direction angle α; left and right collision-avoidance paths are shown).
First, the cognions were trained on 4 classes of imaginary movements: raising the right or left arm, pressing the foot of the left or right leg. At the beginning, the subjects performed real movements in a given rhythm, and then only imagined the movements in the same rhythm. The training procedure took about an hour. Testing the trained classifier showed an average accuracy of 80% and a maximum accuracy of 85%. To control the model of a mobile robot in the Gazebo environment [12], a special software module was used that converts the recognized imaginary movements into control commands: pressing the foot of the left leg → Forward, pressing the foot of the right leg → Stop, raising the right arm → Right, raising the left arm → Left. For some time, the subjects learned to control the robot model by imagining the movements corresponding to the commands. The first attempts to control the movement of the robot showed that, for confident passage of the track, it was not enough to recognize a single attempt at imagining the selected action, and averaging over three attempts was required. The experiment involved 5 healthy subjects. Testing showed that all the subjects, after several training sessions, successfully controlled the robot.
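The mapping from recognised imaginary movements to robot commands, together with the aggregation over three attempts mentioned above, can be sketched as follows; the label names and the majority-vote rule are assumptions (the authors speak of averaging three attempts, which may be implemented differently).

```python
from collections import Counter

MOVEMENT_TO_COMMAND = {
    "left_foot": "Forward",
    "right_foot": "Stop",
    "right_arm": "Right",
    "left_arm": "Left",
}

def command_from_attempts(recognised_movements):
    """Aggregate three consecutive classifier outputs into one robot command.

    recognised_movements: list of three labels produced by the imaginary-
    movement classifier; a command is issued only if at least two agree."""
    if len(recognised_movements) != 3:
        raise ValueError("expected exactly three classification attempts")
    label, votes = Counter(recognised_movements).most_common(1)[0]
    if votes < 2:
        return None                       # no stable imagination detected
    return MOVEMENT_TO_COMMAND[label]

print(command_from_attempts(["left_arm", "left_arm", "right_foot"]))  # -> Left
```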
5 Conclusion
The development and application of control systems based on the state of the environment is currently relevant, primarily in the field of mobile robotics. In order to improve the efficiency of control in such systems, it is proposed to implement the cognitive functions of classification and prediction in a neuro-fuzzy basis. The considered version of the associative neuro-fuzzy transformer based on the cluster model allows effective solving of the problem of classifying spatio-temporal patterns, with prediction of changes in the state of the environment of the control object. This makes it possible to control the movement of robots in dynamic environments with significant changes in the state of the environment over time. The proposed version of the classifier on a neuro-fuzzy network showed a state classification accuracy not lower than 80%. An example of using the developed classifier
in the control system for avoiding collisions of a mobile robot with a moving obstacle has given a good result. Another example has shown the possibility of using the same version of the classifier in the brain-computer interface of a contactless control system for a mobile robot. Acknowledgements. The research is performed with support of the Russian Science Foundation, grant # 23-21-00287, https://rscf.ru/en/project/23-21-00287.
References
1. Gundelakh, F., et al.: Application of brain computer interfaces in assistive technologies. In: Proceedings of SPIIRAS, vol. 19, no. 2, pp. 277–301. Saint-Petersburg (2020). (in Russian)
2. Stankevich, L.A.: Cognitive structures and agents in control systems of intellectual robots. News of Artificial Intelligence 1, 41–55 (2004). (in Russian)
3. Stankevich, L.A.: Cognitive Systems and Robots, p. 631. Monograph. Polytechnic University Press, Saint-Petersburg (2019). (in Russian)
4. Gundelakh, F.V., Stankevich, L.A.: Robotic devices control based on neuromorphic classifiers of imaginary motor commands. Studies in Computational Intelligence, 1064 SCI, 71–76 (2023)
5. Kruglov, V.V., et al.: Artificial Neural Networks. Theory and Practice. Hot Line-Telecom, Moscow, 382 (2002). ISBN 5-93517-031-0 (in Russian)
6. Uchino, E., Yamakawa, T.: Soft Computing Based Signal Prediction, Restoration, and Filtering. In: Ruan, D. (ed.) Intelligent Hybrid Systems: Fuzzy Logic, Neural Networks, and Genetic Algorithms, pp. 331–351. Springer US, Boston, MA (1997). https://doi.org/10.1007/978-1-4615-6191-0_14
7. Gundelakh, F., Stankevich, L., Kapralov, N., Ekimovski, J.: Cyber-Physical System Control Based on Brain-Computer Interfaces. In: Arseniev, D.S., et al. (eds.) LNNS 95, pp. 1–12. Springer Nature, Switzerland AG (2020)
8. Kapralov, N.V., Nagornova, Zh.V., Shemyakina, N.V.: Methods for the classification of EEG patterns of imaginary movements. Informatics and Automation 20, 94–132 (2021). https://doi.org/10.15622/ia.2021.20.1.4 (in Russian)
9. Mishra, S., et al.: Soft, conformal bioelectronics for a wireless human-wheelchair interface. Biosens. Bioelectron. 91, 796–803 (2017). https://doi.org/10.1016/j.bios.2017.01.044
10. Smart Wheelchairs and BCI: Mobile Assistive Technologies. In: Pablo, D. (ed.) Academic Press, Elsevier (2018)
11. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA (2018)
12. Ackerman, E.: Latest Version of Gazebo Simulator Makes It Easier Than Ever to Not Build a Robot. IEEE Spectrum (2016)
Neurocognitive Processing of Attitude-Consistent and Attitude-Inconsistent Deepfakes: N400 Study
Eliana Monahhova(B), Alexandra N. Morozova, Dmitry A. Khoroshilov, Dmitry O. Bredikhin, Anna N. Shestakova, Victoria V. Moiseeva, and Vasily A. Klucharev
National Research University "Higher School of Economics", Russian Federation, Moscow, Russia
[email protected]
Abstract. The project examined behavioral and electrophysiological brain responses to auditory deepfakes that were created for two well-known speakers and investigated participants with pro-vaccination and anti-vaccination attitudes. Our analysis focused on EEG activity, taking into account the congruence or incongruence of internal attitudes and the participants' degree of analytical thinking, need for cognition and conformity. We found that the level of trust in the deepfakes was significantly influenced by the interaction between internal attitudes and speaker and by need for cognition. A higher negative evoked brain response, similar to the N600 component, was observed for the anti-vaccination group when the deepfake mismatched the public opinion of the speaker. Keywords: artificial intelligence · disinformation · deepfake · trust · EEG · ERP · N400 · N600
1 Introduction
Nowadays the mass distribution of fake content has acquired a significant scale and spread to various topics, including politics, economics, social problems, nomocracy, etc. One modern type of fake is a technology synthesizing video and audio content using artificial intelligence (AI) methods, often referred to as deepfake. With the development of neural networks, such methods of transforming faces and voices into fakes have become very popular [1]. They have likewise spread to the cognitive sciences, but the amount of deepfake research in this field is rather limited. The current study analyzes behavioral and brain responses to novel audio deepfakes dedicated to the topic of COVID-19 vaccination in Russia. More precisely, we tested whether the congruence of internal attitudes affects the level of trust in deepfakes and elicits an event-related potential (ERP, namely the N400) at the end of sentences uttered by deepfakes that violate participants' internal attitudes and the public opinion of the deepfake speakers. Thus, we assumed that the amplitude of the N400 component [2] should be on average more negative during the mismatch between the internal attitudes of the participants and
the public opinion of the (deepfake) speaker, as compared to the match between the internal attitudes of the participants and the public opinion of the (deepfake) speaker. To observe such a reaction, we used electroencephalography (EEG). As for the behavioral hypothesis, we suggested that participants should believe more in audio messages of deepfakes that correspond to their internal attitudes. We additionally focused on the effects of analytical thinking, need for cognition and conformity on the level of trust in deepfakes. To study the behavioral aspects, we integrated the Cognitive Reflection Test (CRT) [3], the Need for Cognition Scale (NFCS) [4] and the Conformity Scale (CS) [5].
2 Methodology
Participants. Based on the power analysis (effect size = 0,25) with the G*Power tool, 51 right-handed adult participants were recruited (26 women, 18–35 years old, median age = 21) by online advertisement. Part of the selected participants (23 people) belonged to the anti-vaccination group, having strong attitudes against COVID-19 vaccination in Russia; the other part (28 people) belonged to the pro-vaccination group, supporting COVID-19 vaccination in Russia. All participants had normal or corrected-to-normal vision and normal hearing, studied at a university or had at least one completed higher education in Russia, and none of them had a history of neurological impairments. Importantly, the participants had to be naïve and not aware of the fact that the experiment included deepfake materials. During the data analysis 11 participants were excluded due to extremely noisy EEG data. Excluding them from both the behavioral and the EEG analyses assured that the results were based on the same consistent sample. The final set of participants consisted of 40 people (18 females, median age = 21). The pro-vaccination group consisted of 25 people, the anti-vaccination group – of 15.
Stimuli. Two voice deepfakes were generated by neural networks using a text-to-speech (TTS) algorithm. They were based on the real speakers' voices, representing two opinion leaders with different views on vaccination against the COVID-19 virus in Russia. The audio deepfakes, in turn, had to broadcast positions opposite to the speakers' public opinion. Each voice deepfake involved 40 experimental phrases with unexpected endings (incongruent stimuli), e.g. "I believe that it is necessary to get vaccinated against flu and other seasonal viral diseases, but you do not need to get vaccinated against coronavirus!", and 40 control phrases (congruent stimuli), e.g. "Flu is a completely different matter, not like covid. We have been vaccinated against it and other seasonal diseases for a long time, this is understandable, and we need to trust such vaccines" (both phrases originally in Russian). The experimental and control phrases were presented pseudo-randomly, so that the participants' perception would not be biased. The experimental and control sentences were likewise accompanied by filler phrases, added with the aim of transforming the speech of each speaker into a logical and consistent narrative.
To create such materials, we utilized the voice recordings of both speakers to (i) create appropriate spectrograms from them and (ii) train the neural network to convert the predesigned text into deepfake voice.
Fig. 1. The diagram of TTS and speech synthesis system (Text processing/ Synthesis/ Speaker’s coding/ Vocoder).
The transformation model consisted of four neural networks. First, the initial text was converted by a grapheme-to-phoneme network (G2P). That is the process of applying rules to generate a pronunciation for a particular word and, in turn, to create a pronunciation dictionary for the neural network [6]. Next, the speech was transformed into vectors of attributes (represented as numbers). Later, according to the tensor synthesizer (see Fig. 1), the model created a spectral representation of the sound, which was converted into a new recording by a separate model (vocoder). To encode the deepfake speaker's speech, we applied the pre-trained CorentinJ (Real-Time-Voice-Cloning) model, and for G2P in particular – the retrained Russian version (Russian_G2P). The main target of the voice deepfake creation was to retrain the synthesis according to the speaker code to obtain the required version of the output spectrogram.
Procedure. Before the experiment, participants were instructed to watch several videos on the YouTube platform about the (deepfake) speakers, outlining their public opinion on the COVID-19 vaccination topic. Later, based on the inclusion criteria, participants were told that the goal of the research was to listen to audio recordings of the opinion leaders about the COVID-19 vaccination topic and to give their own feedback about this phenomenon and its topicality in the modern world. Thus, the experimental task consisted of two audio deepfakes generated by AI. Participants were asked to listen to the audio materials carefully, as they would have to answer questions about the materials' plot at the end of the experiment. At the end of each material, the participants had to fill in a 7-point scale to indicate how much they agreed with the following statements: "The words of the previous speaker seemed convincing to me", "I would share the information of the material with friends and relatives", "The previous speaker's position coincides with my internal attitudes", "I would believe the previous speaker because of his authority", "Words of the previous speaker caused some mistrust", and their level of trust was further assessed as the sum of points.
Questionnaires. At the end of the deepfake session, the participants were asked to complete three questionnaires, one after another: questionnaire #1 – the CRT test, which probed critical and analytical thinking; questionnaire #2 – the NFCS test, which assessed need for cognition; and questionnaire #3 – the CS test, which assessed the desire either to be guided by one's own experience and be a leader or to follow the majority. After passing all the stages of the experiment, participants underwent a debriefing procedure.
EEG recording. While the participants listened to the deepfake materials, EEG signals were recorded with 32 electrodes in the standard 10–20 montage, including 2 reference electrodes located behind the ears. The sampling rate was set to 500 Hz. Electrode impedances were kept below 15 kΩ. We mostly focused on the amplitude of the N400-like effect over 5 main electrodes (Oz, O1, O2, P4, P7), as we plotted the grand-averaged difference wave among all channels for both deepfake speakers (see Fig. 2 for Speaker X and Speaker Y1) and detected relatively significant differences on these electrodes in the time window from 600 to 700 ms after stimulus onset. Additional electrodes T3 and T4 were placed below the left eye and lateral to its outer canthus to control for vertical and horizontal eye movements and blinks. All offline signal processing and artefact correction of the EEG data was performed in MNE Python (v1.0.3). Prior to the analysis, the data was manually inspected to reject noisy segments and channels; the samples of these channels were interpolated based on the signals of the good sensors around them. To suppress artifacts caused by eye movements and blinks we used Independent Component Analysis (ICA) involving the infomax algorithm. The data was band-pass filtered (1–45 Hz) and notch-filtered at 50 Hz and then epoched into segments starting 500 ms before the stimulus onset and ending 1000 ms after. The prestimulus interval of −500 to 0 ms was used as a baseline.
1 Speaker X – a Russian doctor, in real life supporting vaccination, in the deepfake – opposing it. Speaker Y – a Russian actress, in real life against vaccination, in the deepfake – supporting it.
Fig. 2. Grand-averaged difference wave (incongruent minus congruent trials) among all channels (Speaker X & Speaker Y). The zero timepoint indicates the onset of the stimuli. X-axis – time (s), Y-axis – voltage (µV).
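A minimal MNE-Python sketch of the preprocessing chain listed above (band-pass and notch filtering, infomax ICA, epoching with a −500 to 0 ms baseline) is given below; the file name, event coding and the excluded ICA components are placeholders.

```python
import mne

# Placeholder file name; events 1/2 stand for congruent/incongruent endings.
raw = mne.io.read_raw_fif("subject01_deepfake_raw.fif", preload=True)
raw.filter(l_freq=1.0, h_freq=45.0)          # band-pass 1-45 Hz
raw.notch_filter(freqs=50.0)                 # suppress line noise

ica = mne.preprocessing.ICA(n_components=20, method="infomax", random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]                         # placeholder: components marked as ocular
ica.apply(raw)

# Assuming a stimulus channel; mne.events_from_annotations could be used instead.
events = mne.find_events(raw)
epochs = mne.Epochs(
    raw, events, event_id={"congruent": 1, "incongruent": 2},
    tmin=-0.5, tmax=1.0, baseline=(-0.5, 0.0), preload=True,
)
evoked_diff = mne.combine_evoked(
    [epochs["incongruent"].average(), epochs["congruent"].average()],
    weights=[1, -1],                         # incongruent minus congruent
)
```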
Statistical analysis. To obtain individual ERPs, all trial types were averaged separately for each experimental group (pro-vaccinator or anti-vaccinator groups), stimulus type (congruent trial or incongruent trial), and electrode (Oz, O1, O2, P4, P7). The grand
ERPs were calculated for all trial types. The difference waves were also calculated by subtracting the ERPs to the congruent stimuli from the ERPs to the incongruent stimuli. The significance of differences of the evoked response between experimental conditions and the correlations between neurophysiological and behavioral data were assessed in two ways – by a permutation F-test with 1D cluster level and by multi-factor Analysis of Variance (ANOVA). The p-values of the ANOVA assumption checks (namely, homogeneity of variances and normality) were higher than 0,05 (within the normal range). Homogeneity of variances was controlled by Levene's Test for Equality of Variances; normality, in turn, was checked by the Shapiro-Wilk Normality Test.
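Both analyses can be reproduced in outline with standard tools; in the sketch below the data are random placeholders, the permutation count follows the Results section, and the ANOVA formula is an assumption about how the factors were combined.

```python
import numpy as np
import pandas as pd
from mne.stats import permutation_cluster_test
import statsmodels.api as sm
import statsmodels.formula.api as smf

# ERP time courses per condition, shape (n_subjects, n_times); placeholders here.
rng = np.random.default_rng(1)
X_cong = rng.normal(0, 1, size=(15, 751))
X_incong = rng.normal(0, 1, size=(15, 751))

F_obs, clusters, cluster_pv, _ = permutation_cluster_test(
    [X_incong, X_cong], n_permutations=1024, tail=0, seed=1,
)

# Multi-factor ANOVA on the N400-like difference amplitude (placeholder table).
df = pd.DataFrame({
    "amplitude": rng.normal(size=80),
    "attitude": ["pro", "anti"] * 40,
    "speaker": ["X"] * 40 + ["Y"] * 40,
    "CRT": rng.integers(0, 4, 80),
    "NFCS": rng.normal(size=80),
    "CS": rng.normal(size=80),
})
model = smf.ols("amplitude ~ C(attitude) * C(speaker) + CRT + NFCS + CS", df).fit()
print(sm.stats.anova_lm(model, typ=2))
```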
3 Results
Having conducted the statistical analysis using the permutation F-test with 1D cluster level and the ANOVA method, we investigated the effects of attitude-consistent and attitude-inconsistent target words within deepfakes on the behavioral and neurophysiological level. We investigated whether there are significant differences in the N400-like amplitude indicators between congruent and incongruent stimuli for both deepfake speakers (speaker X and speaker Y). We also explored the interaction between the amplitude differences for both speakers in congruent and incongruent trials, the speakers (speaker X or speaker Y), participants' internal attitudes to vaccination against COVID-19 and individual differences in the CRT, NFCS and CS tests, and studied how participants' internal attitudes and individual parameters (CRT, NFCS, CS) may influence the level of trust in the deepfake materials.
Permutation F-test with 1D cluster level results. We conducted a permutation F-test with 1D cluster level for the main part of the experiment (number of permutations = 1024, number of jobs = 1, p-value < 0,05). The participants were divided into two groups by the attitude-to-vaccination factor (anti-vaccination group and pro-vaccination group). We compared amplitudes between congruent and incongruent target words for both speakers (speaker X and speaker Y) in the two groups (anti-vaccination and pro-vaccination) separately.
Anti-vaccination group. For the anti-vaccination group, we confirmed the significance of the differences between ERPs to congruent and incongruent words in the time window of 600–750 ms, with a higher negative ERP component in the response to the incongruent words for speaker X, indicating a violation of expectations based on his public opinion (see Fig. 3). However, for speaker Y we found no significant differences between ERPs to congruent and incongruent words.
Pro-vaccination group. We found no significant differences between ERPs to congruent and incongruent words for the pro-vaccination group for either speaker X or speaker Y.
ANOVA results. We investigated whether participants' attitudes (Factor 1), the speaker in the deepfake (either speaker X or speaker Y, Factor 2) and the interaction of these factors (Factor 1 × Factor 2) have a significant impact on the dependent variable – the N400-like difference wave amplitude (incongruent versus congruent trials for speaker X and speaker Y, separately) in the time window from 600 to 750 ms, as Fig. 3 suggested. We subsequently added the scores in the CRT (Factor 3), NFCS (Factor 4) and CS (Factor 5) tests as additional factors. Thus, we investigated whether such differences were statistically
significant by conducting a multi-factor ANOVA (the N400-like difference amplitude as the dependent variable (DV); attitude, speaker, attitude × speaker, and the CRT, NFCS and CS scores as independent variables (IV); see Table 1).
Fig. 3. Grand-averaged ERPs to congruent trials (line 1) and incongruent trials (line 2) for speaker X in the anti-vaccination group. X-axis – time (s), Y-axis – voltage (µV). Threshold of 4.18; min = 0,000005, max = 14,17.
Table 1. The results of the first ANOVA analysis (DV – the N400-like difference amplitude)
Factors              Df   Sum Sq   Mean Sq   F value   P value
Attitude             1    3,71     3,71      0,46      0,5
Speaker              1    48,32    48,32     5,97      0,02
CRT                  1    9,22     9,22      1,13      0,29
NFCS                 1    4,78     4,78      0,59      0,44
CS                   1    47,87    47,87     5,91      0,02
Attitude × Speaker   1    55,33    55,33     6,83      0,01
Residuals            73   591,12   8,1
Thus, the factor speaker (Factor 2) and the interaction between attitude and speaker (Factor 1 × Factor 2) are significant for the N400-like amplitude difference (p = 0,02 for Factor 2; p = 0,01 for Factor 1 × Factor 2). In addition, the level of conformity (CS score) likewise demonstrated a significant effect on the dependent variable (p = 0,017 for Factor 5). Next, we observed whether the interaction between attitudes and speaker (Factor 1 × Factor 2) and the CRT (Factor 3), NFCS (Factor 4) and CS (Factor 5) test scores affect
the level of trust (7-point scale score) in the provided materials (see Table 2). Table 2 demonstrates that the interaction between attitude and speaker (Factor 1 × Factor 2) and need for cognition (NFCS) significantly affect the level of trust in each deepfake (either speaker X or speaker Y).
Table 2. The results of the second ANOVA analysis (DV – the trust score)
Factors                      Df   Sum Sq    Mean Sq   F value   P value
Attitude                     1    65,33     65,33     1,58      0,21
Speaker to trust             1    994,05    994,05    24,07     5,48
NFCS                         1    680,32    680,32    16,47     0,0001
CS                           1    0,76      0,76      0,02      0,89
CRT                          1    1,59      1,59      0,04      0,84
Attitude: Speaker to trust   1    134,67    134,67    3,26      0,08
Residuals                    73   3015,27   41,31
Next, we calculated the mean level of trust in speaker X and speaker Y for both groups – the pro-vaccination group and the anti-vaccination group. Participants from the anti-vaccination group tended to trust speaker X more (mean = 21,53) than speaker Y (mean = 11,13). Pro-vaccinators likewise trusted speaker X more (mean = 20,72) than speaker Y (mean = 15,68). Overall, our intermediate results suggest that the behavioral hypothesis has been partly confirmed. We initially suggested that people should believe more in audio deepfakes matching their internal attitudes, and the multi-factor ANOVA analysis showed that anti-vaccinators believed speaker X, who was against the vaccination in the deepfake material, significantly more than speaker Y. Additionally, pro-vaccinators believed speaker X significantly more than speaker Y. Thus, the level of trust is moderated by the interaction between attitudes and speaker and by need for cognition.
4 Conclusion
We found that both groups (pro- and anti-vaccinators) believed more in the deepfake of speaker X than in the deepfake of speaker Y. The level of trust was significantly affected by the interaction between attitudes and speaker and by the degree of need for cognition. Anti-vaccinators showed a significantly higher negative evoked response to incongruent target words of the deepfake with speaker X, similar to an N600-like effect. Such an ERP is likewise sensitive to words and auditory stimuli reflecting semantic and attitude mismatch [7]. The ERP correlates of the expectation violation were affected by the speaker, participants' internal attitudes and conformity level.
Funding. The work is financially supported by Strategic Project of HSE TK-109 "Human autonomy: theoretical modeling of proactive action in teaching transformational processes in socioeconomic reality and analysis of the mechanisms of individual stability in the field of perception of misinformation and psychological costs".
References
1. Thies, J., et al.: Face2face: real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395 (2016)
2. Kutas, M., Federmeier, K.D.: Thirty years and counting: finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology 62, 621–647 (2011)
3. Frederick, S.: Cognitive reflection and decision making. J. Econ. Perspec. 19(4), 25–42 (2005)
4. Cohen, A.R., Stotland, E., Wolfe, D.M.: An experimental investigation of need for cognition. The J. Abnor. Soci. Psychol. 51(2), 291 (1955)
5. Mehrabian, A., Stefl, C.A.: Basic temperament components of loneliness, shyness, and conformity. Soci. Beha. Person. Int. J. 23(3), 253–263 (1995)
6. Rao, K., et al.: Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4225–4229. IEEE (2015)
7. Cummings, A., et al.: Auditory semantic networks for words and natural sounds. Brain Research 1115(1), 92–107 (2006)
Real-Time Movement-Related EEG Phenomena Detection for Portable BCI Devices. Neural Network Approach
A. Kh. Ekizyan(B), P. D. Shaposhnikov, D. V. Kostulin, D. G. Shaposhnikov, and V. N. Kiroy
Research Center of Neurotechnologies, Southern Federal University, Prospekt Stachki 194, Rostov-on-Don 344090, Russia
[email protected]
Abstract. In recent years, interest in brain-computer interfaces (BCI) and their potential applications has grown. However, despite their potential benefits, there are still many challenges which should be solved before BCIs can be widely used outside of laboratory conditions. One of the key issues is the real-time discrimination of movement-related EEG phenomena, which is essential for the use of portable EEG devices in everyday life. In this study, different machine learning approaches with preliminary statistical and spectral feature extraction were compared in the classification of movement-related artifacts. The dataset in this research was obtained from an experiment with a portable EEG device of our own development. The tested methods demonstrated high accuracy, up to 80 percent, in a 7-class discrimination task.
Keywords: brain-computer interface · portable EEG · movements · machine learning · CNN
1 Introduction
The non-invasive brain-computer interfaces (BCI) make a direct communication pathway between the brain’s electrical activity and external devices using EEG signals. Existing today BCI systems are mostly developed and handled in laboratory conditions. They have several properties which made them difficult for everyday using [1]. One of the problems of portable EEG registration is different distortions of signal called artifacts which can be caused by physiological and technical reasons. Main EEG artifacts are resulted by movement activity such as rotating of head, eye blinking, tightening of teeth, etc. Identification of movement-related artifacts is essential problem for developing BCI systems based on portable EEG. There are several works introducing methods for recognition and removal signal artifact of different nature for multichannel EEG [2]. Basically, the methods used are based on signal decomposition, c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 157–164, 2023. https://doi.org/10.1007/978-3-031-44865-2_17
Fig. 1. Representation of three dimensional matrix of epoched data and example of data set format. Columns naming: feature channel
such as ICA and PCA, which require high computational resources [3–5]. Machine and deep learning methods are also used [6, 7], but they show better accuracy only on large datasets. This study focuses on the classification of movement-related artifacts appearing during recording with the portable EEG device described above, together with a few mental states (neutral, relaxation, cognitive) treated as informative signal. The considered classification methods allow artifact detection to be performed in real time.
2 Methods

2.1 Experiment
According to the experimental instructions, 25 subjects performed specified tasks during EEG recording. The experiment was conducted on two days separated by a one-week interval. The tasks were performed in the following sequence: neutral state, relaxation state, cognitive state, head rotating, blinking, teeth tightening, calm walking. In the neutral state, the subject sat with open eyes doing nothing. During the relaxation task, subjects were asked to close their eyes and try to relax. The cognitive task consisted of tuning a radio receiver. All recordings were 1 min long except for the cognitive state, which lasted 5 min. The recording channels correspond to the Fp1, Fp2, O1, and O2 electrodes of the 10–20 system.

2.2 Analysis
Feature Extraction. To recognize patterns of different states, a feature extraction procedure was applied. The raw EEG signal was divided into so-called epochs, i.e., time segments of defined length. In our case, the signal of every state and artifact was divided into one-second epochs.
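As an illustration of this step, the sketch below slices a multichannel recording into one-second epochs and computes two simple per-channel features of the kind used in this study (standard deviation and mean band power) with NumPy/SciPy. The paper itself relies on the mne-features package for the full feature set; the function names and constants here are ours.

```python
import numpy as np
from scipy.signal import welch

SFREQ = 250          # sampling rate of the portable EEG, Hz
EPOCH_LEN = SFREQ    # 1-second epochs -> 250 samples
BANDS = [(1, 7), (8, 13), (14, 30), (30, 70)]  # Hz, as in Sect. 2.2

def epoch_signal(raw, epoch_len=EPOCH_LEN):
    """Cut a (n_channels, n_samples) recording into (n_epochs, n_channels, epoch_len)."""
    n_channels, n_samples = raw.shape
    n_epochs = n_samples // epoch_len
    trimmed = raw[:, :n_epochs * epoch_len]
    return trimmed.reshape(n_channels, n_epochs, epoch_len).transpose(1, 0, 2)

def epoch_features(epoch, sfreq=SFREQ):
    """Simple per-channel features of one epoch: standard deviation and mean band power."""
    feats = [epoch.std(axis=1)]                            # statistical feature
    freqs, psd = welch(epoch, fs=sfreq, nperseg=epoch.shape[1])
    for lo, hi in BANDS:                                   # spectral features
        mask = (freqs >= lo) & (freqs <= hi)
        feats.append(psd[:, mask].mean(axis=1))
    return np.concatenate(feats)                           # one row of the feature table

# toy usage on a fake 4-channel, 60-s recording
raw = np.random.randn(4, 60 * SFREQ)
epochs = epoch_signal(raw)                                 # (60, 4, 250)
X = np.vstack([epoch_features(e) for e in epochs])         # (60, n_features)
```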
Since the sampling frequency of our portable EEG is 250 Hz, each epoch contains 250 samples. The resulting three-dimensional matrix can be represented as a cube (Fig. 1). The feature extraction included the calculation of statistical characteristics (standard deviation), spectral characteristics (the average value of the power spectrum in the frequency ranges 1–7, 8–13, 14–30, and 30–70 Hz, respectively, and correlations between power spectra), and the Higuchi fractal dimension in each epoch using the mne-features package [8]. A Pandas dataframe containing all subjects’ epochs in all states was created (Fig. 1).

Higuchi Fractal Dimension. In the context of EEG analysis, the Higuchi fractal dimension can be interpreted as a measure of signal complexity. It is commonly used in classification tasks involving mental states [9]. According to the algorithm [10], a new time sequence is calculated from each epoch:

$$X_k^m : x(m),\; x(m+k),\; x(m+2k),\; \dots,\; x\!\left(m + \operatorname{int}\!\left[\tfrac{N-m}{k}\right]k\right) \tag{1}$$
where $m = 1, 2, \dots, k$ and $k = 1, 2, \dots, k_{max}$. The length $L_m(k)$ is calculated for every $X_k^m$ as shown:

$$L_m(k) = \frac{1}{k}\left[\left(\sum_{i=1}^{\operatorname{int}\left[\frac{N-m}{k}\right]} \bigl|x(m+ik) - x(m+(i-1)k)\bigr|\right)\frac{N-1}{\operatorname{int}\left[\frac{N-m}{k}\right]k}\right] \tag{2}$$
where $N$ is the length of the original time series. The mean value of $L_m(k)$ over $m$ is found:

$$L(k) = \frac{\sum_{m=1}^{k} L_m(k)}{k} \tag{3}$$

The resulting feature is measured as the slope of the least-squares linear best fit of $\ln(L(k))$ versus $\ln(1/k)$:

$$HFD = \frac{\ln(L(k))}{\ln(1/k)} \tag{4}$$
The $k_{max}$ parameter is taken as 10 by default in mne-features.
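For reference, a minimal NumPy sketch of Eqs. (1)–(4) is given below; the mne-features package used in this study provides an equivalent, optimized implementation, and the function name here is ours.

```python
import numpy as np

def higuchi_fd(x, kmax=10):
    """Estimate the Higuchi fractal dimension of a 1-D signal (Eqs. (1)-(4))."""
    x = np.asarray(x, dtype=float)
    N = x.size
    lnL, lnk = [], []
    for k in range(1, kmax + 1):
        Lm = []
        for m in range(1, k + 1):                    # m = 1..k, Eq. (1)
            n_max = int((N - m) / k)                 # number of usable steps
            if n_max < 1:
                continue
            idx = m - 1 + np.arange(n_max + 1) * k   # samples x(m), x(m+k), ...
            dist = np.abs(np.diff(x[idx])).sum()
            norm = (N - 1) / (n_max * k)             # normalization term of Eq. (2)
            Lm.append(dist * norm / k)
        lnL.append(np.log(np.mean(Lm)))              # Eq. (3): mean over m
        lnk.append(np.log(1.0 / k))
    # Eq. (4): slope of the least-squares fit of ln L(k) versus ln(1/k)
    slope, _ = np.polyfit(lnk, lnL, 1)
    return slope

# toy usage on a 1-s epoch sampled at 250 Hz
epoch = np.random.randn(250)
print(higuchi_fd(epoch, kmax=10))
```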
2.3 Models
Several machine learning and neural network approaches were used for classification. To check that there are significant differences among the states in the data, a K-nearest neighbors classifier from the scikit-learn library was applied first. The chosen parameters were K = 5 and the Euclidean distance metric. The testing accuracy was 54.58%, which suggests that higher results can be achieved with other methods.
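A minimal scikit-learn sketch of this baseline is shown below, assuming a feature matrix X (one row per epoch, as in Fig. 1) and a 7-class label vector y; the 70/30 split mirrors the evaluation described in Sect. 3, and the variable names are ours.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X: (n_epochs, n_features) array of extracted features, y: 7-class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # K = 5, Euclidean distance
knn.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```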
Fig. 2. Combined input model with concatenated output
Layer (type)         Output shape   Activation function
input                (None, 56)     —
Dense                (None, 135)    tanh
BatchNormalization   (None, 135)    —
Dense                (None, 7)      softmax

Fig. 3. Proposed perceptron with hidden layer architecture. The first layer receives the feature vector calculated according to Sect. 2.2. Precise description of the perceptron model layers.
1. The random forest classifier is a popular machine learning algorithm used for classification tasks. The following parameters were selected: the number of trees in the forest n_estimators = 1000, the number of features to consider when looking for the best split max_features = ‘auto’, the maximum depth of a tree max_depth = 15, and the split quality function criterion = ‘entropy’. 2. A multi-layer perceptron with the architecture shown in Fig. 3 was built using the Keras framework. The number of neurons in the hidden layer and the other parameters were found by grid search cross-validation.
Fig. 4. Considered CNN model.

Table 1. Precise list of CNN’s layers

Layer (type)                   Output shape       Activation function
conv2d (Conv2D)                (None,1,247,256)   ReLU
batch_normalization_1          (None,1,247,256)   —
max_pooling2d (MaxPooling2D)   (None,1,124,256)   —
dropout_1 (Dropout)            (None,1,124,256)   —
conv2d_1 (Conv2D)              (None,1,124,128)   ReLU
batch_normalization_2          (None,1,124,128)   —
max_pooling2d_1 (MaxPooling)   (None,1,62,128)    —
dropout_2 (Dropout)            (None,1,62,64)     —
conv2d_2 (Conv2D)              (None,1,62,64)     ReLU
batch_normalization_3          (None,1,31,64)     —
max_pooling2d_2 (MaxPooling)   (None,1,31,64)     —
dropout_3 (Dropout)            (None,1,31,32)     —
conv2d_3 (Conv2D)              (None,1,31,32)     ReLU
batch_normalization_4          (None,1,16,32)     —
max_pooling2d_3 (MaxPooling)   (None,1,16,32)     —
dropout_4 (Dropout)            (None,1,16,16)     —
conv2d_4 (Conv2D)              (None,1,16,16)     ReLU
batch_normalization_5          (None,1,8,16)      —
max_pooling2d_4 (MaxPooling)   (None,1,8,16)      —
flatten                        (None,128)         —
Dense                          (None, 7)          —
3. A basic convolutional neural network with the architecture shown in Fig. 4 was tested on raw data (input matrix dimensionality: channels × samples). The full list of layers and activation functions is given in Table 1. 4. A combined input model uniting the MLP and the CNN (Fig. 2) is proposed. To make use of both the extracted features and the raw data, the previous methods were united in one model; a sketch of this architecture is given below.
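The following Keras sketch illustrates the combined input model: an MLP branch on the 56 extracted features (Fig. 3) and a CNN branch on the raw 4 × 250 epoch, concatenated before the 7-class softmax output. Only the first convolutional block follows Table 1 exactly; the deeper layers, pooling sizes, and dropout rate shown here are illustrative rather than the exact configuration used in the study.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_CHANNELS, N_SAMPLES, N_CLASSES = 56, 4, 250, 7

# MLP branch on the extracted features (architecture of Fig. 3)
feat_in = keras.Input(shape=(N_FEATURES,), name="features")
m = layers.Dense(135, activation="tanh")(feat_in)
m = layers.BatchNormalization()(m)

# CNN branch on the raw epoch (first block as in Table 1; deeper blocks abbreviated)
raw_in = keras.Input(shape=(N_CHANNELS, N_SAMPLES, 1), name="raw_epoch")
c = layers.Conv2D(256, kernel_size=(4, 4), activation="relu")(raw_in)   # -> (1, 247, 256)
c = layers.BatchNormalization()(c)
c = layers.MaxPooling2D(pool_size=(1, 2))(c)
c = layers.Dropout(0.25)(c)
c = layers.Conv2D(128, kernel_size=(1, 3), padding="same", activation="relu")(c)
c = layers.MaxPooling2D(pool_size=(1, 2))(c)
c = layers.Flatten()(c)

# concatenated output head (Fig. 2)
z = layers.concatenate([m, c])
out = layers.Dense(N_CLASSES, activation="softmax")(z)

model = keras.Model(inputs=[feat_in, raw_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```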
3 Results
The whole dataset was split into train and test sets in the proportion 70/30%. Model hyperparameters were determined by 5-fold grid search cross-validation on the train set. The number of samples was 8528 in the train set and 3655 in the test set. The models with the selected hyperparameters were then retrained on the train data, and the final evaluation was carried out on the test data. The whole evaluation process is shown in Fig. 5.
Fig. 5. Scheme of choosing the best model parameters and evaluating accuracy (for more information see the scikit-learn documentation).
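A minimal scikit-learn sketch of this evaluation scheme (70/30 split, 5-fold grid search on the train part, refit with the best parameters, final scoring on the test part), using the random forest from Sect. 2.3 as the estimator; the parameter grid is an illustrative subset, not the full grid searched in the study.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# X: feature matrix, y: 7-class labels (see Sect. 2.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {                       # illustrative subset of the searched values
    "n_estimators": [500, 1000],
    "max_depth": [10, 15],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)         # 5-fold grid search on the train set only

best_model = search.best_estimator_  # refit on the full train set with the best parameters
print("best params:", search.best_params_)
print("test accuracy:", best_model.score(X_test, y_test))
```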
The models were trained on an AMD Ryzen 7 3700X (8 cores × 3.6 GHz). Table 2 shows the number of training epochs, the total training time, the prediction time for one sample, and the resulting prediction accuracy (the share of correctly predicted samples among all samples).

Table 2. Models performance characteristics

Model            Learning epochs   Learning time, sec   Prediction time per sample, 10^-5 sec   Accuracy %
CNN              50                1248.93              71                                      77.59
MLP              256               69.08                5.83                                    68.55
Combined input   60                1312.88              29                                      80.16
RFC              —                 87.84                21                                      54.51
KNN              —                 0.17                 6.26                                    54.58
To examine the prediction mistakes of the highest-accuracy model, its confusion matrix was plotted (Fig. 6). At first glance, the model mostly makes mistakes when separating the Neutral and Relaxation states, which is an expected consequence of the dataset being unbalanced. Table 3 shows the prediction accuracy for each class.
Fig. 6. Confusion matrix of predicted samples for the combined input model in subject-independent classification

Table 3. Prediction accuracy of each class

Class        Accuracy %
Neutral      79.88
Relaxation   79.24
Head         82.26
Teeth        88.51
Blinking     86.39
Walking      68.99
Cognitive    76.12
The prediction accuracy for such artifacts as head rotating, teeth tightening, and blinking is over 80%, which is not difficult to explain given their spectral and temporal characteristics. The worst-predicted class is calm walking, and it is mainly confused with the neutral state.
4 Conclusion
The study aimed to explore different machine learning and neural network techniques for detecting movement-related artifacts. The results indicated that machine learning methods, specifically RFC and KNN, with pre-feature extraction, had quick prediction times but low accuracy rates, which is not ideal for artifact detection. On the other hand, neural network models had longer learning times but similar prediction times. The suggested CNN and combined input models had higher accuracy rates of approximately 77% and 80%, respectively, making them suitable for portable BCI systems. Acknowledgement. The authors express gratitude for the support from the Strategic Academic Leadership Program of the Southern Federal University (“Priority 2030”).
References
1. Yadav, H., Maini, S.: Electroencephalogram based brain-computer interface: applications, challenges, and opportunities. Multimed. Tools Appl. 1–45 (2023). https://doi.org/10.1007/s11042-023-15653-x
2. Urigüen, J., Zapirain, B.: EEG artifact removal - state-of-the-art and guidelines. J. Neural Eng. 12, 031001 (2015). https://doi.org/10.1088/1741-2560/12/3/031001
3. Stone, J.: Independent Component Analysis, vol. 6 (2005). https://doi.org/10.1002/0470013192.bsa297
4. Gorjan, D., Gramann, K., De Pauw, K., Marusic, U.: Removal of movement-induced EEG artifacts: current state of the art and guidelines. J. Neural Eng. 19 (2022). https://doi.org/10.1088/1741-2552/ac542c
5. Turnip, A., Kusumandari, D.: Artifacts removal of EEG signals using adaptive principal component analysis (2015). https://doi.org/10.2991/iccst-15.2015.34
6. Ksiezyk, R., Blinowska, K.J., Durka, P.J., Szelenberger, W., Androsiuk, W.: Neural networks with wavelet preprocessing in EEG artifact recognition (1998)
7. Özdemir, M., Kizilisik, S., Güren, O.: Removal of ocular artifacts in EEG using deep learning, pp. 1–6 (2022). https://doi.org/10.1109/TIPTEKNO56568.2022.9960203
8. Schiratti, J.-B., Le Douget, J.-E., Le Van Quyen, M., Essid, S., Gramfort, A.: An ensemble learning approach to detect epileptic seizures from long intracranial EEG recordings. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 856–860 (2018). https://doi.org/10.1109/ICASSP.2018.8461489
9. Kesic, S., Spasic, S.: Application of Higuchi’s fractal dimension from basic to clinical neurophysiology: a review. Comput. Methods Programs Biomed. 133 (2016). https://doi.org/10.1016/j.cmpb.2016.05.014
10. Hernández-del-Toro, T., Reyes-Garcia, C.A., Villaseñor-Pineda, L.: Toward asynchronous EEG-based BCI: detecting imagined words segments in continuous EEG signals. Biomed. Signal Process. Control 65, 102351 (2021). https://doi.org/10.1016/j.bspc.2020.102351
Recognition of Spoken Words from MEG Data Using Covariance Patterns Vitaly Verkhlyutov1(B) , Evgenii Burlakov2 , Victor Vvedensky3 , Konstantin Gurtovoy3 , and Vadim Ushakov4,5 1
Institute of Higher Nervous Activity and Neurophysiology of RAS, Moscow, Russia [email protected] 2 University of Tyumen, Tyumen, Russia 3 RNC Kurchatov Institute, Moscow, Russia 4 National Research Nuclear University MEPhI, Moscow, Russia 5 Institute for Advanced Study of the Brain, Lomonosov Moscow State University, Moscow, Russia
Abstract. The most significant advances in the development of BCI systems have been achieved using invasive methods (ECoG and stereo-EEG). However, EEG and MEG methods, which do not require neurosurgical intervention, will take their place in the future when solving the problems of building BCI and AI systems. Usually, artificial neural networks that use the original signal are applied to analyze MEG and EEG for the purpose of speech decoding. For this purpose, we used connectivity parameters in sensor space, which allowed us to identify functional connections between brain regions specific to a certain speech fragment. In doing so, we used only positively correlated signals. We tested the created processing algorithm using the MEG of 7 healthy volunteers. After training, the algorithm could extract three series of 8 words, repeated 5 times and mixed with background words, almost without error. Mapping of the correlation coefficients showed that the related areas were located in occipital, parietal, and temporal regions. At the same time, we cannot say with certainty that Broca’s and Wernicke’s areas are unaffected by this process.

Keywords: BCI · AI · MEG · EEG · speech decoding · connectivity in sensor space · semantic systems
The reported study was supported by the Russian Science Foundation (grant no. 22-21-00756). The authors are grateful to Chernyshev B.V. and Prokofyev A.O. from the Center for Neurocognitive Research (MEG Center) at the Moscow City University of Psychology and Education for their help in the experiment.
1 Introduction
Decoding speech stimuli from brain activity data is an important task for theoretical and applied purposes. Within this direction, researchers are trying to solve the problem of compensating for lost functions in various types of disorders of speech reproduction and perception at the cortical level, which is directly related to BCI. At the same time, the study of this problem makes it possible to advance in the direction of improving artificial intelligence systems. Significant progress has been made with intracranial ECoG recording [1] and stereo-EEG recording [2]. However, invasive methods have disadvantages associated with the need for neurosurgical intervention. Recent studies have shown that decoding macroscopic fMRI data using a trained language model allows for reasonably accurate decoding of speech based on semantic information [3]. Other non-invasive recording methods, such as EEG and MEG, have proven that speech perception and reproduction affect rhythmic [4] and evoked [5] brain electrical activity. Thus, there are all the prerequisites for speech decoding based on MEG and EEG data. However, to analyze brain activity in this case, neural network technologies are usually used [6], the results of which are difficult to interpret. For these purposes, we propose to use a simpler technique for investigating MEG connectivity in sensor space, which is based on observations showing a remarkable similarity of the current MEG activity on a certain set of sensors when listening to words, which is dynamically rearranged when the subject recognizes the meaning of a speech stimulus [7].
2 Measurements
Subjects. Seven volunteer subjects (four men and three women) participated in the pilot study aimed at testing the technique. One of the subjects was lefthanded at the age of 23. The average age of the young right-handed subjects was 23.8±0.5 years. The elderly right-handed subject was 67 years old. All subjects had no history of neurological or psychiatric disorders. The study was carried out in accordance with The Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving humans, and the Ethical committee of the Institute of Higher Nervous Activity and Neurophysiology of the RAS approved the research protocols (Protocol No. 5 of December 2, 2020). The studies took place from 12 to 15 o’clock. Stimuli. The subject was presented with three series of speech stimuli in the form of Russian adjectives. Each series included eight original words, which were repeated five times. All forty words were randomly mixed. Before each series, three words from the same series of words were presented to adapt the subject, but the recorded data from these presentations were not considered for analysis. The series of words differed in sound duration and were 600, 800, and 900 ms, respectively. The loudness of the sound was selected for each subject and ranged from 40 to 50 dB. The frequency of the digitized words as an audio file did not
exceed 22 kHz. After a word was presented, the subject had to press the handheld manipulator button if he understood the meaning of the presented word. Pressing the button after 500 ± 100 ms (randomized) was followed by the next stimulus, but no later than 2000 ms after the previous presentation. Experimental Procedure. Before the experiment, the coordinates of the anatomical reference points (left and right preauricular points and the bridge of the nose) were determined using the FASTRAK 3D digitizer (Polhemus, USA), as well as indicator inductance coils attached to the subject’s scalp surface in the upper part of the forehead and behind the auricles. During the experiment, the subject was in a magnetically shielded multilayer permalloy chamber (AK3b, Vacuumschmelze GmbH, Germany) and his head was placed in a fiberglass helmet, which is part of a fiberglass Dewar vessel with a sensor array immersed in liquid helium. The test subject was seated so that the surface of the head was as close as possible to the sensors. To avoid artifacts, sound stimuli were delivered through a pneumatic system delivering sound from a standard audio stimulator. The stimulator was programmed using Presentation software (USA, Neurobehavioral Systems, Inc). The subject was asked to relax and close his eyes. His right hand touched a console with buttons. He had to press one key with his index finger after recognizing the heard word. At the end of the series of presentations, the subject was allowed to rest for 1–2 min. Registration. The MEG was recorded using a 306-channel VectorView hardwaresoftware complex (Elekta Neuromag Oy, Finland), whose sensors cover the entire surface of the head and consist of 102 triplets containing one magnetometer and two planar gradiometers that measure mutually orthogonal components of the magnetic field. In the present study, data from all 306 sensors were analyzed. This made it possible to record activity from both surface and deep current sources in the subject’s cerebral cortex. To record oculomotor activity, two bipolar electrooculogram (EOG) leads were used, which consisted of four electrodes located on the outer orbits of both eyes (horizontal component), as well as above and below the orbit of the left eye (vertical component). MEG and EOG signals were recorded at a sampling frequency of 1000 Hz with a bandwidth of 0.1–330 Hz. The position of the head relative to the array of sensors during the experiment was monitored in real time using indicator inductors. Recording artifacts were removed and the head position was corrected using the spatiotemporal signal separation method implemented in the MaxFilter program (Elekta Neuromag Oy, Finland). Data Availability. Stimuli and analyzed data (MEG, MRI) are available on the ZENODO resources [8].
3 Data Analysis
MEG segments were identified by word onset labels. These segments were used to build covariance matrices, calculating the Pearson correlation of each registration channel with each other and thus forming a covariance matrix for the original word:
$C_{nk} = \operatorname{cov}(M_{nk})$, where $M_{nk}$ is the vector of the MEG data for the $k$-th repeat of the $n$-th word ($n = 1, \dots, N$, $k = 1, \dots, K$). For each word, the covariance matrices were averaged over its repeats, and the matrix averaged over all words, with the principal diagonal elements replaced by zeros and all elements less than 0.7 set to zero, was subtracted from them:

$$F_n = \frac{1}{K}\sum_{k=1}^{K} C_{nk} \;-\; O_{0.7}\!\left(\frac{1}{KN}\sum_{n=1}^{N}\sum_{k=1}^{K} C_{nk}\right)$$

where the operator $O_x$ replaces the principal diagonal elements by zeros and sets to zero all elements that are less than $x$. The resulting patterns were used to calculate the weights of newly presented words. The weight $w$ of a word with the covariance matrix $C$ can be assessed with respect to the $\hat{n}$-th word filter $F_{\hat{n}}$ as $w = \operatorname{Sum}(C \circ H F_{\hat{n}})$, where the functional $\operatorname{Sum}$ maps a matrix to the sum of its elements, the binary operation $\circ$ denotes the elementwise product of two matrices (of the same dimension), and $H$ is the elementwise Heaviside function. If the weight of a recognised word exceeded the weights of all others, the word was considered recognised. Thus, the number of the recognised word can be found from the relation

$$\text{word number} = \operatorname*{argmax}_{n=1,\dots,N} w(F_n).$$
Weights were calculated for all 40 words of a series. The 5 maximum weight values were expected to belong to the recognized word. A recognition error was counted when the weight of a target presentation dropped below the maximum weight over all presentations of the remaining 35 words. Thus, the system could make a maximum of 5 errors when recognizing one word and 40 errors when recognizing the 8 original words. This allowed us to evaluate the success rate of recognition as a percentage: at 100% recognition, all 5 repetitions of each word were identified in the sequence of 40 words, and each error reduced the recognition success score by 2.5%. The described algorithm is implemented in Matlab [9].
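The original implementation is in Matlab [9]; the sketch below restates the procedure in Python under one reading of the formulas above (in particular, of where the thresholding operator O_0.7 is applied), with all function names ours.

```python
import numpy as np

def corr_matrix(segment):
    """Channel-by-channel Pearson correlation matrix of one word segment
    (segment shape: n_channels x n_samples)."""
    return np.corrcoef(segment)

def O(mat, x):
    """Zero the principal diagonal and all elements smaller than x."""
    out = mat.copy()
    np.fill_diagonal(out, 0.0)
    out[out < x] = 0.0
    return out

def word_filters(C, thr=0.7):
    """C: array (N_words, K_repeats, n_ch, n_ch) of correlation matrices.
    Per-word average minus the thresholded grand average (one reading of the F_n formula)."""
    per_word = C.mean(axis=1)                 # (N, n_ch, n_ch)
    grand = C.mean(axis=(0, 1))               # (n_ch, n_ch)
    return per_word - O(grand, thr)

def recognize(C_new, F):
    """Weight of a new segment against each word filter; the word with the
    largest weight is taken as recognized (argmax over n)."""
    weights = [(C_new * (Fn > 0)).sum() for Fn in F]   # Sum(C o H(F_n))
    return int(np.argmax(weights)), weights

# toy usage: 8 words x 5 repeats of random 306-channel, 1-s segments
rng = np.random.default_rng(0)
segs = rng.standard_normal((8, 5, 306, 1000))
C = np.array([[corr_matrix(s) for s in reps] for reps in segs])
F = word_filters(C)
print(recognize(corr_matrix(segs[3, 0]), F)[0])
```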
4 Results
Visual analysis of the ongoing MEG activity revealed a remarkable overlap of the activity signals on some sensor pairs during word sounding (Fig. 1). These observations are described in detail in our other paper [10]. Correlation analysis of the one-second segments of the MEG showed correlation coefficients ranging from below −0.9 to above 0.9 (Fig. 2). However, we only used r values that are greater than 0.7 for the analysis.
Fig. 1. MEG signals from 2 channels at the time of listening to the word.
Using the above algorithm (see Sect. 3), we obtained normalized weights for all 24 words in the three series for each subject. Figure 3 shows an example of the recognition of 8 target words. Each of these words was repeated 5 times and randomly intermixed with the background words. Almost always, the weights of the target words were larger than any weight of the background words.
Fig. 2. Matrix of correlation coefficients for 306 channels of a one-second segment of the MEG during the sound of a word.
We mapped the maximum values of the correlation coefficients above the 0.7 level and found that they were distributed over the parietal, occipital, and temporal leads (Fig. 4). This location is consistent with the distributed sources associated with the semantic network of the brain found in other authors’ work on fMRI data [11]. The configuration of connections in the MEG sensor space was specific for each word listened to (Fig. 5). This feature allowed us to recognize words from the MEG signal with a minimum number of errors.
Fig. 3. Word recognition. Normalized weights for target and background words. Each cell contains 5 maximum weights that match the target words.
Fig. 4. Correlation coefficient mapping for 8 original words in the series. Only r > 0.7 values were considered.
The algorithm recognized all words without error in one series of presentations in three subjects, in two series of presentations in another three subjects, and in all series of presentations in one subject. We did not find any trend in the algorithm’s recognition quality depending on age, gender, or dominant hand (one male subject was left-handed).
Fig. 5. Significant connections in MEG sensor space for the 6 original words (1–6) in the series (one of the subjects). F - frontal regions. O - occipital regions. L - left side. R - right side. MEG - red circles; GRAD - blue and green circles.
5 Discussion
An important factor for the behavior of neuronal populations is their synchronization, which allows many neurons to work in parallel and process many properties of the input signal simultaneously, establishing their multiple connections with other mental objects and their properties [12]. In our experiments, we observe that some populations are active in the perception of any word, and some are specific to a particular word. Our speech recognition system can be based on the specificity property. In doing so, we investigate only amplitude connectivity, which is due to distant connections [13] as opposed to phase connectivity, which in turn is provided by local interactions. The presence of phase synchronization in our experiments proves the presence of both positively and negatively correlated data. Phase coupling indicates possible effects associated with the rotation of current dipoles, which are due to cortical traveling waves [14]. For our decoder, we exploit properties of the MEG signal that are due to long-range interactions (along myelin fibers) between electrical sources in the brain [15]. However, the presence of signs of local processes gives hope to supplement our analysis with the method of traveling wave reconstruction developed by us earlier [16].
References 1. Anumanchipalli, G.K., Chartier, J., Chang, E.F.: Speech synthesis from neural decoding of spoken sentences. Nature 568(7753), 493–498 (2019). https://doi.org/ 10.1038/s41586-019-1119-1 2. Norman-Haignere, S.V., et al.: Multiscale temporal integration organizes hierarchical computation in human auditory cortex. Nat. Hum. Behav. 6(3), 455–469 (2022). https://doi.org/10.1038/s41562-021-01261-y
3. Tang, J., LeBel, A., Jain, S., Huth, A.G.: Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. (2023). https://doi. org/10.1038/s41593-023-01304-9 4. Neymotin, S.A., et al.: Detecting spontaneous neural oscillation events in primate auditory cortex. Eneuro, 9(4), ENEURO.0281-21 (2022). https://doi.org/10.1523/ ENEURO.0281-21.2022 5. Anurova, I., et al.: Event-related responses reflect chunk boundaries in natural speech. Neuroimage 255, 119203 (2022). https://doi.org/10.1016/j.neuroimage. 2022.119203 6. Dash, D., Ferrari, P., Wang, J.: Decoding imagined and spoken phrases from noninvasive neural (MEG) signals. Front. Neurosci. 14, 290 (2020). https://doi.org/ 10.3389/fnins.2020.00290 7. Vvedensky, V., Filatov, I., Gurtovoy, K., Sokolov, M.: Alpha rhythm dynamics during spoken word recognition. Stud. Comput. Intell. 1064, 65–70 (2023). https:// doi.org/10.1007/978-3-031-19032-2 7 8. Verkhlyutov, V.: MEG data during the presentation of Gabor patterns and word sets. ZENODO, p. 7458233 (2022). https://zenodo.org/record/7458233 9. https://github.com/BrainTravelingWaves/22SpeechRecognition 10. Vvedensky, V., Verkhlyutov, V., Gurtovoy, K.: Extended and distant cortical areas coordinate their oscillations approaching the instant of decision making during recognition of words. In press 11. Huth, A., De Heer, W., Griffiths, T., Theunissen, F., Gallant, J.: Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532(7600), 453–458 (2016). https://doi.org/10.1038/nature17637 12. Defossez, A., Caucheteux, C., Rapin, J., Kabeli, O., King, J.-R.: Decoding speech from non-invasive brain recordings, pp. 1–15. ArXiv, 2208.12266 (2022). http:// arxiv.org/abs/2208.12266 13. Rolls, E.T., Deco, G., Huang, C.-C., Feng, J.: The human language effective connectome. NeuroImage 258, 119352 (2022). https://doi.org/10.1016/j.neuroimage. 2022.119352 14. Sato, N.: Cortical traveling waves reflect state-dependent hierarchical sequencing of local regions in the human connectome network. Sci. Rep. 12(1), 334 (2022). https://doi.org/10.1038/s41598-021-04169-9 15. Proix, T., et al.: Imagined speech can be decoded from low- and cross-frequency intracranial EEG features. Nat. Commun. 13(1), 48 (2022). https://doi.org/10. 1038/s41467-021-27725-3 16. Verkhlyutov, V., et al.: Towards localization of radial traveling waves in the evoked and spontaneous MEG: a solution based on the intra-cortical propagation hypothesis propagation hypothesis. Proc. Comput. Sci. 145, 617–622 (2018). https://doi. org/10.1016/j.procs.2018.11.073
Non-visual Eye-Movements Model During Performing Cognitive Tasks in Short-Term Memory Polina A. Lekhnitskaya(B) Neurocognitive Research Laboratory, Kazan (Volga Region) Federal University, Kazan, Russia [email protected]
Abstract. In the blank screen paradigm, participants solved cognitive tasks. Current fixation duration and saccade peak velocity differ between learning and performing some kinds of tasks. Retrieving information from memory and the subsequent mental processing depend on the characteristics of the input task and its difficulty. The best fixation-based accuracy was obtained for the participants who named only two levels of task difficulty; the best saccade-based accuracy was achieved by the Random Forest Classifier. Keywords: Non-visual Eye Movements · Cognition · Eye Tracking
1 Introduction
People have long questioned the purpose of eye movements and analyzed their behavior in different conditions. For quite a long time, the study of eye movements and cognition focused mostly on vision, and only between 1970 and 1980 did investigators become interested in studying non-visual eye movements. To be clear, the term non-visual eye movements refers to eye movements that accompany tasks which do not explicitly rely on vision to be performed (non-visual tasks) [1]. Non-visual eye movements seem to be a phenomenon that can help to better understand the nature of cognitive processes and thus improve the educational paradigm. There is strong evidence that looking patterns in conversation differ depending on whether a person is a speaker or a listener [2]. Another example of research work is the theory of “lateral eye movements”, in which lateral gaze shifts are made when people are presented with questions requiring reflective thought. The term lateral EM was explained by the idea that the direction of gaze shifts is the result of asymmetrical hemispheric activation [3, 4]. But because of insufficient empirical support, these findings are considered unreliable. Thus, the answer to the question of what the eye-movement pattern of cognition is remains unknown, which formed the basis of the present study and determined the research questions: 1. Will eye movements vary during different cognitive processes? 2. When retrieving information from short-term memory, do eye movements reflect the characteristics of the input? 3. Do eye movements reflect the difficulty of the task?
2 Materials and Methods
The initial empirical base was collected using a stationary monitor eye-tracking system (EyeLink 1000 Plus). During the study, participants were asked to retell a narrative text and a poem, solve arithmetic exercises, create a denotation for pseudo-words, and solve logical tasks. First, a stimulus with a task was presented, after which a blank screen appeared; then it was proposed to solve the cognitive task mentally, without relying on the text of the task. 50 subjects (16 males, mean age 19 years) with normal or corrected-to-normal vision were recruited for the study, and informed consent was signed. Stimulus switching was done by the participants. Statistical processing of the empirical data was carried out in the program «STATISTICA» and in the programming language “Python” using the “PIL”, “Numpy”, “Cv2”, “Skimage”, “Pandas”, “Os”, and “Matplotlib” libraries. For exploring the second question, we used Pupil Invisible glasses with a sampling frequency of 200 Hz after uploading to Pupil Cloud. Participants (n = 9, 1 male, mean age 19.5 years) with normal or corrected-to-normal vision were asked to complete the following tasks: solving arithmetic examples with single-digit and double-digit numbers, reading different genres of texts in Russian, and solving geometric problems. After that, the participants noted which task was easy, normal, and difficult for them. We obtained a dataset with eye movement metrics and a task difficulty category. With machine learning, we tried to predict the difficulty of the task.
3 Results and Discussion
We performed the Kruskal–Wallis test with multiple-comparison p-values (2-tailed). The dependent variables were current fixation duration and current saccade peak velocity. Current saccade peak velocity during the solution of arithmetic exercises differs from that during multiple retellings of the text (p = 0.00), but shows no statistically significant difference from creating the pseudo-word meaning and solving the logic task (p = 1.0). EM during learning a poem (saccades and fixations involved in reading were removed) differ from the other analyzed cognitive processes (p = 0.00) (H(12, N = 123875) = 5940.260, p = 0.000).
Fig. 1. The probability map for reading and retelling the text
Fig. 2. The probability map for reading and solving the logic tasks
Fig. 3. The probability map for reading and solving the creativity tasks
Fig. 4. The probability map for reading and solving the arithmetic tasks
Regarding EM strategies in processing a task, exploring arithmetic (p = 0.00) and pseudo-word exercises (p = 0.00) differed from all types of tasks; only reading the logic task and the text for retelling showed no difference (p = 1.00). The contrary effect was observed in the current fixation duration results. Reading the logic task (p = 0.00) and the text for retelling (p = 0.00) differed from all EM strategies in learning the tasks, but we did not find any difference between arithmetic and pseudo-word exercises (p = 1.00). We obtained differences in current fixation duration in the following EM measurements: solution of an arithmetic exercise versus creating a meaning for a pseudo-word (p = 0.05), logic solution (p = 0.00), first text retelling (p = 0.00), and poem learning (p = 0.00). Interestingly, the second text retelling and the logic solution do not differ (p = 0.56); also, no difference is found between long-term memory poem retrieval and the third retelling of the text (p = 0.72): this finding may be explained by text consolidation having occurred. To understand the way EM reflect characteristics of the stimulus while a task is processed mentally, we computed probability maps, where the size of the heat map image was reduced and represented by matrices of “0” and “1”. If the pixel was black (the background color of the heat map), it was encoded as “0”, meaning the absence of gaze and a probability equal to zero. Next, all matrices were summed and divided by their total number. In such a way, we obtained the final probability matrices, which were colored depending on the probability. In text retelling we can observe a reduction of the EM area (Fig. 1). The Mann–Whitney U test also showed a significant difference between the first and second text retelling in current
saccade peak velocity (p = 0.00, z = 7.19), while a difference in current fixation duration was absent (p = 0.17, z = −1.36). On the contrary, we did not find a difference between the second and third text retelling in current saccade peak velocity (p = 0.88, z = −0.15), while a significant difference was found in current fixation duration (p = 0.006, z = 2.74). In the probability maps for the logic, arithmetic, and word tasks, we can observe the preservation of the shape of the input stimuli (Figs. 2, 3, 4). Compared to text retrieval, these mental operations are more difficult, and preserving the shape while processing the task may be a result of cognition. We compared the mean values of the EM parameters and found the following pattern: if the current fixation duration is more than 380 ms and the current saccade peak velocity is more than 220, such EM can be the result of a mental operation; otherwise, of a learning (stimulus-processing) strategy (Table 1). Systemizing these observations, we get the formula for assessing the cognitive state (CS):

$$CS = \left(\frac{CFD}{380} + \frac{CSPV}{220}\right)/2 \tag{1}$$
where CFD is the current fixation duration and CSPV is the current saccade peak velocity.

Table 1. Mean non-visual eye movements parameters values

                Current fixation duration   Current saccade peak velocity
Ar_ex           337.92                      159
Ar_sol          415.2                       249
Word            282.19                      175
idea            383.77                      254
task            244.25                      171
sol             387.67                      258
fir_text_read   208.08                      159
fir_tex_ret     386.72                      289
second_ret      388.3                       270
thir_ret        397.98                      274
If we get 0, then the eye movements were associated with work on a stimulus; otherwise, with mental information processing. To calculate the duration of fixation knowing the stimulus length, we need to divide this value by 1.5; for saccade peak velocity, by 1.6. This formula needs future validation and the addition of new possible variables for a more accurate result. In the context of this work, the goal of applying machine learning to the data obtained from the Pupil glasses is to find an algorithm which can distinguish different levels of task difficulty. The metrics for fixations used for machine learning classification are duration (ms), fixation x (px), fixation y (px), azimuth (deg), and elevation (deg); the metrics for
saccades are gyro x (deg/s), gyro y (deg/s), gyro z (deg/s), acceleration x (G), acceleration y (G), acceleration z (G), roll, and pitch. A saccade automatically initiates an encoding activity that could interrupt the ongoing processing, which may be the reason for suppressing new fixations [8]. The accuracy of the Random Forest Classifier for saccades is 0.64, of the K-Neighbors Classifier 0.62, of the Gradient Boosting Classifier 0.59, and of the Decision Tree Classifier 0.55. The accuracy of the Random Forest Classifier for fixations is 0.44, of the K-Neighbors Classifier 0.42, of the Gradient Boosting Classifier 0.47, and of the Decision Tree Classifier 0.40. Fixations do reflect the activation of semantic information [5, 6], and longer fixation duration can indicate deeper processing [7]. According to Gary Bargary, there are large individual differences in most oculomotor measures [8], and since the movements of the eyes may be related to characteristics of personality, we applied the machine learning algorithms to each participant separately (Table 2).

Table 2. Machine learning algorithms accuracy performance on distinguishing the degree of the task difficulty for each participant

                Random Forest Classifier   K-Neighbors Classifier   Gradient Boosting Classifier   Decision Tree Classifier
Participant 1   0.67                       0.66                     0.67                           0.68
Participant 2   0.53                       0.54                     0.56                           0.54
Participant 3   0.51                       0.49                     0.51                           0.46
Participant 4   0.56                       0.54                     0.54                           0.43
Participant 5   0.59                       0.55                     0.56                           0.49
Participant 6   0.67                       0.65                     0.64                           0.64
Participant 7   0.51                       0.52                     0.55                           0.46
Participant 8   0.55                       0.57                     0.57                           0.53
Participant 9   0.74                       0.72                     0.74                           0.72
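A minimal sketch of the per-participant classification described above, assuming a pandas dataframe with the listed saccade metrics, a 'difficulty' label, and a 'participant' column; all column and variable names here are ours.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# assumed column names for the Pupil Invisible saccade metrics
SACCADE_METRICS = ["gyro_x", "gyro_y", "gyro_z", "acc_x", "acc_y", "acc_z", "roll", "pitch"]

def per_participant_accuracy(df):
    """Train and score a separate classifier for every participant."""
    scores = {}
    for pid, part in df.groupby("participant"):
        X, y = part[SACCADE_METRICS], part["difficulty"]   # easy / normal / hard
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        scores[pid] = accuracy_score(y_te, clf.predict(X_te))
    return pd.Series(scores)
```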
The best accuracy results were obtained for the first and ninth participants. After the experiment, they marked only two levels of difficulty in the questionnaire: normal and hard, and easy and hard, respectively. The mean F1-score for the easy level is 0.29 and for the normal level 0.50. From the eye-movement perspective, it is possible to identify differences only between two levels of task difficulty: easy and hard, or normal and hard.
4 Conclusion The current study investigated whether eye movements reflect the input shape and cognitive processing. In the blank screen paradigm participants solved cognitive tasks. We found that current fixation duration and saccade peak velocity differ in learning and performing some kinds of tasks. Retrieving information from memory and following mental processing depends on the characteristics of input task and its difficulty. In the
second experiment, we focused on task difficulty and tried to find its reflection in eye movements using machine learning algorithms. The best fixation-based accuracy was obtained for the participants who named only two levels of task difficulty. As a further direction, the final formula and the current observations need to be validated in further experiments.
References
1. Diamantopoulos, G.: Novel eye feature extraction and tracking for non-visual eye-movement applications. Dissertation, University of Birmingham (2010)
2. Bavelas, J.B., Coates, L., Johnson, T.: Listener responses as a collaborative process: the role of gaze. J. Commun. 52, 566–580 (2002)
3. Ehrlichman, H., Weinberger, A.: Lateral eye movements and hemispheric asymmetry: a critical review. Psychol. Bull. 85(5), 1080–1101 (1978)
4. MacDonald, B.H., Hiscock, M.: Direction of lateral eye movements as an index of cognitive mode and emotion: a reappraisal. Neuropsychologia 30, 753–755 (1992)
5. Yee, E., Sedivy, J.C.: Eye movements to pictures reveal transient semantic activation during spoken word recognition. J. Exp. Psychol. Learn. Mem. Cognit. 32(1), 1–14 (2006). https://doi.org/10.1037/0278-7393.32.1.1
6. Just, M.A., Carpenter, P.A.: Eye fixations and cognitive processes. Cogn. Psychol. 8(4), 441–480 (1976). https://doi.org/10.1016/0010-0285(76)90015-3
7. Alemdag, E., Cagiltay, K.: A systematic review of eye tracking research on multimedia learning. Comput. Educ. 125, 413–428 (2018). https://doi.org/10.1016/j.compedu.2018.06.023
8. Bargary, G.: Individual differences in human eye movements: an oculomotor signature? Vis. Res. 141, 157–169 (2017)
Mechanisms for Contribution of Modifiable Inhibition to Increasing Signal-to-Noise Ratio and Contrasted Representations of Sensory Stimuli in the Neocortex Isabella G. Silkis(B) Institute of Higher Nervous Activity and Neurophysiology, Russian Academy of Sciences, Moscow, Russia [email protected]
Abstract. The proposed mechanism for increasing the signal-to-noise ratio in various CNS structures is based on changes in postsynaptic processes evoked by the simultaneous action of a neuromodulator on the same type of Gs or Gq/11 protein-coupled receptors located both on the main cell and on the input inhibitory interneuron. If the initial excitation of the main cell is relatively strong, LTP is induced at its excitatory input simultaneously with LTD at its inhibitory input. If the excitation of the main cell is relatively weak, LTD is induced at its excitatory input simultaneously with LTP at its inhibitory input. Therefore, the reactions of the main cell to strong signals will become stronger, and it will not respond to weak signals (which can be considered as noise). Such an effect is interpreted as an increase in the signal-to-noise ratio. Previously, we explained why the sign (LTP or LTD) of the modifying action of dopamine on the efficacy of cortical inputs to spiny cells of the striatum (the input structure of the basal ganglia) depends on the initial strength of this excitatory input. Such a character of dopamine-dependent modulations and the subsequent activity reorganization in the cortico-basal ganglia-thalamocortical loops leads to a contrasting enhancement of neocortical representations of preferred sensory stimuli simultaneously with the weakening of representations of other stimuli. This effect is facilitated by other neuromodulators due to an increase in the signal-to-noise ratio on striatonigral spiny cells in combination with LTP of the excitation of neurons in the neocortex, hippocampus, and thalamus projecting onto spiny cells. The proposed mechanisms are fundamentally different from generally accepted ones. Keywords: Signal-to-noise ratio · Contrasted signal representation in the neocortex · Modulation of the efficacy of excitatory and inhibitory synaptic transmission
Abbreviations

C-BG-Th-C loop   Cortico-basal ganglia-thalamocortical loop
INs              Parvalbumin-containing GABAergic interneurons
SNR              Signal-to-noise ratio
1 Introduction Based on known experimental data, it is believed that afferent inhibition improves the signal-to-noise ratio (SNR), and that lateral inhibition in the neocortex involved in the shaping receptive fields. In the primary sensory cortical areas, afferent disynaptic inhibition of pyramidal cells is provided by the flow of excitation from the thalamic projection nuclei both to these cells and to GABAergic interneurons, in particular, to parvalbumincontaining fast spiking interneurons (INs). Lateral inhibition is provided by inputs from adjacent and distant neocortical areas. Known mechanisms for increasing SNR are based on the relative magnitude of cell depolarization by its excitatory input and hyperpolarization by its inhibitory input. Due to disynaptic inhibition, the main cell cannot respond to weak signals [1]. Disinhibition also functions when the input inhibitory interneuron inhibits another interneuron that is projected onto the main cell [2]. The processes involved in these mechanisms last tens of milliseconds, while it is generally accepted that learning and memory are based on the modifications (LTP and LTD) of the efficacy of synaptic transmission that last tens of minutes and hours. There are experimental evidences of a significant role of INs in increasing SNR and in the formation of reliable cortical representations of visual stimuli in the primary visual cortical area (V1) [3]. In the primary auditory cortical area (A1), the inhibition provided by IN is involved in the formation of reliable and well shaped receptive fields of pure sound tones. It has been shown that in V1 field of mice, INs have a wide tuning for stimulus orientation; and in the A1 field, the tuning of INs to the frequency of the pure sound tones is not very well shaped and weakly depends on the tone intensity [4]. Based on these data, it was concluded that the INs in the A1 field do not selectively combine input data from the local network and play only an insignificant role in the frequency tuning of the main cells [4]. The authors of [5] associated the wide frequency tuning of INs in the A1 field with a large range of their synaptic inputs. We assume that because the number of inhibitory interneurons in the neocortex is small, and since they receive innervation from many cells and inhibit a large number of cells, it is unlikely that lateral inhibition in the neocortex can provide input-specific effects for a limited group of cells. Therefore, INs cannot underlie the formation of columns in the V1 field, such as the orientation columns of Hubel and Wiesel, or explain the presence of a tonotopic map in the A1 field. The purpose of this work was to analyze possible mechanisms of involvement of afferent inhibition in the increase in SNR, and contribution of inhibitory neurons in the formation of well-defined representations of sensory stimuli in the activity of neuronal groups in the neocortex. It is obvious that in the CNS, numerous neuromodulators should be involved in these mechanisms, since they influence long-term changes of synaptic transmission. In this work, we used unified modulation rules for the long-term modifications in the efficacy of synaptic transmission that have been formulated earlier basing on changes in intracellular processes [6].
2 Contribution of Modifiable Afferent Inhibition to Long-term Rise in the Signal-to-Noise Ratio
According to the unified modulation rules [6], the action of neuromodulators on postsynaptic receptors coupled to Gs or Gq/11 proteins leads to an increase in the activity of protein kinases that phosphorylate AMPA and NMDA receptors and consequently induces LTP of the efficacy of synaptic transmission. If a neuromodulator simultaneously activates the same type of Gs or Gq/11 protein-coupled receptors on both the main cell and the inhibitory interneuron that innervates it, providing disynaptic inhibition (Fig. 1), the resulting modulation effects on the main cell must depend on the relative strength of its excitation and inhibition [7].
Fig. 1. A simplified scheme of the influence of neuromodulators on the modification of the efficacy of synaptic transmission in the neural network that processes sensory information. D1 and D2, dopamine receptors; IN, inhibitory interneuron; Hipp, hippocampus; GPe and GPi, external and internal parts of the globus pallidus; Prefr.Cort, prefrontal cortex; Str, striatum; S-N and S-P, striatonigral and striatopallidal spiny cells; SNr, substantia nigra reticulata; SNc and VTA, substantia nigra compacta and ventral tegmental area, structures containing dopaminergic cells; lHpt, lateral hypothalamus that contains oxytocin- and vasopressinergic cells. Small triangles and squares potentiated and depressed inputs, respectively. Lines ending with arrows and rhombs, excitatory and inhibitory inputs, respectively. Thick and thin lines, strong and weak inputs, respectively. Black circles, GABAergic cells. Dash-dotted and dashed lines with open arrows, oxytocinergic and dopaminergic inputs, respectively. The basal ganglia nuclei outlined with thicker lines than other structures.
If excitation is strong enough and inhibition is relatively weak, the activity of protein kinases will predominate in the postsynaptic main cell. In this case, the phosphorylation
of AMPA and NMDA receptors will result in the induction of LTP at the excitatory input to the main cell, while the simultaneous phosphorylation of GABAa receptors will induce LTD at the inhibitory input to the main cell (Fig. 1). Therefore, a weak inhibitory input will inhibit the main cell more weakly, and a potentiated excitatory input will have a stronger effect on this cell and evoke a stronger reaction. If excitation is relatively weak and inhibition is strong enough, the activity of protein phosphatases will predominate in the postsynaptic main cell. In this case, dephosphorylation of AMPA and NMDA receptors will lead to the induction of LTD at the excitatory input, while the simultaneous dephosphorylation of GABAa receptors will promote the induction of LTP at the inhibitory input [7]. (For simplicity, this case is not shown in Fig. 1.) Therefore, against the background of an additional increase in the efficacy of inhibition, a weak excitatory input will not be able to bring the main cell to the generation of spikes. Thus, under modifiable disynaptic inhibition, the reactions of the main cell to strong signals will be further enhanced, and it will not respond to weak signals. This effect can be interpreted as an increase in the SNR, and weak signals can be considered as noise. The proposed mechanism is consistent with known experimental data. For example, it was shown that the neuropeptide vasopressin, whose Gq/11 protein-coupled receptors are located on both pyramidal cells and inhibitory interneurons of the hippocampal CA1 field, promotes the induction of LTP of the efficacy of excitatory inputs to both types of cells [8]. Vasopressin also increases the SNR and promotes the fine tuning of the responses of pyramidal cells [8]. In the CA1 field, the same effect was produced by oxytocin, a neuropeptide whose receptors are also coupled to Gq/11 proteins [9]. It was assumed that oxytocin affects the functioning of the hippocampus mainly by modulating the activity of interneurons [9]. Besides, oxytocin increased the firing frequency of both pyramidal cells and inhibitory interneurons in the prefrontal cortex and in the piriform cortex, which is involved in the processing of olfactory information [10, 11]. Therefore, oxytocin can improve the SNR in these structures as well. Based on the known data, it was concluded that due to the activation of oxytocin receptors on inhibitory interneurons in cortical and subcortical structures, an improvement of the SNR could take place in different parts of the neural network [1]. In the prefrontal cortex, dopamine-evoked activation of Gs protein-coupled D1 receptors promotes the induction of LTP of the efficacy of excitatory inputs to both pyramidal cells [12] and INs [13]. Therefore, dopamine can improve the SNR in the prefrontal cortex. It was found that, in the activity of neurons in this neocortical area, increasing the SNR under the action of dopamine improves the coding of the direction of stimulus movement [14]. A similar effect can also take place in the piriform cortex, wherein the properties of odors are represented, since in this cortical area dopamine, acting on D1 receptors, also modulates the activity of pyramidal cells and INs, increasing the efficacy of their excitation [15].
Thus, in contrast to known mechanisms, in which the magnitude of the SNR is determined by the amplitude and sign of cell polarization, the proposed mechanism is based on changes in intracellular processes in the postsynaptic main cell which underlie the simultaneous long-term modulations of the efficacy of excitatory and inhibitory inputs to this cell. These processes make it possible to maintain the improved SNR for a long time.
3 Contribution of Dopamine-Dependent Synaptic Modifications to the Generation of Contrasted Representations of Sensory Stimuli in the Neocortex
Earlier, we proposed a unified mechanism for processing multimodal sensory information (visual, auditory, and olfactory) in the topographically organized cortico-basal ganglia-thalamocortical (C-BG-Th-C) neural loops in the CNS. These loops contain primary and higher neocortical areas, as well as subcortical structures connected with the neocortex [16–18]. Information processing in the C-BG-Th-C loops essentially depends on the release of dopamine by midbrain neurons in response to stimulus and reinforcement. Dopamine promotes the induction of LTP on strong excitatory inputs (which allow the opening of NMDA channels through which Ca2+ enters the cell) to striatonigral spiny cells, on which Gs protein-coupled D1 receptors are predominantly located. Striatonigral cells give rise to the direct disinhibitory pathway through the basal ganglia (it has two subsequent inhibitory synapses) (Fig. 1). Simultaneously, dopamine promotes the induction of LTD on strong excitatory inputs to striatopallidal spiny cells, on which Gi/0 protein-coupled D2 receptors are predominantly located. Striatopallidal cells give rise to the indirect inhibitory pathway through the basal ganglia (it has three subsequent inhibitory synapses) (Fig. 1). As a result, the inhibition of those thalamic neurons that were initially strongly activated by the preferred sensory stimulus becomes synergistically weakened through both pathways in the basal ganglia. Therefore, the activity of these thalamic neurons increases and they more strongly excite the neocortical neurons topographically connected with them [19, 20]. Taking into account the features of intracellular processes in striatal spiny cells, we pointed out that the modification rules for strong and weak (which do not allow opening NMDA channels) excitatory inputs to spiny cells are opposite [19]. Therefore, if the corticostriatal input is weak, dopamine promotes the induction of LTP on striatopallidal spiny cells and LTD on striatonigral spiny cells. (For simplicity, these modifications are not shown in Fig. 1.) In this case, dopamine-dependent activity reorganizations in the C-BG-Th-C loop should simultaneously lead to a decrease in the activity of topographically connected thalamic and neocortical neurons initially weakly activated by a non-preferred sensory stimulus [20]. Due to these processes, as activity circulation cycles accumulate in the C-BG-Th-C loops, a contrasted representation of the physical property of the preferred sensory stimulus can be formed in the activity of initially strongly excited neurons in the corresponding cortical area. This stimulus will be perceived more clearly, especially against the background of the weakening of the activity of other neurons in this cortical area [16–18]. In particular, this is why a finely defined tonotopic map can be formed in the A1 field [17]. When applying this mechanism to the processing of sounds, we took into account the data that field A1 neurons project only onto striatonigral spiny cells, on which D1 receptors are located. Therefore, only the direct disinhibitory pathway through the basal ganglia participates in the formation of the tonotopic map in the A1 field.
The proposed mechanism is fundamentally different from the well-known mechanism of functioning of the C-BG-Th-C loop, which is based on the opposite contributions of the direct and indirect pathways through the basal ganglia to the selection of neocortical activity
[21]. It must be noted that recordings of identified striatonigral and striatopallidal cells in behaving animals confirm the proposed mechanism [22–24]. Taking into account the similarity of the processing of signals of different modalities [18], we suppose that the proposed mechanism may underlie the formation of precise neuronal representations of various sensory stimuli in the activity of small groups of neighboring pyramidal neurons in the corresponding neocortical areas.
4 The Role of Various Neuromodulatory Mechanisms in the Shaping Contrasted Representation of a Sensory Stimulus in the Neocortex The presence of disynaptic afferent inhibition in the neocortex (Fig. 1) should additionally enhance the contrasted neuronal representations of sensory stimuli, since it leads to increasing the SNR in the thalamocortical pathway in each cycle of signal circulation in the C-BG-Th-C loop. The analogous resulting effect of increasing the SNR may take place in the striatum, where INs innervate predominantly striatonigral spiny cells [25] (Fig. 1). These INs receive excitation from the neocortex, hippocampus and thalamus, and therefore exert afferent inhibition of spiny cells [26]. In addition, INs receive dopaminergic innervation, and express Gs protein-coupled D1 receptors. In accordance with the modulation rules [6], it was shown that D1 receptor activation causes an increase in the efficacy of excitation of striatal INs [27]. If the excitatory input to a striatonigral spiny cell is strong enough, dopamine may increase the SNR in this cell. This will enhance the reactions of striatonigral spiny cells and promote subsequent disinhibition of thalamic neurons by the output basal ganglia nuclei (Fig. 1) contributing to better shaping the representations of sensory stimuli in the activity of neocortical cells. Oxytocin can directly promote induction of LTP of the efficacy of excitatory inputs to spiny cells by activation of Gq/11 protein-coupled oxytocin receptors located on the spiny cells in the ventral striatum that includes olfactory tubercle, which is involved in the processing of odors. In accordance with modulation rules [6], it was found that activation of oxytocin receptors in the ventral striatum causes an increase in the efficacy of excitation of spiny cells and rise in frequency of their firing [28, 29]. In presence of dopamine, strong excitation of spiny cells must promote contrasted representations of sensory stimuli in the neocortex due to dopamine-evoked activity reorganization in the C-BG-Th-C loop. In addition, contrasted representation can be improved by the potentiating effect of neuromodulators on the efficacy of excitation of pyramidal neurons in the neocortex, thalamus, and hippocampus that innervate spiny cells (Fig. 1). A necessary condition for LTP induction on these projection neurons is the presence of postsynaptic receptors coupled to Gs or Gq/11 proteins [6]. Indeed, there are experimental evidences of the potentiating effect of oxytocin on the efficacy of excitation of the main cells in the olfactory bulb [30]. In the olfactory circuit, this bulb functions as the thalamic nucleus. Oxytocin also potentiates excitation of neurons in the anterior olfactory nucleus [31], piriform, prefrontal, and entorhinal neocortical areas [32]. (For the simplicity, LTP in these structures is not shown in the Fig. 1).
Dopamine can cause similar effects due to the presence of Gs protein-coupled D1 receptors not only in the striatum, but also in the neocortex and hippocampus, i.e., in structures whose pyramidal cells project to the striatum. In accordance with the modulation rules [6], it was found that activation of D1 receptors contributed to the induction of LTP on pyramidal cells of the prefrontal cortex [12] and the hippocampal CA1 field [33]. The dopamine-induced increase in neuronal activity in these structures should increase the excitation of the striatal spiny cells. Therefore, the modulation rules for strong excitatory inputs can be applied to these cells.
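The SNR argument used above can be made explicit with a minimal formalization. The notation below (S for the efficacy of the strong, stimulus-related excitatory input, N for the efficacy of weak, noise-related inputs) is ours and is only an illustrative sketch, not a formula from the cited modulation rules.

% Illustrative sketch only: opposite-sign modification of strong and weak inputs raises the SNR.
\[
\mathrm{SNR} = \frac{S}{N}, \qquad
S \xrightarrow{\ \text{LTP of strong inputs}\ } S^{+} > S, \qquad
N \xrightarrow{\ \text{LTD of weak inputs}\ } N^{-} < N
\]
\[
\Rightarrow \quad \mathrm{SNR}' = \frac{S^{+}}{N^{-}} > \frac{S}{N} .
\]

Under this reading, each cycle of signal circulation in the C-BG-Th-C loop that applies such opposite-sign modification further sharpens the representation.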
5 Conclusions
The suggested mechanism for a long-term increase in the SNR in different CNS structures requires modifiable afferent inhibition. The SNR rise is based on changes in postsynaptic intracellular processes under the action of neuromodulators on the same type of Gs or Gq/11 protein-coupled receptors located both on the main projection cells and on the INs that project to the main cells. The contrasted representation of the physical properties of sensory stimuli in the activity of neurons in the corresponding neocortical areas is proposed to be a consequence of the opposite sign of the modulatory action of dopamine on the efficacy of strong and weak corticostriatal inputs and of the subsequent activity reorganizations in the C-BG-Th-C loops. The increase in the SNR on the striatal spiny cells due to their disynaptic inhibition promotes this contrasting. Contrasting can be further improved by those neuromodulators that promote the induction of LTP of the efficacy of excitation of striatonigral cells and of neurons in the neocortex, hippocampus, and thalamus projecting onto spiny cells. The proposed mechanisms differ from the generally accepted ones. Unlike other mechanisms, the proposed mechanisms may underlie a sustained SNR rise over a long period and the precise tuning of neocortical responses to sensory stimuli of various modalities.
References
1. Froemke, R.C., Young, L.J.: Oxytocin, neural plasticity, and social behavior. Annu. Rev. Neurosci. 44, 359–381 (2021). https://doi.org/10.1146/annurev-neuro-102320-102847
2. Naskar, S., et al.: Cell-type-specific recruitment of GABAergic interneurons in the primary somatosensory cortex by long-range inputs. Cell Rep. 34(8), 108774 (2021). https://doi.org/10.1016/j.celrep.2021.108774
3. Zhu, Y., et al.: Control of response reliability by parvalbumin-expressing interneurons in visual cortex. Nat. Commun. 6, 6802 (2015). https://doi.org/10.1038/ncomms7802
4. Moore, A.K., Wehr, M.: Parvalbumin-expressing inhibitory interneurons in auditory cortex are well-tuned for frequency. J. Neurosci. 33(34), 13713–13723 (2013). https://doi.org/10.1523/JNEUROSCI.0663-13.2013
5. Li, L.Y., et al.: Differential receptive field properties of parvalbumin and somatostatin inhibitory neurons in mouse auditory cortex. Cereb. Cortex 25(7), 1782–1791 (2015). https://doi.org/10.1093/cercor/bht417
6. Silkis, I.G.: The unitary modification rules for neural networks with excitatory and inhibitory synaptic plasticity. Biosystems 48(1–3), 205–213 (1998). https://doi.org/10.1016/s0303-2647(98)00067-7
7. Sil’kis, I.G.: A possible mechanism for the effect of neuromodulators and modifiable inhibition on long-term potentiation and depression of the excitatory inputs to hippocampal principal cells. Neurosci. Behav. Physiol. 33(6), 529–541 (2003). https://doi.org/10.1023/a:1023960402109
8. Ramanathan, G., et al.: Vasopressin facilitates GABAergic transmission in rat hippocampus via activation of V(1A) receptors. Neuropharmacology 63(7), 1218–1226 (2012). https://doi.org/10.1016/j.neuropharm.2012.07.043
9. Owen, S.F., et al.: Oxytocin enhances hippocampal spike transmission by modulating fast-spiking interneurons. Nature 500(7463), 458–462 (2013). https://doi.org/10.1038/nature12330
10. Nakajima, M., Görlich, A., Heintz, N.: Oxytocin modulates female sociosexual behavior through a specific class of prefrontal cortical interneurons. Cell 159(2), 295–305 (2014). https://doi.org/10.1016/j.cell.2014.09.020
11. Mitre, M., et al.: A distributed network for social cognition enriched for oxytocin receptors. J. Neurosci. 36(8), 2517–2535 (2016). https://doi.org/10.1523/JNEUROSCI.2409-15.2016
12. Chen, L., et al.: Dopamine D1/5 receptor-mediated long-term potentiation of intrinsic excitability in rat prefrontal cortical neurons: Ca2+-dependent intracellular signaling. J. Neurophysiol. 97(3), 2448–2464 (2007). https://doi.org/10.1152/jn.00317.2006
13. Kröner, S., Krimer, L.S., Lewis, D.A., Barrionuevo, G.: Dopamine increases inhibition in the monkey dorsolateral prefrontal cortex through cell type-specific modulation of interneurons. Cereb. Cortex 17(5), 1020–1032 (2007). https://doi.org/10.1093/cercor/bhl012
14. Stalter, M., Westendorff, S., Nieder, A.: Dopamine gates visual signals in monkey prefrontal cortex neurons. Cell Rep. 30(1), 164–172.e4 (2020). https://doi.org/10.1016/j.celrep.2019.11.082
15. Potts, Y., Bekkers, J.M.: Dopamine increases the intrinsic excitability of parvalbumin-expressing fast-spiking cells in the piriform cortex. Front. Cell. Neurosci. 16, 919092 (2022). https://doi.org/10.3389/fncel.2022.919092
16. Silkis, I.: A hypothetical role of cortico-basal ganglia-thalamocortical loops in visual processing. Biosystems 89(1–3), 227–235 (2007). https://doi.org/10.1016/j.biosystems.2006.04.020
17. Silkis, I.G.: A possible mechanism of learning-evoked reorganization of receptive fields in the primary auditory cortex (A role of the basal ganglia, prefrontal cortex, hippocampus, acetylcholine and dopamine). In: Advances in Cognitive Neurodynamics (VII). Proceedings of the Seventh International Conference on Cognitive Neurodynamics 2019, pp. 125–133. Springer (2021). https://doi.org/10.1007/978-981-16-0317-4_15
18. Silkis, I.G.: On the similarity of the mechanisms of processing olfactory, auditory, and visual information in the CNS: a hypothesis. Neurochem. J. 17(1), 51–63 (2023). https://doi.org/10.1134/S1819712423010191
19. Silkis, I.: The cortico-basal ganglia-thalamocortical circuit with synaptic plasticity. I. Modification rules for excitatory and inhibitory synapses in the striatum. Biosystems 57(3), 187–196 (2000). https://doi.org/10.1016/s0303-2647(00)00134-9
20. Silkis, I.: The cortico-basal ganglia-thalamocortical circuit with synaptic plasticity. II. Mechanism of synergistic modulation of thalamic activity via the direct and indirect pathways through the basal ganglia. Biosystems 59(1), 7–14 (2001). https://doi.org/10.1016/s0303-2647(00)00135-0
21. Sil’kis, I.G.: Mechanisms of the influence of dopamine on the functioning of basal ganglia and movement choice (a comparison of models). Neurochem. J. 7(4), 270–277 (2013). https://doi.org/10.1134/S1819712413030136
22. Friend, D.M., Kravitz, A.V.: Working together: basal ganglia pathways in action selection. Trends Neurosci. 37(6), 301–303 (2014). https://doi.org/10.1016/j.tins.2014.04.004
23. Cui, G., et al.: Concurrent activation of striatal direct and indirect pathways during action initiation. Nature 494(7436), 238–242 (2013). https://doi.org/10.1038/nature11846
24. Tecuapetla, F., Jin, X., Lima, S.Q., Costa, R.M.: Complementary contributions of striatal projection pathways to action initiation and execution. Cell 166(3), 703–771 (2016). https://doi.org/10.1016/j.cell.2016.06.032
25. Gittis, A.H., et al.: Distinct roles of GABAergic interneurons in the regulation of striatal output pathways. J. Neurosci. 30(6), 2223–2234 (2010). https://doi.org/10.1523/JNEUROSCI.4870-09.2010
26. Szydlowski, S.N., et al.: Target selectivity of feedforward inhibition by striatal fast-spiking interneurons. J. Neurosci. 33(4), 1678–1683 (2013). https://doi.org/10.1523/JNEUROSCI.3572-12.2013
27. Bracci, E., Centonze, D., Bernardi, G., Calabresi, P.: Dopamine excites fast-spiking interneurons in the striatum. J. Neurophysiol. 87(4), 2190–2194 (2002). https://doi.org/10.1152/jn.00754.2001
28. Moaddab, M., Hyland, B.I., Brown, C.H.: Oxytocin excites nucleus accumbens shell neurons in vivo. Mol. Cell. Neurosci. 68, 323–330 (2015). https://doi.org/10.1016/j.mcn.2015.08.013
29. Choi, W.S., Machida, C.A., Ronnekleiv, O.K.: Distribution of dopamine D1, D2, and D5 receptor mRNAs in the monkey brain: ribonuclease protection assay analysis. Mol. Brain Res. 31(1–2), 86–94 (1995). https://doi.org/10.1016/0169-328x(95)00038-t
30. Fang, L.Y., Quan, R.D., Kaba, H.: Oxytocin facilitates the induction of long-term potentiation in the accessory olfactory bulb. Neurosci. Lett. 438, 133–137 (2008). https://doi.org/10.1016/j.neulet.2007.12.070
31. Oettl, L.L., et al.: Oxytocin enhances social recognition by modulating cortical control of early olfactory processing. Neuron 90(3), 609–621 (2016). https://doi.org/10.1016/j.neuron.2016.03.033
32. Oettl, L.L., Kelsch, W.: Oxytocin and olfaction. Curr. Top. Behav. Neurosci. 35, 55–75 (2018). https://doi.org/10.1007/7854_2017_8
33. Papaleonidopoulos, V., Kouvaros, S., Papatheodoropoulos, C.: Effects of endogenous and exogenous D1/D5 dopamine receptor activation on LTP in ventral and dorsal CA1 hippocampal synapses. Synapse 72(8), e22033 (2018). https://doi.org/10.1002/syn.22033
A Photostimuli Presenting Device for Customized SSVEP-based Brain-Computer Interfaces
Alexey V. Kozin(B), Anton K. Gerasimov, Alexander V. Pavlov, and Maxim A. Bakaev
Novosibirsk State Technical University, ul. Nemirovicha-Danchenko 136, Novosibirsk 630087, Russia
[email protected]
Abstract. Steady State Visually Evoked Potentials (SSVEP) is a promising paradigm in Brain-Computer Interfaces (BCIs) for creating alternative channels of interaction with a wide range of peripheral equipment and devices. SSVEP-based BCIs rely on a specific neural reaction in the user that emerges in response to a dedicated periodic visual stimulus. In the paper we discuss some unresolved issues in the field and present a universal device that can precisely control the parameters of the stimulation and aid in studying the level of the subjects’ responses. The results of our pilot experiments underline the importance of taking into account the individual reactions of users to the presented photostimuli when designing SSVEP-based BCIs. In particular, individually customized stimulation frequencies with high response rates and minimal time delay can increase the speed and effectiveness of the interaction.
Keywords: electroencephalography · brain-computer interfaces · steady-state visually evoked potentials · photostimuli
1 Introduction
Brain-computer interfaces (BCIs) are systems that use information about the neural activity of the brain to generate control commands or actions. Thus, by registering neural activity and analyzing it, BCIs provide their users with an alternative communication channel to peripheral devices. As a rule, surface electroencephalography (EEG) is used as a tool to provide continuous registration of the user’s neural activity for further transmission to BCIs. The EEG signals have high temporal resolution [1, 2], and the approach itself supposes a noninvasive process of measuring the electrical activity of the cerebral cortex that does not require surgical intervention. This provides a high safety level for users of such BCIs, compared to BCIs based on electrocorticography (ECoG), which requires mandatory surgical intervention and thus presents a high level of risk. Due to the latter circumstance, surface EEG-based BCIs are currently the safest and most promising technology.
At the moment, various researchers have implemented the use of BCIs to control a wide variety of devices. For example, BCIs have been used to control motorized wheelchairs, to monitor the cognitive state of users during mental load, in rehabilitation systems for people who have suffered a stroke, etc. [3, 4]. All these examples demonstrate the extremely high benefits of these types of devices. However, there are still unresolved issues and problems in this field that make it difficult to integrate BCIs into people’s daily lives beyond the research laboratories. One of the most pressing problems we consider is the generation and presentation of photostimuli in SSVEP-based BCIs. Accordingly, we have designed and manufactured a device for the generation of such photostimuli.
2 Brain-Computer Interfaces Based on the SSVEP Paradigm
There are many paradigms and approaches on the basis of which BCIs are built and function. One of these paradigms is SSVEP. SSVEP (Steady-State Visually Evoked Potentials) is a specific neuronal response that occurs as a result of presenting a periodic visual stimulus to a subject [4]. This reaction is mainly concentrated in the occipital region of the cerebral cortex and can be recorded during the EEG in the O1, O2, and Oz leads, according to the “10–20” system of electrode placement. The main characteristics of these potentials are their frequency, which strictly depends on the frequency of the visual stimuli, and their power, which is usually significantly higher than the power of the EEG signal itself. Currently, SSVEP-based BCIs are among the most promising neural interfaces. Their advantages include [2]:
1. high speed of information transfer,
2. minimal time costs for user training,
3. relative ease of installation,
4. safety for BCI users.
Although SSVEP-based BCI systems are the most promising solution with several undeniable advantages, there are still some unresolved issues and limitations in their development and research. Thus, most of the current research in this area is focused on working with light stimuli of high frequency (for EEG signals this is more than 30 Hz [5]). The reason for this is that the low-frequency range of EEG signals, firstly, is affected by high-amplitude artifacts such as EOG artifacts (concentrated in the frequency range from 1 to 5 Hz [6]) and, secondly, is overlapped by the more powerful alpha rhythm of the EEG signal (concentrated in the frequency range from 8 to 14 Hz). The use of photostimulus frequencies from the middle frequency range of EEG signals (from 12 to 30 Hz [5]), according to research results, can lead to strong fatigue of the subject and, in some cases, can provoke an epileptic seizure in individuals suffering from photosensitive epilepsy [4]. Another important aspect is that, due to a subject’s individual characteristics, their response level to a certain stimulus frequency may be too low relative to the EEG signal itself, making this frequency unsuitable for use [5]. This makes it difficult to build a universal BCI that uses a single frequency range for all its users. Additionally, it should be noted that SSVEP potentials are usually most intense in the
frequency range around 15 Hz. However, the power of the evoked potential decreases with further increase of the stimulus frequency [7]. Many researchers and developers of BCI systems use personal computer displays as a source of photostimuli. This solution imposes certain requirements on the display; moreover, additional frequency components may still occur, which are divisors of the screen refresh rate [8]. To summarize the above, when designing SSVEP-based BCI systems, the selection of the frequency ranges used for the generation of photostimuli, as well as the choice of the stimulator itself, is highly relevant. It is relevant and expedient to use, as far as possible, not only the high but also the low frequency range. Firstly, this allows the frequency range used for stimulation to be significantly expanded and, secondly, the efficiency of such BCIs to be increased by using the most comfortable frequencies based on the individual characteristics of each user. At the same time, in order to carry out research in this area it makes sense, in a first approximation, to give preference to a dedicated photostimulator that provides generation of frequency-stable photostimuli and is compatible with most of the equipment for EEG signal registration.
3 Photostimuli Presenting Device
Many manufacturers of EEG research equipment also provide additional devices, such as various audio and photostimulators, personal patient buttons, eye-tracking systems, video monitoring cameras, etc. The availability of all this equipment significantly expands the range of EEG studies that can be performed. However, each of the listed types of devices can differ substantially in its characteristics. In particular, photostimulators, which are so necessary in the design of SSVEP-based BCIs, can differ from each other in the following characteristics:
• the type of light source: incandescent lamps or LEDs,
• the number of light sources: single LED or LED matrix,
• the area of the diffusing surface,
• compatibility with EEG recording equipment.
During the analysis of the available EEG equipment markets, we could not find photostimulators that would be compatible with most EEG equipment manufacturers and would, in addition, allow fine-tuning of the frequency and duration of the photostimuli. Therefore, we proposed the development of our own photostimulator with all the necessary characteristics and functionality (Fig. 1). The basis of our proposed device is the ATmega328P AVR microcontroller, which has a maximum clock frequency of 20 MHz. As the source of light emission, an LED matrix of 60 white-light LEDs was used. The range of operating frequencies used for the generation of photostimuli lies between 0.5 and 60 Hz with an increment of 0.25 Hz. The duration of the photostimuli can take values from 1 to 90 s with an increment of 0.5 s. The case of this model and all detachable elements (stand, set of diffusing plates, pattern-changing stencils, etc.) were made on a 3D printer via additive technologies using PLA plastic. The evenness and degree of illumination of the LED matrix are provided by a set of changeable diffusing plates of different thickness (from 2 to 0.3 mm), made with white PLA plastic. The maximum area of the diffusing surface in this case is 10 cm².
Fig. 1. The model of the developed photostimulator.
The availability of this model and a 3D printer will allow any researcher to make their own version of the photostimulator, so that the results obtained by different researchers will be highly comparable with each other. At the rear of the device there are a rotary encoder with a built-in button and a four-digit seven-segment display. Using the encoder, the user can navigate through the menu items, adjusting parameters such as the frequency and duration of the photostimuli or running the generation process. With the use of a set of interchangeable pattern-changing stencils made from non-transparent black PLA plastic, the researcher can set any shape and pattern of photostimuli, which may be useful for various experiments (Fig. 2).
Fig. 2. Examples of interchangeable pattern-changing stencils.
The photostimulator can be used either in a fully stand-alone mode, when all settings and the launching process are controlled by the user through the encoder, or paired with a PC. For the latter mode we have developed special software with a user-friendly graphical interface using the C++ programming language and the Qt library. This software, firstly, duplicates all the functionality of the device available in the stand-alone mode and, secondly, supports network transmission of the photostimulator signal through the Lab Streaming Layer (LSL) protocol [9]. The signal generated by our device is a sequence of rectangular pulses whose amplitude is set equal to the stimulation frequency used and whose duration coincides with the photostimulus period. Thus, this signal contains all the information about the stimulation parameters set in the device, as well as the start and end times of the photostimuli presentation. The use of the LSL protocol is a universal way to broadcast time series and measurement results in research experiments with various devices in real time [9]. In particular, modern EEG signal recording equipment and software increasingly support signal transmission to a local network via this protocol. This makes it possible to use our device in synchronous mode with EEG recording devices of any manufacturer that has implemented LSL protocol support in its software.
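As an illustration of how such a stream could be consumed on the receiving side, the sketch below uses the official Python bindings of LSL (pylsl). The stream name "Photostimulator" and the exact encoding of the samples (value equal to the stimulation frequency while stimulation is on, zero otherwise) are assumptions made for the example, not the exact protocol of our software.

```python
# Hypothetical receiver of the photostimulator's LSL stream (stream name and encoding are assumed).
from pylsl import StreamInlet, resolve_byprop

def log_stimulation_onsets(stream_name="Photostimulator"):
    # Find the photostimulator stream on the local network by its (assumed) name.
    streams = resolve_byprop("name", stream_name, timeout=10.0)
    if not streams:
        raise RuntimeError("Photostimulator stream not found on the network")
    inlet = StreamInlet(streams[0])

    previous = 0.0
    while True:
        sample, timestamp = inlet.pull_sample()
        value = sample[0]  # assumed: stimulation frequency (Hz) while on, 0.0 while off
        if previous == 0.0 and value > 0.0:
            print(f"stimulation started at {timestamp:.3f} s, frequency {value:.2f} Hz")
        elif previous > 0.0 and value == 0.0:
            print(f"stimulation stopped at {timestamp:.3f} s")
        previous = value

if __name__ == "__main__":
    log_stimulation_onsets()
```

Because the timestamps come from the same LSL clock as the EEG stream, onsets logged this way can be aligned with the recorded EEG without manual synchronization.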
4 The Results of Device Testing
Using the device we designed and the Mitsar-EEG-SmartBCI portable encephalograph with an MCScap textile EEG cap with pre-installed Ag/AgCl electrodes placed according to the “10–20” system, we performed a series of experiments involving the presentation of photostimuli. The sampling frequency of the EEG signals, as well as of the photostimulator signal, was set to 250 Hz. A digital notch filter was applied to all channels of the EEG signal to suppress line-noise interference near the frequency of 50 Hz. The study was approved by the ethics committee of the Faculty of Humanities Education of NSTU (01.02.2023, protocol №3). During each experiment, every subject was sequentially exposed to photostimuli in the frequency range from 5 to 25 Hz with a 1 Hz increment. The duration of each stimulation was 60 s. Throughout the experiment, the subject was in a comfortable chair in a relaxed state. According to the rules of EEG studies, the room where the subject was located was shadowed. The device for presenting photostimuli was in the direct sight of the subject at a distance of 1 m. Before the experiment, each subject was prescreened for the absence of any neurological disorders and injuries. There were 3 subjects in the pilot experiment, and their average age was 23.3 years.
During the experiment, the power spectral density values of each EEG channel were displayed on real-time charts. During the stimulation period, an increase in the power spectrum of the EEG signal was observed in the occipital channels at the stimulation frequency and, in most cases, at multiple frequencies as well (Fig. 3).
Fig. 3. The EEG signal chart for the Oz channel and its power spectral density during the subject’s exposure to 7 Hz photostimuli.
As can be seen from the figure, when photostimuli with a frequency of 7 Hz were presented to the subject, the power at the corresponding frequency in the EEG signal spectrum significantly increased. This is related to the occurrence of the SSVEP potential. In this chart we can also observe an increase of power at multiple frequencies, namely at 14, 21, and 28 Hz. At the same time, when the frequency of stimulation increased, the power of both the main and the multiple frequencies began to decrease. These phenomena are very typical for SSVEP potentials, and the latter one can also be used to distinguish the potentials themselves from normal EEG activity [7].
After completion of all experiments, using the MATLAB development environment (version R2022b), we plotted the dependence of the power of the SSVEP potentials on the frequency used (Fig. 4), as well as of the time delay before their onset relative to the start of photostimuli presentation (Fig. 5). A threshold value of 80% relative to the maximum peak power of each SSVEP potential was used when plotting these charts. This is needed in order to measure the time delay more accurately, since often, once the stimulus has been presented, the power amplitude peak of the SSVEP potential is already high enough but is ignored because there is a slightly larger peak, which may appear only towards the end of the stimulation.
As can be seen from the charts, the use of our device not only allowed SSVEP potentials to be evoked, but also demonstrated significant differences in the subjects’ reactions to the presented photostimuli for a single set of frequencies. Thus, the first subject (Fig. 4, Subject_01) demonstrated an acceptable response in terms of SSVEP potential power (relative to the background EEG activity values) in the frequency range from 7 to 22 Hz, but the time delay of the increase of SSVEP potentials was minimal only in the frequency range from 19 to 24 Hz and was about two to three seconds (Fig. 5, Subject_01). The frequencies of 9, 14, and 15 Hz can also be considered acceptable in terms of time delay.
Fig. 4. Dependence of SSVEP potentials power on photostimuli frequency.
Fig. 5. Dependence of SSVEP potentials time delay on photostimuli frequency.
The second subject (Fig. 4, Subject_02) demonstrated an acceptable SSVEP potential power response from 5 Hz to 18 Hz, after which the response level decreased significantly. However, the time delay of the SSVEP potentials can be considered acceptable only from 9 to 19 Hz (Fig. 5, Subject_02). The third subject (Fig. 4, Subject_03) demonstrated extremely low levels of SSVEP potential response, with acceptable values achieved in the frequency range from 8 to 12 Hz and at the individual stimulation frequencies of 20, 23, and 24 Hz. The time delay of the SSVEP potentials at most frequencies was unacceptably long, exceeding 20 s from the moment of stimuli presentation (Fig. 5, Subject_03). The minimal time delay for this subject was reached at the following frequencies: 10, 12, 13, 15, 16, 23, and 24 Hz and was about two to four seconds.
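The analysis above was performed in MATLAB; the Python sketch below only illustrates the kind of procedure described (power at the stimulation frequency in a sliding window and an 80%-of-peak threshold for the onset delay). The window length, step, and the exact power measure are our assumptions, not the MATLAB code actually used.

```python
# Illustrative re-implementation of the power/delay estimation; parameters are assumptions.
import numpy as np
from scipy.signal import welch

def ssvep_power_and_delay(oz, fs, stim_freq, win_s=4.0, step_s=0.5, threshold=0.8):
    """Track power at stim_freq in a sliding window over the Oz signal.

    Returns the peak power and the delay (s) at which the power first reaches
    `threshold` * peak, mimicking the 80%-of-maximum criterion described above.
    """
    win, step = int(win_s * fs), int(step_s * fs)
    times, powers = [], []
    for start in range(0, len(oz) - win, step):
        freqs, psd = welch(oz[start:start + win], fs=fs, nperseg=win)
        idx = np.argmin(np.abs(freqs - stim_freq))   # PSD bin closest to the stimulation frequency
        powers.append(psd[idx])
        times.append(start / fs)
    powers = np.asarray(powers)
    peak = powers.max()
    delay = times[int(np.argmax(powers >= threshold * peak))]
    return peak, delay

# Example with synthetic data: a 7 Hz SSVEP-like component appearing 3 s after stimulation onset.
fs = 250
t = np.arange(0, 60, 1 / fs)
signal = np.random.randn(t.size) + 2.0 * np.sin(2 * np.pi * 7 * t) * (t > 3)
print(ssvep_power_and_delay(signal, fs, stim_freq=7))
```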
5 Conclusion
SSVEP-based BCIs are a promising solution for creating alternative channels of interaction with a wide range of peripheral equipment and devices. However, there are still some unresolved and pressing issues in this area. The present paper describes the specifics of generating and presenting photostimuli when designing this kind of neural interface. As the results of the experiments with our proposed device show, it can become a good auxiliary instrument for studying the level of subjects’ reactions to presented photostimuli when designing SSVEP-based BCIs, as well as in any other research that requires generation of photostimuli. The results of the experiments using the developed device indicate that, when designing SSVEP-based BCIs, it is necessary to carefully consider the individual characteristics of each specific person. Otherwise, some of the frequencies used in the BCI may be either unusable or ineffective for a particular user. Individual selection of stimulation frequencies with high response rates and minimal time delay can significantly increase the speed and efficiency of SSVEP-based BCIs.
References
1. Aricò, P., Borghini, G., Di Flumeri, G., et al.: Passive BCI beyond the lab: current trends and future directions. Physiol. Measur. 39(8), 08TR02 (2018)
2. Chen, X., Wang, Y., Nakanishi, M., Gao, X., Jung, T.-P., Gao, S.: High-speed spelling with a noninvasive brain–computer interface. Proc. Natl. Acad. Sci. 112(44), E6058–E6067 (2015)
3. Zander, T.O., Kothe, C., Jatzev, S., Gaertner, M.: Enhancing human-computer interaction with input from active and passive brain-computer interfaces. In: Tan, D.S., Nijholt, A. (eds.) Brain-Computer Interfaces, pp. 181–199. Springer London, London (2010). https://doi.org/10.1007/978-1-84996-272-8_11
4. Popova, V.A., Gremitsky, I.S.: Post-stroke rehabilitation system based on stationary visual-evoked potentials. Politekhnicheskiy molodezhnyy zhurnal [Politechnical student journal], no. 10(63) (2021). https://doi.org/10.18698/2541-8009-2021-10-741.html
5. Zhu, D., Bieger, J., Molina, G.G., Aarts, R.M.: A survey of stimulation methods used in SSVEP-based BCIs. Comput. Intell. Neurosci. 2010, 1–12 (2010)
6. Zhang, J., Gao, S., Zhou, K., Cheng, Y., Mao, S.: An online hybrid BCI combining SSVEP and EOG-based eye movements. Front. Human Neurosci. 17, 1103935 (2023)
7. Liu, B., Huang, X., Wang, Y., Chen, X., Gao, X.: BETA: a large benchmark database toward SSVEP-BCI application. Front. Neurosci. 14, 627 (2020)
8. Chen, X., et al.: Optimizing stimulus frequency ranges for building a high-rate high frequency SSVEP-BCI. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 1277–1286 (2023)
9. Lab Streaming Layer Homepage. https://labstreaminglayer.org. Accessed 01 June 2023
Graph Neural Networks for Analysis of rs-fMRI Differences in Open vs Closed Conditions
Tatiana Medvedeva1, Irina Knyazeva1,2(B), Ruslan Masharipov1, Maxim Kireev1,2, and Alexander Korotkov1
1 N.P. Bechtereva Institute of the Human Brain, Russian Academy of Sciences, Saint-Petersburg, Russia
[email protected]
2 Saint-Petersburg State University, Saint-Petersburg, Russia
Abstract. Functional Magnetic Resonance Imaging (fMRI) is a noninvasive neuroimaging technique widely used for research purposes. Application of fMRI for medical purposes is still very limited in spite of its considerable potential for offering valuable prognostic and differential diagnostic information. One of the problems limiting the use of fMRI in medical settings is that fMRI data is represented as a four-dimensional array of information, so diagnostics relies only on the methods employed for data processing, while visual analysis of raw data is impossible. Thus, further development of the use of fMRI in clinical practice directly depends on the effectiveness and reliability of the data processing methods used. Resting-state scanning is the main mode used in clinical neuroimaging. Resting-state fMRI (RS-fMRI) data can be collected under three conditions: eyes closed (EC), eyes open (EO), and eyes fixated on a target (EO-F), each presenting distinct neuronal activity patterns. It is widely acknowledged that significant differences exist between these three states, making the classification of eyes open/closed states a robust basis for verifying models that can be used for diagnostic purposes. We have studied the performance of graph neural networks (GNNs) in identifying dissimilarities between the eyes closed and eyes fixated conditions. Additionally, we employ interpretation algorithms to gain insights into the crucial edges influencing the GNN model’s classification. Our proposed GNN model achieves an accuracy of up to 81% in distinguishing between these conditions, with notable brain regions, including visual networks, the default mode network, and the frontoparietal cognitive control network, playing a vital role in accurate classification, consistent with findings from the existing literature. Our research highlights the potential of GNNs as a promising approach for exploring functional connectivity differences in RS-fMRI data.
Keywords: Graph neural networks · RS-fMRI · Functional connectivity · Explainable AI · open-close conditions
1 Introduction
In the modern era, the diagnosis of nearly all brain disorders heavily relies on neuroimaging studies. Among these, functional data obtained through functional Magnetic Resonance Imaging (fMRI) is considered to be more informative than structural data. While structural data provides high-resolution three-dimensional images, functional data adds an additional dimension of insight. However, due to the complexity of fMRI data, visual analysis becomes challenging even for highly skilled specialists, leading to a significant dependence of diagnostic outcomes on the chosen analysis methods. Typically, diagnostic procedures are conducted during the resting state, where Resting-State fMRI (RS-fMRI) is utilized. Notably, RS-fMRI can be performed under three different conditions: eyes closed (EC), eyes open (EO), and eyes fixated on a target (EO-F). It is widely acknowledged that differences exist between these conditions. Various studies have compared the operational resting states, and a growing number of papers continue to explore this area; an exhaustive overview is given in [21]. Based on the analysis of these works, it becomes evident that distinct differences exist between the states of operational rest. However, it is crucial to note that different studies often report different effects, with limited overlap among their findings. Specifically, [22] showed that the default mode network had stronger connectivity for eyes fixated than for eyes closed. In [17] the authors reported that default mode network connectivity maps are largely similar across the eyes open, eyes fixated, and eyes closed conditions; in addition, during the eyes open condition the visual networks are activated the most, while eyes fixated results in the highest activations in the default mode network and the attention network. In [1], the authors found notable differences in the visual network between eyes closed and fixated, as an obvious consequence of different visual input. During EO-F, the connectivity of visual networks both to themselves and to other networks was enhanced, except for the auditory and sensorimotor networks, which had higher connectivity during EC. This discrepancy may be attributed to several factors, including the relatively small number of subjects in each experiment, statistical issues related to reproducibility, and the diversity of approaches employed in these studies.
Usually, analysis of fMRI data is performed using seed-based connectivity analysis with mass-univariate methods, which include standard statistical testing in the search for differences between experimental conditions. Such methods are limited in their ability to explore complex brain networks, because the analysis is performed on one variable at a time (i.e., in a voxel-by-voxel manner), which makes them incapable of processing the information as a whole [14]. In contrast to the mass-univariate approach, multivariate analysis can be used to overcome this limitation and take into account the activity in a number of voxels or brain regions. It is known that brain systems can be represented as a graph with regions of interest as nodes and connections between them as edges, and thus can be analyzed in terms of graph theory [3]. In recent years, graph neural networks (GNN) have shown great promise in analyzing complex brain networks captured by fMRI data [2,6,12,15,19]. Graph neural networks are a type of deep learning
models specifically designed to handle graph-structured data. In a graph, nodes represent entities, and edges represent relationships between these entities. GNNs analyze these relationships using the message passing paradigm: 1) iteratively aggregating information from neighboring nodes and 2) updating node representations based on the local graph structure. GNNs have been applied in a handful of fMRI studies aimed at classifying subjects with disorders versus healthy controls. This task can be reformulated as a whole-graph classification task. As an example of GNN applications in neuroscience, in [19] the authors use a graph convolutional network to characterize subjects with depression; the task was to classify subjects with Major Depressive Disorder (MDD) versus healthy controls (HC). In [2], the authors use a GCN to predict a subject’s sex based on functional connectivity. The model takes a graph obtained from brain parcellation and rs-fMRI connectivity and propagates it through a network consisting of five graph convolutional layers, global average pooling, linear layers, and ReLU as the activation function. The model achieved 88% accuracy. What is more, the authors applied class activation mapping to the GCN to identify brain regions unique for each sex: females had the default mode network activated more, while males, to a greater extent, had higher functional connectivity measures in the sensorimotor and visual cortices.
Our review of previous works has shown that a study of functional connectivity differences between the EO and EC conditions using graph neural networks has not been conducted yet. Nevertheless, the distinct neuronal activity patterns observed under these conditions contribute to the potential utility of classifying eyes open/closed states as a robust basis for validating diagnostic models. The goal of this work is to study the differences between the EC and EO-F conditions using graph convolutional networks. Successful classification with results consistent with previous studies would support the use of this approach in clinical diagnostics.
2 Data and Methods
Data Preprocessing. The dataset consisted of 84 healthy subjects, with one eyes-fixated and one eyes-closed image for each subject, 168 images in total. The study protocol was approved by the Ethics Committee of the N.P. Bechtereva Institute of the Human Brain, Russian Academy of Sciences (ethical protocols dated 18 April 2013 and 16 March 2017). Preprocessing was performed in the CONN Toolbox for MATLAB with default settings. Functional and anatomical data were preprocessed using a flexible preprocessing pipeline [16] including realignment with correction of susceptibility distortion interactions, slice timing correction, outlier detection, direct segmentation and MNI-space normalization, and smoothing. In addition, functional data were denoised using a standard denoising pipeline [16] including the regression of potential confounding effects characterized by white matter timeseries (5 CompCor noise components), CSF timeseries (5 CompCor noise components), motion parameters and their first order derivatives (12 factors) [7], outlier scans (below 29 factors) [18], session and task effects and their
first order derivatives (4 factors), and linear trends (2 factors) within each functional run, followed by bandpass frequency filtering of the BOLD timeseries [9] between 0.008 Hz and 0.09 Hz. CompCor [4,5] noise components within white matter and CSF were estimated by computing the average BOLD signal as well as the largest principal components orthogonal to the BOLD average, motion parameters, and outlier scans within each subject’s eroded segmentation masks. For brain parcellation we used the HCPex atlas with 426 regions of interest [10]. HCPex is a volumetric version of the conventionally used HCP-MMP v1.0 [8], which in addition to 360 cortical regions includes 66 subcortical structures. Functional connectivity between two regions was computed as Pearson’s correlation coefficient. A functional connectivity (FC) matrix of size 426 × 426 was calculated for every subject, where every element of the matrix represents the functional connectivity strength between the corresponding regions.
Dataset. We used the functional connectivity matrix as an adjacency matrix to create a brain graph for every subject. In order to eliminate redundant connections in the graphs we used the k-nearest neighbours (kNN) algorithm. We took the vector of nodal functional connectivity as node features (i.e., the corresponding row of the FC matrix), as described in [2,19]. Deep learning algorithms are highly dependent on the amount of training data: the more data there is, the better a model can generalize and the more accurate its predictions are. We used data augmentation to increase the number of data samples by adding transformations to the existing samples. For that, we added white Gaussian noise to the initial BOLD timeseries. During dataset preparation for model training, we tuned the number of noised samples to be used; usually it did not exceed 80%.
Classification Model. As a baseline model we used logistic regression (LR). Logistic regression is a basic linear model for a classification task and requires input in vector form. To create an appropriate input for LR we flattened the connectivity matrices. In addition to the increased dimensionality of the input data, important information about the initial graph structure is lost after flattening. To mitigate this, we additionally used principal component analysis (PCA) for dimensionality reduction. As the main model we used a special kind of graph neural network: the graph convolutional network. GNN architectures diverge in how node representations are calculated. Graph convolutional networks (GCN) are one of the most popular graph architectures. The idea behind graph convolution is similar to that of traditional convolutional neural networks (CNNs), but instead of processing data arranged in a grid-like structure, graph convolutional networks operate on irregular and non-Euclidean data [13]. In a GCN, the aggregation function is an average of the neighbouring nodes’ features in the k-hop neighbourhood, where k represents a layer of the graph neural model (e.g., a two-layer model aggregates the 2-hop neighbourhood). The update function is a linear transformation, i.e., multiplication by a weight matrix that is learnt during the training process. GCNs typically consist of multiple graph convolutional layers, allowing the model to capture complex patterns and relationships in the graph data.
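The graph construction described in the Dataset paragraph above can be sketched as follows with PyTorch Geometric. The value of k, the use of absolute correlations for neighbour selection, and the symmetrization rule are illustrative assumptions, not necessarily the exact choices of our pipeline.

```python
# Sketch: turn a 426 x 426 functional connectivity matrix into a PyG graph.
import numpy as np
import torch
from torch_geometric.data import Data

def fc_to_graph(fc: np.ndarray, label: int, k: int = 10) -> Data:
    n = fc.shape[0]
    fc = fc.copy()
    np.fill_diagonal(fc, 0.0)                      # drop self-connections

    # Keep the k strongest connections of every node (kNN sparsification).
    keep = np.zeros_like(fc, dtype=bool)
    for i in range(n):
        neighbours = np.argsort(-np.abs(fc[i]))[:k]
        keep[i, neighbours] = True
    keep = keep | keep.T                           # make the graph undirected

    rows, cols = np.nonzero(keep)
    edge_index = torch.tensor(np.stack([rows, cols]), dtype=torch.long)
    edge_weight = torch.tensor(fc[rows, cols], dtype=torch.float)

    x = torch.tensor(fc, dtype=torch.float)        # node features: rows of the FC matrix
    y = torch.tensor([label], dtype=torch.long)    # e.g., 0 = eyes closed, 1 = eyes open-fixated
    return Data(x=x, edge_index=edge_index, edge_attr=edge_weight, y=y)
```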
The proposed GCN model consisted of 2 graph convolutional layers, global average pooling, one linear layer for classification, and the ReLU activation function (a schematic representation is shown in Fig. 1). Adding more layers or increasing layer dimensionality did not improve the results. GNNs are prone to overfitting (when a model fails to generalize and performs badly on unseen data) and oversmoothing (when node representations become similar among nodes so that the structural information is lost) as the number of layers increases. Therefore, two graph layers and one linear layer were enough in our case not to make the model too complex. To lessen overfitting, a learning rate scheduler was used to lower the learning rate every 10 epochs by 0.1. The goal of using a learning rate scheduler is to fine-tune the model’s learning rate to ensure that it converges to the optimal solution efficiently.
Fig. 1. Schematic representation of the proposed model
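A compact PyTorch Geometric sketch of the architecture in Fig. 1 is given below; the layer widths follow Table 1 (128 and 32 channels), while the optimizer type and initial learning rate are assumptions made for the example.

```python
# Sketch of the two-layer GCN with global average pooling used for EC vs EO-F classification.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class BrainGCN(torch.nn.Module):
    def __init__(self, in_channels=426, hidden1=128, hidden2=32, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden1)
        self.conv2 = GCNConv(hidden1, hidden2)
        self.classifier = torch.nn.Linear(hidden2, num_classes)

    def forward(self, x, edge_index, edge_weight, batch):
        x = F.relu(self.conv1(x, edge_index, edge_weight))
        x = F.relu(self.conv2(x, edge_index, edge_weight))
        x = global_mean_pool(x, batch)              # global average pooling over nodes
        return self.classifier(x)

model = BrainGCN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimizer and lr are assumptions
# Lower the learning rate every 10 epochs, as described in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```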
Saliency Mapping. In the context of model interpretability, specialized explanation techniques play a crucial role in revealing the underlying logic of models. These explanation algorithms can be categorized based on the methods they employ to understand model outputs. One such category is gradient-based algorithms, which includes saliency [20] and class activation mapping (CAM) [23]. These algorithms can be adapted for a wide range of deep learning models, as they utilize gradients or features from hidden layers to determine the importance of various features. Saliency mapping, in particular, is a method used to identify the significance of edges and nodes in making predictions. An advantage of this approach is its
applicability to diverse deep learning models, including graph convolutional networks. By examining the saliency map, one can observe how the model’s response would change if the input data were slightly altered. Specifically, saliency is represented as the gradient of the output score S_c corresponding to class c with respect to the input edge weight. This derivative is typically obtained through backpropagation, resulting in a matrix of the same shape as the input adjacency matrix. Values close to zero in this matrix indicate that a particular edge has negligible impact on the output, whereas values of higher magnitude contribute significantly to the final score. For group-level results, the frequency of appearance of each edge in the saliency map among all subjects is considered, as described in previous studies [2,19]. By employing this approach, researchers can gain valuable insights into the shared contributions of different edges across the subjects under investigation. Such interpretability techniques are instrumental in enhancing our understanding of complex deep learning models and their decision-making processes, thereby facilitating their application in various domains.
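The gradient-based edge saliency described above can be obtained directly with autograd; the sketch below assumes the BrainGCN and graph-construction sketches given earlier and is not the exact implementation used in this study.

```python
# Sketch: edge saliency as the gradient of the class score S_c w.r.t. the edge weights.
import torch

def edge_saliency(model, data, target_class):
    model.eval()
    edge_weight = data.edge_attr.clone().requires_grad_(True)
    batch = torch.zeros(data.num_nodes, dtype=torch.long)   # single-graph batch
    scores = model(data.x, data.edge_index, edge_weight, batch)
    scores[0, target_class].backward()
    # Saliency of each edge: magnitude of the gradient of S_c w.r.t. its weight.
    return edge_weight.grad.abs().detach()
```

Group-level maps then follow by thresholding each subject's saliency vector and counting in how many subjects a given edge survives the threshold.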
3 Results
Classification Performance. For evaluation of the models, we used group 10-fold cross-validation. The main idea of this cross-validation technique is that all data attributed to one subject goes only to the train or to the test split. In our work, one group was represented by all data for one subject (the initial EO-F and EC images and the augmented ones). Our GCN model achieved 81% accuracy; the results are shown in Table 1. Even though the metrics are quite high, the variation of the values is high too. The reason for that is the small training dataset: in order to train more robust models, more data is required. Depending on the layer dimensionality, both the accuracy of the model and the number of parameters change. The more parameters a model has, the more data is needed to train it well. On the other hand, the more channels there are, the more features the model can extract. The convolutional model with 128 and 32 channels was chosen for further analysis, since it showed better and more robust performance. In comparison, logistic regression with PCA achieves 73% accuracy.

Table 1. Classification performance on 10-fold cross-validation. Mean accuracy and standard deviation over 10 folds are shown.

Model                           Channels   Trainable parameters   Accuracy, %
Graph convolutional network     128, 32    58 594                 81 ± 5
Graph convolutional network     256, 32    117 090                79 ± 9
Logistic regression + PCA       N/A        N/A                    73 ± 10
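The grouping logic can be expressed with scikit-learn's GroupKFold, shown below for the logistic regression baseline. Keeping all original and augmented samples of a subject in one group is the point of the procedure; the PCA dimensionality and maximum iteration count are assumed values.

```python
# Sketch of group 10-fold cross-validation: all samples of a subject stay in one split.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline

# X: flattened FC matrices as a numpy array (n_samples, 426 * 426), y: 0 = EC, 1 = EO-F,
# groups: subject id of each sample (identical for original and augmented copies).
def evaluate_baseline(X, y, groups, n_components=50):
    model = make_pipeline(PCA(n_components=n_components), LogisticRegression(max_iter=1000))
    accuracies = []
    for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)
```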
Salient Regions Important for Classification. Based on the resulting saliency matrices, we can build a visualization. For clarity, we took the edges that occurred in 80% of the subjects. Figure 2 shows the edges that the model uses to classify the EO-F and EC conditions. In the figure, the size of a node represents its nodal degree, and colors represent the network to which a region belongs. The network partition is based on [11]. For EC there are only two regions (and accordingly one edge) that occur in all subjects: the left and right primary visual cortex. As for EO-F, in addition to the primary visual cortex, area 1 left and right of the somatosensory and motor cortex is found in all subjects. Notably, there are no highlights in the ventral multimodal network for either condition under the 80% threshold. To a greater or lesser extent, the model highlights all networks. However, in general there are more highlighted edges in EO-F than in EC (e.g., in the subcortical regions, CON, FPN, LAN, DMN).
4 Conclusion and Discussion
In this research, we employed graph neural networks (GNNs) to effectively classify closed and open fixated eyes based on functional connectivity. Additionally, we conducted an in-depth analysis of the edges crucial for prediction in this context. The resulting GNN model demonstrated a remarkable accuracy of 81% in distinguishing between the EO-F and EC conditions, surpassing the performance of other machine learning models. Overall, the model predominantly leveraged information from the visual networks, the default mode network, and the frontoparietal cognitive control network, aligning with previous findings. To conclude, the results of our research demonstrate the potential of graph neural networks for the classification of resting states based on functional connectivity differences in RS-fMRI data.
Limitations of the Study and Further Research. Nevertheless, it is essential to acknowledge certain limitations of this study. Firstly, only one algorithm was utilized for model interpretation, while various explanation approaches, particularly those tailored for GNNs, remain unexplored. Moreover, the impact of demographic factors such as age and gender on the classification results was not investigated. Future investigations addressing these potential confounding variables could provide a more nuanced understanding of the model’s performance. Additionally, the robustness of the classification and interpretation results with respect to the choice of brain atlas was not assessed. A thorough exploration of the sensitivity of the model to atlas changes would strengthen the reliability and generalizability of the findings. It is also worth noting that, despite its high accuracy, the GNN model exhibited limited robustness as observed through cross-validation. The considerable variation of metrics on different shuffles indicates a high risk of overfitting. To address this concern, several potential approaches can be explored.
Fig. 2. Visualization of the salient regions and connections; size of nodes represents the nodal degree; color represents networks; abbreviations: primary visual (VIS1), secondary visual (VIS2), auditory (AUD), somatomotor (SMN), cingulo-opercular (CON), default mode (DMN), dorsal attention (DAN), language (LAN), frontoparietal cognitive control (FPN), posterior multimodal (PMM), ventral multimodal (VMM), orbitoaffective (ORA), subcortical (SUB)
One avenue involves acquiring more data to enhance model generalization, or employing transfer learning techniques to leverage knowledge from related tasks. Another strategy entails refining the model architecture to reduce the number of trainable parameters, thereby enhancing its generalization capability. Successful application of these approaches to resting state classification should confirm the promising utility of graph convolutional networks for diagnostic purposes.
Acknowledgements. The paper was supported by the state assignment of IHB RAS (No. 122041500046-5, FMMW-2022-0001).
References
1. Agcaoglu, O., Wilson, T.W., Wang, Y.P., Stephen, J., Calhoun, V.D.: Resting state connectivity differences in eyes open versus eyes closed conditions. Hum. Brain Mapp. 40(8), 2488–2498 (2019). https://doi.org/10.1002/hbm.24539
2. Arslan, S., Ktena, S.I., Glocker, B., Rueckert, D.: Graph saliency maps through spectral convolutional networks: application to sex classification with brain connectivity. In: Stoyanov, D., et al. (eds.) GRAIL/Beyond MIC 2018. LNCS, vol. 11044, pp. 3–13. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00689-1_1
3. Bassett, D.S., Sporns, O.: Network neuroscience. Nat. Neurosci. 20(3), 353–364 (2017). https://doi.org/10.1038/nn.4502
4. Behzadi, Y., Restom, K., Liau, J., Liu, T.T.: A component based noise correction method (CompCor) for BOLD and perfusion based fMRI. Neuroimage 37(1), 90–101 (2007). https://doi.org/10.1016/j.neuroimage.2007.04.042
5. Chai, X.J., Castañón, A.N., Öngür, D., Whitfield-Gabrieli, S.: Anticorrelations in resting state networks without global signal regression. NeuroImage 59(2), 1420–1428 (2012). https://doi.org/10.1016/j.neuroimage.2011.08.048
6. Cui, H., et al.: BrainGB: a benchmark for brain network analysis with graph neural networks. arXiv:2204.07054 (2022). http://arxiv.org/abs/2204.07054
7. Friston, K.J., Williams, S., Howard, R., Frackowiak, R.S., Turner, R.: Movement-related effects in fMRI time-series. Magn. Reson. Med. 35(3), 346–355 (1996). https://doi.org/10.1002/mrm.1910350312
8. Glasser, M.F., et al.: A multi-modal parcellation of human cerebral cortex. Nature 536(7615), 171–178 (2016). https://doi.org/10.1038/nature18933
9. Hallquist, M.N., Hwang, K., Luna, B.: The nuisance of nuisance regression: spectral misspecification in a common approach to resting-state fMRI preprocessing reintroduces noise and obscures functional connectivity. Neuroimage 82, 208–225 (2013). https://doi.org/10.1016/j.neuroimage.2013.05.116
10. Huang, C.C., Rolls, E.T., Feng, J., Lin, C.P.: An extended human connectome project multimodal parcellation atlas of the human cortex and subcortical areas. Brain Struct. Funct. 227(3), 763–778 (2022). https://doi.org/10.1007/s00429-021-02421-6
11. Ji, J.L., Spronk, M., Kulkarni, K., Repovš, G., Anticevic, A., Cole, M.W.: Mapping the human brain’s cortical-subcortical functional network organization. NeuroImage 185, 35–57 (2019). https://doi.org/10.1016/j.neuroimage.2018.10.006
12. Kim, B.H., Ye, J.C.: Understanding graph isomorphism network for rs-fMRI functional connectivity analysis. Front. Neurosci. 14 (2020). https://www.frontiersin.org/article/10.3389/fnins.2020.00630
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv:1609.02907 (2017). http://arxiv.org/abs/1609.02907
14. Kubilius, J., Baeck, A., Wagemans, J., Op de Beeck, H.P.: Brain-decoding fMRI reveals how wholes relate to the sum of parts. Cortex 72, 5–14 (2015). https://doi.org/10.1016/j.cortex.2015.01.020
15. Li, X., et al.: BrainGNN: interpretable brain graph neural network for fMRI analysis. Med. Image Anal. 74, 102233 (2021). https://doi.org/10.1016/j.media.2021.102233
16. Nieto-Castanon, A.: Handbook of Functional Connectivity Magnetic Resonance Imaging Methods in CONN. Hilbert Press, Paris (2020). https://doi.org/10.56441/hilbertpress.2207.6598
17. Patriat, R., et al.: The effect of resting condition on resting-state fMRI reliability and consistency: a comparison between resting with eyes open, closed, and fixated. NeuroImage 78, 463–473 (2013). https://doi.org/10.1016/j.neuroimage.2013.04.013
18. Power, J.D., Mitra, A., Laumann, T.O., Snyder, A.Z., Schlaggar, B.L., Petersen, S.E.: Methods to detect, characterize, and remove motion artifact in resting state fMRI. Neuroimage 84, 320–341 (2014). https://doi.org/10.1016/j.neuroimage.2013.08.048
19. Qin, K., et al.: Using graph convolutional network to characterize individuals with major depressive disorder across multiple imaging sites. eBioMedicine 78 (2022). https://doi.org/10.1016/j.ebiom.2022.103977
20. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps (2014). http://arxiv.org/abs/1312.6034
21. Weng, Y., et al.: Open eyes and closed eyes elicit different temporal properties of brain functional networks. NeuroImage 222, 117230 (2020). https://doi.org/10.1016/j.neuroimage.2020.117230
22. Yan, C., et al.: Spontaneous brain activity in the default mode network is sensitive to different resting-state conditions with limited cognitive load. PLoS ONE 4(5), e5743 (2009). https://doi.org/10.1371/journal.pone.0005743
23. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization (2015)
Does a Recurrent Neural Network Form Recognizable Representations of a Fixed Event Series?
Galiya M. Markova1,2(B) and Sergey I. Bartsev1,2
1 Siberian Federal University, Krasnoyarsk, Russian Federation
[email protected]
2 Institute of Biophysics SB RAS, Krasnoyarsk, Russian Federation
Abstract. Functioning in the flow of events is possible because the subject recognizes current events as familiar and acts in accordance with a representation of them. The presence of internal representations is called reflection in a broad sense and can be regarded as a requirement for effectively solving some tasks, for example, winning a reflexive game. We consider tasks that imitate the Even-odd and Rock-scissors-paper games by replacing a playmate with fixed sequences of moves. In this paper, we investigate whether it is possible to identify which of the available fixed sequences of moves is currently being received by the model object – a simple recurrent neural network – by decoding the signals on its neurons. We show that a neural-network-based decoding method allows recognizing the current sequence of moves from the neural activity of the playing network, separating out data that does not correspond to any of the known sequences. Therefore, simple recurrent neural networks can form stable recognizable representations associated with fixed sequences of game events in the imitation of reflexive games. This result indicates that these model objects implement reflexive processing of information and can be used for studying the reflection phenomenon.
Keywords: Neural-network based decoding · reflexive games · neural activity patterns · neural coding
1 Introduction

One of the tasks of modern cognitive neuroscience is to reconstruct processed information from the corresponding neural activity. EEG and fMRI data are used to identify links between sensory stimuli and patterns of brain excitation [1, 2], and attempts are made to determine neural codes of behavioral responses [3]. Encoding and decoding of neural activity is of great importance for neurotechnology as the basis of the ability to read and write minds [4]. Cognitive functions of human and artificial neural networks, such as visual object recognition, are compared in the context of neural decoding [5]. However, decoding the activity of the human brain is associated with a number of problems due to its extremely complex structure. For example, neural activity patterns corresponding to received stimuli are highly individual and dynamic [6, 7].
By reproducing the simplest situation in which neural activity can be modeled and decoded, one can reveal the most general properties of the phenomenon and obtain hints for the further development of decoding methods for more complex systems, in accordance with J. von Neumann's heuristic approach [8]. As suitable model objects, we consider simple recurrent neural networks (RNNs) [9]. The representation of stimuli received by an RNN can be considered as the dynamical pattern of its neural activity. The presence of such representations is called reflection in the broad sense, and their formation is a necessary element of reflexive information processing. It can be realized only if there are objectively distinguishable internal states of the perceiving system [10]. As a task that requires reflexive processing of input data to achieve an effective solution, we proposed reflexive games [11]. In this kind of game interaction, a player who better predicts the next game steps by using an adequate internal representation of the playmate's behavior wins consistently. To achieve reproducibility of game situations and, as a result, to conduct a more reliable search for recognizable representations, we decided to use imitations of reflexive games, where RNNs interact with fixed sequences of moves (time series) instead of a real playmate. In previous studies we considered the possibility of decoding the stimuli received by an RNN using its neural activity patterns during the delayed match-to-sample test [12] and an Even-odd game imitation [10]. The obtained results did not allow us to state that each of the possible stimuli time series corresponds to a recognizable representation in the patterns of neural activity. The reason is the following: the decoders that we applied (layered feedforward neural networks with one hidden layer) were trained to recognize only four classes of neural activity, which is numerically equal to the number of used time series. Such a decoder, even after achieving high decoding accuracy, is unable to distinguish "real" or "meaningful" input data (data that really corresponds to a time series) from "fake" or "meaningless" data (a set of random numbers or neural activity that does not correspond to a known time series). To deal with this issue, we present a modified method of neural-network-based decoding. This study is devoted to decoding neural activity data recorded in the course of two reflexive game imitations. The goal is to assess the possibility of identifying which time series of stimuli a recurrent neural network receives during the imitation of the Even-odd and Rock-scissors-paper games by decoding its neural activity patterns. The neural-network-based decoding method is extended with an extra class of data. This class is intended for input data that corresponds to none of the used time series.
2 Materials and Methods

We used simple homogeneous RNNs (25 neurons), the functioning of which is described by the formulas:

a_i^{n+1} = \frac{\rho_i^n}{a + |\rho_i^n|}, \qquad \rho_i^n = \sum_j w_{ij}\,\alpha_j^n + A_i^n,   (1)

where \alpha_j^n is the output signal of the j-th neuron at the n-th cycle; w_{ij} is the matrix of weight coefficients; A_i^n is the input signal received by the i-th neuron at the n-th cycle; and a is a constant that determines the steepness of the transient response of the neuron.
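As a rough illustration of Eq. (1), the update step can be sketched as follows. The 25-neuron size and the form of the activation follow the text, while the value of a, the weight initialization and the input encoding are assumptions made only for the example (the authors implemented the network in the Lazarus environment).

```python
import numpy as np

class SimpleRNN:
    """Minimal sketch of the homogeneous RNN of Eq. (1)."""

    def __init__(self, n_neurons=25, a=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.a = a                                               # steepness constant a from Eq. (1)
        self.w = rng.uniform(-0.1, 0.1, (n_neurons, n_neurons))  # weight matrix w_ij (illustrative init)
        self.state = np.zeros(n_neurons)                         # outputs alpha_j^n of the previous cycle

    def step(self, external_input):
        """One clock cycle; external_input is the vector A_i^n (nonzero only on the input neurons)."""
        rho = self.w @ self.state + external_input               # rho_i^n = sum_j w_ij * alpha_j^n + A_i^n
        self.state = rho / (self.a + np.abs(rho))                # a_i^{n+1} = rho_i^n / (a + |rho_i^n|)
        return self.state

# Example: feed the Even-odd code "01" through the first two neurons.
rnn = SimpleRNN()
inputs = np.zeros(25)
inputs[:2] = [0.0, 1.0]
activity = rnn.step(inputs)
```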
In the case of the Even-odd (EO) imitation, information about the playmate's move on the previous game step entered the RNN through two inputs: 01 if the playmate chose "0", and 10 if "1". In the case of the Rock-scissors-paper (RSP) imitation, three inputs were required: 100 if the playmate chose "Rock", 010 for "Scissors" and 001 for "Paper". In a similar way, the move of the RNN itself was determined by the ratio of the signals on the output neurons (2 output neurons for EO and 3 for RSP, respectively). A quadratic loss function was used:

C = \frac{1}{2} \sum_{i=1}^{2} \left( a_i^n - \delta_i^n \right)^2,   (2)

where a_i^n and \delta_i^n are the actual and required signals on the i-th output neuron of the RNN at time moment n. In the EO imitation, the required signal was determined in accordance with whether the RNN played for "Even" or "Odd": in the first case, the network was required to make the same move as the playmate, in the second, the opposite one. In the RSP imitation, "Rock" defeats "Scissors", "Scissors" defeats "Paper" and "Paper" defeats "Rock". As a quasi-playmate in the imitation of reflexive games we used the fixed sequences of moves (time series) presented in Table 1.

Table 1. A list of the fixed sequences of moves used in this study.

                      Time series 1   Time series 2   Time series 3   Time series 4
Even-odd              110011001100    101100101100    010011010011    111000111000
Rock-scissors-paper   120012001200    012102012102    211200211200    220101220101
Three separate groups of RNNs were trained to process these time series according to the game rules (one group for the "Even" position and one for the "Odd" position in the EO imitation, and another group for the RSP imitation). A processing quality of at least 0.85 was required for both game imitations. We determined empirically that a higher quality is unrealizable under the considered conditions due to a period of non-stable functioning of the RNN when switching from one time series to another. Different time series were presented to the RNN in a random order, so it took several clock cycles to recognize a changed game situation. As a decoder (DN), we used layered neural networks consisting of 10 hidden neurons with a sigmoid activation function:

f_h(x) = \frac{1}{2}\left(\frac{x}{a + |x|} + 1\right),   (3)

and 5 output neurons with a linear activation function:

f_o(x) = \begin{cases} 0, & x \le 0, \\ x, & 0 < x < 1, \\ 1, & x \ge 1. \end{cases}   (4)
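A small sketch of the decoder's forward pass using the activations of Eqs. (3) and (4); the constant a and the weight matrices are placeholders for illustration only.

```python
import numpy as np

def f_hidden(x, a=0.5):
    """Hidden activation of Eq. (3): 0.5 * (x / (a + |x|) + 1)."""
    return 0.5 * (x / (a + np.abs(x)) + 1.0)

def f_output(x):
    """Output activation of Eq. (4): identity clipped to [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def decode(activity, w_hidden, w_output):
    """Map a 25-dimensional RNN activity vector to one of 5 classes
    (four time series plus the extra class) via 10 hidden neurons."""
    hidden = f_hidden(w_hidden @ activity)   # w_hidden: 10 x 25
    scores = f_output(w_output @ hidden)     # w_output: 5 x 10
    return int(np.argmax(scores))            # the output neuron with the largest signal
```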
The number of hidden neurons was selected empirically as the smallest amount at which the percentage of successfully and quickly trained DNs is high enough. The DN's response, i.e. the number of the identified time series, was determined by the number of the output neuron that produced the largest signal. Every hidden neuron of the DN had a synaptic connection with each of the inputs, the number of which was equal to the number of neurons of the RNN. The DN's loss function was also quadratic (2). Training of the RNN and the DN was carried out using the backpropagation algorithm with a propagation depth equal to 5 clock cycles. As the input data for decoding, we took the neural activity of the RNN after the training procedure. The RNN was required to achieve an accuracy of at least 0.9 while interacting with each time series during 60 consecutive clock cycles. Its neural activity was then recorded line by line while processing the time series, giving 6 lines in total for 6 consecutive cycles of the RNN's correct functioning. These sets of lines were collected together to form the training dataset for the DNs. For the extra class, representing input data that does not correspond to any time series, we recorded the neural activity of the RNN while it processed half of all possible combinations of playmate's moves of length 6 (which coincides with the length of the basic repeated fragment of most of the time series used in the study, see Table 1). For testing the resulting decoding accuracy and the possibility of distinguishing "real" data, which corresponds to a time series, from "fake" data, which does not, we collected the testing dataset in the following way. We recorded the neural activity of the trained RNN during 50 clock cycles while it processed each time series. On the first cycles after receiving a new time series, the RNN was expected to make mistakes, and so was the DN (because of the non-stable regime of RNN functioning). In this case, a well-trained DN should decode the extra class on these cycles instead of identifying one of the time series. In practice, the typical duration of the RNN's non-stable functioning is around 6 clock cycles. For checking the proportion of false positive decoding results, we used a set of uniformly distributed random numbers in the range of neural activity (−1; 1) to imitate real data. The percentage of "successful decoding" of this data as corresponding to one of the time series was expected to be equal to 20% (the chance of a random choice from 5 classes). It can also be assumed that a well-trained DN should identify random sets of numbers as the extra class, since this data differs from the data corresponding to the time series. If so, then the DN is computationally powerful enough not only to recognize different types of neural activity of RNNs but also to distinguish neural activity from any other "meaningless" data. Finally, for checking the accuracy of recognizing the extra class, we collected a set of the trained RNN's neural activity while it received a random sequence of "playmate's moves". These datasets consisted of 200 lines. A well-trained DN was expected to decode this data as the extra class due to the extremely low probability of obtaining stable RNN functioning under these conditions. Visualization of the RNN's neural activity was carried out using the principal components method, where each point is the state of the RNN on the current clock cycle of the game, and its coordinates are the levels of excitation of the neurons with the corresponding serial numbers.
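The construction of the decoder's training set described above can be summarised in the following sketch; `record_activity` is a hypothetical helper, assumed here, that runs the trained RNN on a move sequence and returns one activity vector per clock cycle.

```python
import numpy as np

EXTRA_CLASS = 4  # classes 0-3 are the four time series, class 4 means "none of them"

def build_decoder_dataset(record_activity, time_series, random_fragments):
    """Collect (activity, label) pairs: 6 consecutive activity lines per known
    time series, plus extra-class lines recorded while the RNN processes
    arbitrary move fragments of length 6."""
    samples, labels = [], []
    for label, series in enumerate(time_series):
        for line in record_activity(series, n_cycles=6):
            samples.append(line)
            labels.append(label)
    for fragment in random_fragments:            # half of all length-6 move combinations
        for line in record_activity(fragment, n_cycles=6):
            samples.append(line)
            labels.append(EXTRA_CLASS)
    return np.asarray(samples), np.asarray(labels)
```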
The generation and operation of the RNN took place within the Lazarus development environment (https://www.lazarus-ide.org/); the visualization of neural activity was carried out within the Scilab package (https://www.scilab.org/).
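The authors performed the principal-component visualization in Scilab; a rough Python equivalent (scikit-learn and matplotlib assumed) could look like this.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_activity(activity, labels):
    """Project recorded RNN states (n_cycles x 25 array) onto the first two
    principal components and colour each point by the sequence being processed."""
    coords = PCA(n_components=2).fit_transform(activity)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
    plt.xlabel("Principal component 1")
    plt.ylabel("Principal component 2")
    plt.show()
```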
3 Results and Discussion

In previous studies we obtained a decoding accuracy of the processed time series in the EO imitation of up to 80% [10]. But, as mentioned earlier, this result cannot be considered reliable due to the inability of that kind of DN to distinguish between "real" or "meaningful" input data (which corresponds to one of the existing time series) and "fake" or "spurious" data (which corresponds to nothing). The decoding accuracies obtained on the test datasets in the EO and RSP imitations are shown in Table 2.

Table 2. Decoding accuracies on test datasets in EO and RSP imitation.

        RNN 1      RNN 2      RNN 3      Average
Even    73 ± 0%    80 ± 1%    78 ± 5%    77 ± 4%
Odd     82 ± 4%    76 ± 8%    86 ± 1%    81 ± 7%
RSP     91 ± 2%    75 ± 2%    91 ± 1%    86 ± 8%
The average accuracy for both game imitations is close to the previously mentioned result of 80%, so adding the extra class to the DNs did not lead to a significant improvement in the accuracy values. The estimates of the false positive percentage are presented in Table 3. As the input data, we fed random numbers in the range (−1; 1) to the DNs and processed their responses as if they were real neural activity data, checking whether the DNs managed to "guess" the number of the time series that we "had in mind". The percentage of correct "guesses" appeared to be close to 20%, which coincides with the probability of a random choice from five groups. The assumption that trained DNs should identify any "meaningless" data as the extra class, rather than as corresponding to a time series, was not confirmed. This result allowed us to note the limited computational power of the trained DNs, sufficient for identifying different types of RNN neural activity but not for distinguishing real neural activity data from any other data.

Table 3. False positive decodings on random datasets in EO and RSP imitation.

        RNN 1      RNN 2      RNN 3      Average
Even    18 ± 2%    18 ± 1%    16 ± 4%    17 ± 3%
Odd     18 ± 2%    20 ± 5%    14 ± 3%    17 ± 4%
RSP     21 ± 1%    16 ± 5%    20 ± 4%    19 ± 4%
To compare the difficulty of decoding the EO and RSP time series, we performed a reduction of the DN's synapses (see Table 4). In total, each DN had 300 synaptic connections, 25 of which were input synapses (in accordance with the number of RNN neurons whose neural activity was used as the input data for decoding). We considered the number of reduced input synapses separately as a significant parameter, since it shows how much the dimensionality of the input data can be reduced while maintaining decoding quality at the required level.

Table 4. Reduced synaptic connections of DNs in EO and RSP imitation.

        Maximum number of reduced synapses, in total   Maximum number of reduced input synapses
Even    20 (7%)                                        10 (40%)
Odd     10 (4%)                                        2 (8%)
RSP     25 (8%)                                        15 (60%)
The values for the RSP imitation are higher than for EO, so we concluded that the RSP time series data is more accessible for decoding (when using RNNs and DNs with the same parameters for both games). Presumably, the reason for this is the greater variability of the RSP combinations (three possible moves instead of only two in EO), which makes it easier to distinguish the neural activity corresponding to different time series. It can also be noted that decoding the neural activity of the "Odd" players is more difficult compared to the "Even" ones and requires more non-zero synaptic connections in the DN. This result indicates an asymmetry in the EO playing process: it seems easier to catch than to avoid being caught. Concerning the decoding of RNN neural activity recorded while processing totally random move sequences, we obtained 81 ± 5% correct results (i.e. the DN recognized the data as related to the extra class instead of one of the time series) for RNNs playing the "Even" position, 85 ± 5% for "Odd" players and 60 ± 6% for RSP players. Surprisingly, the DNs that decoded neural activity in the EO imitation learned to identify extra data (not connected with the time series) better than the ones in the RSP imitation. Nevertheless, the percentage of correct DN responses in both cases is much higher than the chance of a fully random choice from 5 groups (20%). This result was in line with our expectation that a well-trained DN should be able to distinguish the neural activity corresponding to the processing of the time series from other kinds of activity. The location of the points in the phase space of neural activity (see Materials and Methods) representing the lines of the training datasets for the imitations of both games is shown in Fig. 1. To visualize the multidimensional data, we applied principal components analysis. As can be seen from Fig. 1, the points of RNN neural activity do not spread uniformly and form groups (two groups for the EO and three for the RSP imitation) in the phase space of neural activity. The search for the exact reason for the formation of these groups is beyond the scope of this study and requires further investigation. We assume that the switching of game moves (made by the RNN itself or received as stimuli from the playmate) causes this separation.
Fig. 1. Visualization of neural activity data from the training dataset. Shape markers connected by lines represent the neural activity of the RNN while processing the time series; "plus" markers represent activity while processing other possible combinations of moves of length 6. Panels A and B show the neural activity of the RNN during the EO and RSP imitations, respectively. Graphs are plotted in principal component coordinates.
4 Conclusion

In the present study we showed that a neural-network-based decoding method allows recognition of the current sequence of events that a simple recurrent neural network is involved in. As a model situation, we used imitations of the reflexive games Even-odd and Rock-scissors-paper, where recurrent networks processed four different fixed sequences of moves according to the game rules. By adding to the decoder an extra class for input data that does not correspond to the used move sequences, we did not achieve an improvement in decoding accuracy compared with decoders with only four classes, each intended for a certain move sequence. But the fact that decoders with the extra class were trained successfully and demonstrated accuracies of up to 86% (for the Even-odd imitation) and up to 91% (for the Rock-scissors-paper imitation) is evidence of an objectively fixed difference between the representations associated with certain move sequences. It was also shown that the trained decoders could distinguish neural activity data corresponding to "known" move sequences from other neural activity.
Since it proved possible to decode the currently processed sequence of moves using the neural activity of the recurrent network, we conclude that simple recurrent neural networks are able to form stable recognizable representations of certain game situations. The presence of these representations indicates reflexive information processing; therefore, the reflection phenomenon can be studied on these simple model objects.

Acknowledgements. The work was supported by the Russian Science Foundation and the Krasnoyarsk Regional Science Foundation (project no. 23-21-10041).
References 1. Dmochowski, J.P., Ki, J.J., DeGuzman, P., Sajda, P., Parra, L.C.: Extracting multidimensional stimulus-response correlations using hybrid encoding-decoding of neural activity. Neuroimage 180, 134–146 (2018). https://doi.org/10.1016/j.neuroimage.2017.05.037 2. Zhang, Y.J., Yu, Z.F., Liu, J.K., Huang, T.J.: Neural decoding of visual information across different neural recording modalities and approaches. Mach. Intell. Res. 19, 350–365 (2022). https://doi.org/10.1007/s11633-022-1335-2 3. Panzeri, S., Harvey, C.D., Piasini, E., Latham, P.E., Fellin, T.: Cracking the neural code for sensory perception by combining statistics, intervention, and behavior. Neuron 93(3), 491–507 (2017). https://doi.org/10.1016/j.neuron.2016.12.036 4. Roelfsema, P.R., Denys, D., Klink, P.C.: Mind reading and writing: The future of neurotechnology. Trends Cogn. Sci. 22(7), 598–610 (2018). https://doi.org/10.1016/j.tics.2018. 04.001 5. Horikawa, T., Kamitani, Y.: Generic decoding of seen and imagined objects using hierarchical visual features. Nat. Commun. 8, 15037 (2017). https://doi.org/10.1038/ncomms15037 6. Meyers, E.M.: Dynamic population coding and its relationship to working memory. J. Neurophysiol. 120(5), 2260–2268 (2018). https://doi.org/10.1152/jn.00225.2018 7. Stokes, M.G.: Dynamic coding for cognitive control in prefrontal cortex. Neuron 78(2), 364–375 (2013). https://doi.org/10.1016/j.neuron.2013.01.039 8. Von Neumann, J., Burks, A.W.: Theory of self-reproducing automata. IEEE Trans. Neural Netw. 5(1), 3–14 (1966) 9. Bartsev, S.I., Bartseva, O.D.: The use of neural network model objects in studies of the structure-function correlation in evolving systems. Dokl. Biochem. Biophys. 376, 19–22 (2001). https://doi.org/10.1023/A:1018891824506 10. Markova, G., Bartsev, S.: Decoding the neural activity of recurrent neural network playing a reflexive game. In: 2022 6th Scientific School Dynamics of Complex Networks and their Applications (DCNA) (Kaliningrad, 2022), pp. 185–188. IEEE (2022). https://doi.org/10. 1109/DCNA56428.2022.9923193 11. Bartsev, S.I., Markova, G.M.: Does a recurrent neural network use reflection during a reflexive game? In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., (eds) Advances in Neural Computation, Machine Learning, and Cognitive Research VI. NEUROINFORMATICS 2022. Stud. Comput. Intell., vol. 1064, pp. 148–156. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_15 12. Bartsev, S.I., Baturina, P.M., Markova, G.M.: Neural network-based decoding input stimulus data based on recurrent neural network neural activity pattern. Dokl. Biol. Sci. 502(1), 1–5 (2022). https://doi.org/10.1134/S001249662201001X
Language Models Explain Recommendations Based on Meta-Information
Olga Sofronova(1) and Dilyara Zharikova(2)
1 Moscow Institute of Physics and Technology, Moscow, Russia
2 Former DeepPavlov.ai, Moscow, Russia
[email protected]
Abstract. For recommender systems, an explanation of why an item was recommended to a user increases reliability. In this work, we introduce a post-hoc method of explaining any recommender system's output with the use of LLMs and meta-information about the recommended item and the user's preferences. We try different models and introduce metrics for estimating the quality of generated explanations. The models are evaluated on three domains and then compared to analyze their ability for domain transfer.

Keywords: recommender system · explanation · LLM

1 Introduction
Recommender systems are models created to provide suggestions for items that are most pertinent to a particular user. Recommender systems are widely used in marketing, promotion of goods and services, and content suggestion. Production-ready recommender systems are mostly item-prediction models, which means that they do not give any explanations of the provided suggestions. At the same time, with the increasing popularity of dialog systems, users may show interest in why certain recommendations were offered to them. For example, a chatbot may be asked to explain the reason why an item was recommended to a user. There are two main types of recommender systems based on their strategy: content-based filtering and collaborative filtering. Content-based filtering uses the contents (meta-information) of the items that were rated by the user in order to recommend other items that are similar. This type of system relies on the user's own preferences to recommend items, and as a result can be more precise in its recommendations. It is worth noting that for this type, meta-information about items is required. One of the advantages of content-based filtering is the opportunity for feature importance analysis, which can be utilized as a ground for recommendation explanation. Collaborative filtering uses the preferences of all users to determine the similarity between users and then recommend items that the user might like based on what other users with similar tastes enjoyed. One of the advantages of this type of system is that it is more focused on social learning
and enables users to "learn" from each other. The disadvantage of collaborative filtering is that the features of items are not utilized for recommendations, so there is no opportunity to extract the items' feature importance. In this work, we aim to create a post-hoc recommendation explanation method, suitable for any recommender system, with the use of Large Language Models (LLMs). Lately, LLMs have shown significant performance improvements on many reasoning tasks [2], and some works argue that LLMs are more about world modelling than language modelling [7,8]. In this work, we apply LLMs to generate textual explanations of the suggestions provided by a recommender system of any type. However, meta-information about the user and the item is still required for an explanation. To make sure that the LLM takes into account the characteristics of the suggested item, we add meta-information about the item to the LLM input. This additional information gives us an opportunity to work with recently appeared products, as we do not rely only on the LLM's knowledge. We also add meta-information about the user's preferences to the LLM input. Although LLMs with the highest number of parameters are most promising in terms of generation and reasoning quality, we fine-tune smaller LLMs on a synthetic dataset composed of template-based explanations based on the feature importance of a content-based recommender system. Our choice of smaller LLMs is dictated by our computational resources, while the experiments could be repeated for bigger models. Utilizing a custom fine-tuned explanation model does not require paid APIs and does not compromise private data. For fine-tuning, we use the movie domain, but we also hypothesize that our model will be successful to a certain degree on other domains without any additional training.
2 Related Work
There are two main approaches to explaining recommendations [11]. The first one is to make a recommendation based on logic understandable to a human being, and thus have the explanation by default. But the accuracy and quality of recommendations for these models are limited, as they tend to be quite simple. The second one is to use a post-hoc method to explain the output of another, more complicated algorithm. This method has become more popular in recent years as recommendation systems become more and more complex. Explanations increase the users' acceptance of the model's recommendations, as has been confirmed in [5]. Although the authors focus on explaining collaborative filtering models and researching the user's behaviour towards the explanations, the study results can be applied to the work of any complex, hard-to-explain model. The idea of using language models for creating recommendations has been approached in [3]. The authors use an LSTM to generate the explanation in the form of a user review that could have been written for the suggested item, but they mostly focus on the readability of the explanation. In [10], the authors use several explainable recommendation algorithms that may be used to substantiate the output of a complex model. They consider the explanation valid when the resulting recommendation is aligned with the output
of the complex model. Among other methods, they use a collaborative filtering approach for creating simple explanations in the form of . Their user study shows that such an explanation is the least convincing for users. Consequently, we concluded that it does not achieve the goal of increasing the trustworthiness and reliability of the recommendation due to its lack of personalization, and decided to focus on user-item explanations.
3 Datasets
We fine-tuned the models on a recommendations dataset from the movie domain. To explore the fine-tuned models' ability to explain recommendations in other domains, we evaluated the models on a dataset with books and on a small artificial dataset with dishes as items, which we created ourselves specifically for this task. The datasets are described below. The domains are selected as popular ones from a conversational topic dataset [9] and based on conclusions about popular conversational topics from the Alexa Prize Challenge [1,6].

3.1 Movies Dataset
For content-based methods, both meta-information about items and user ratings are required. We extend the MovieLens dataset [4] with detailed meta-information about movies using the IMDb API library Cinemagoer (https://github.com/cinemagoer/cinemagoer). MovieLens only includes the title, year and genres of movies, so additional information was required to make explanations more precise. Searching IMDb by movie name and year, we retrieved the following meta-information: cast, countries, languages, directors, writers, producers, composers, production companies, IMDb rating, keywords. The most popular values of language and country were English and USA, so we removed them as they did not provide useful information. From each field with multiple values (e.g., cast), only the most popular values were taken into account. Then, for each movie, we encoded all features independently and concatenated the obtained vectors in the same order, which gave us one-hot vectors of length 2477. User profiles were obtained by averaging the vectors of all the movies they gave positive (four or five stars) ratings to. In a template-based manner, the retrieved information about the movie and the user was structured into a statement. Here is an example of an input of a training sample:

User. genres: children's, comedy, drama, musical, romance; decades: 90's movies, 80's movies; actors: Tom Hanks, Frank Oz; keywords: female-nudity, new-york-city, based-on-novel, female-protagonist, love, friendship, teenage-girl, high-school, class-differences; writers: John Hughes; producers: Walt Disney.
Movie. Pretty in Pink. genres: comedy, drama, romance; decades: 80’s movies; actors: Harry Dean Stanton, James Spader, Andrew McCarthy, Annie Potts, Jim Haynie, Molly Ringwald; keywords: teenage-girl, dysfunctional-family, cleavage, class-differences, teen-angst; director: Howard Deutch; writers: John Hughes; producers: John Hughes; production companies: Paramount Pictures.
To create a recommendation based on similar features of an item and a user profile, we found the movie with the largest cosine similarity to the given user and picked several of the most common tags. Then, with the use of other templates, this information was formed into a statement. Here is an example of an expected output of a training sample: I would recommend you watching . It has some things you seem to enjoy, like .
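A minimal sketch of this step, assuming the one-hot movie vectors and averaged user profile described above; the template wording and the way tags are picked are simplified illustrations rather than the authors' exact code.

```python
import numpy as np

def recommend_with_explanation(user_vec, movie_vecs, titles, feature_names, top_k=3):
    """Pick the movie with the largest cosine similarity to the user profile and
    explain it by the features shared between the movie and the profile."""
    norms = np.linalg.norm(movie_vecs, axis=1) * np.linalg.norm(user_vec) + 1e-9
    sims = movie_vecs @ user_vec / norms
    best = int(np.argmax(sims))
    shared = np.flatnonzero((movie_vecs[best] > 0) & (user_vec > 0))
    # Simplification: take the first top_k shared features as the "most common tags".
    tags = [feature_names[i] for i in shared[:top_k]]
    return (f"I would recommend you watching {titles[best]}. "
            f"It has some things you seem to enjoy, like {', '.join(tags)}.")
```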
Generating recommendations for every user gave us a dataset with 280k samples, each consisting of a user, a movie and a recommendation statement. Since most of them were very alike (we only used 17 templates), there was a major risk of over-fitting to these patterns. As a result, a dataset with three columns was created: the structured user profile, the structured movie information (meta-information), and the recommendation statement.
3.2 Books Dataset
For the evaluation of quality in the book domain we used the Book-Crossing dataset [13]. It includes the book author, publisher and release year, which were used for creating book and user descriptions. As with MovieLens, we included additional meta-information about each book (book categories) obtained with the use of the Google Books API. Users who gave fewer than five positive reviews were excluded from the data. Further processing of the dataset was the same as for the movie dataset. The part of the dataset used for evaluation consisted of 50 entries. The models were not fine-tuned on the book recommendations.
3.3 Dishes Dataset
Inspired by the MovieLens dataset, we also created a small dataset for the food domain. It includes only about 25 items, 50 users and 1020 ratings. It was transformed in the same way as the movie dataset to obtain structured user profile sentences, and different items were explained as recommendations for each user. The models were not fine-tuned on the food domain. The final dataset contains 50 entries of user descriptions. An example of a user profile: type: salad, breakfast, soup, dessert; keywords: toppings, meat, easy-to-make, hard-to-make, batter, boil; serving: hot, cold, warm; origin: european
4 Methodology

4.1 Fine-Tuning LLMs to Generate Explanations
We experimented with several language models of different sizes and then compared the results:
– GPT-2 with 355 million parameters,
– GPT-2 with 1.5 billion parameters,
– GPT-J with 6 billion parameters.
We trained the LLMs with and without prompts (see details in Sect. 4.4). We utilized a dataset in which each sample contained the user profile, the description of the movie suggested by a recommender model for the user, and a recommendation sentence containing the explanation. The method of generation of the synthetic dataset is described in Sect. 3. Without a prompt, we simply concatenated the structured information about the user and the movie. Each model was trained on a single A100 GPU for less than three hours.
4.2 Explaining Collaborative Filtering Models
We assume that a recommendation explanation model should be able to generate explanations even for suggestions made by a collaborative filtering algorithm. To validate this assumption, we asked our fine-tuned model to explain suggestions made by a simple collaborative filtering method using SVD decomposition. The setup of the experiment was the same as above, except for the source of the recommendation.
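A generic sketch of such an SVD-based collaborative filtering step (plain NumPy, zero-filled rating matrix); the paper does not specify the exact decomposition setup, so the rank and masking here are assumptions.

```python
import numpy as np

def svd_recommend(ratings, user_idx, rank=20, top_n=5):
    """Reconstruct the user-item matrix from its top singular components and
    return the indices of the highest-scoring items the user has not rated."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    approx = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]
    scores = np.where(ratings[user_idx] > 0, -np.inf, approx[user_idx])  # mask rated items
    return np.argsort(scores)[::-1][:top_n]
```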
4.3 Domain Transfer
We hypothesize that the recommendation explanation model has the potential to be transferred across domains. To check this assumption, we tested the model's ability to explain recommendations in a different domain without additional fine-tuning. The inputs to the LLM were sentences with a meta-information structure similar to the users and movies in the training dataset, but containing different categories. An example is given in Sect. 3.
4.4 Prompting
Each model was trained and tested with and without the prompt, which helps it better understand the task. The prompt we used is shown below.

Explain why the user could like the given movie based on information about user and movie. Here is info about the user:
Here is the info about the movie:
Here is your explanation:
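Putting the pieces together, the LLM input can be assembled roughly as follows; the placeholder positions after each "Here is ..." line are a reconstruction, since the original placeholders are not shown explicitly in the text.

```python
PROMPT_TEMPLATE = (
    "Explain why the user could like the given movie based on information "
    "about user and movie.\n"
    "Here is info about the user:\n{user}\n"
    "Here is the info about the movie:\n{movie}\n"
    "Here is your explanation:\n"
)

def build_llm_input(user_desc: str, movie_desc: str, use_prompt: bool = True) -> str:
    """With the prompt, fill the template; without it (the no-prompt setting),
    simply concatenate the structured user and movie descriptions."""
    if use_prompt:
        return PROMPT_TEMPLATE.format(user=user_desc, movie=movie_desc)
    return f"{user_desc}\n{movie_desc}"
```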
5 Metrics

5.1 Score Based on the Same Tokens
The first score is based on the tokens that appear in all three blocks: the user profile, the movie information and the generated explanation. It is calculated as follows:

s_{same} = \frac{|m \cap u \cap r|}{|r \cup (m \cap u)|},   (1)

where m is the set of tokens describing a movie, u is the same kind of token set for the user information, and r is the set of tokens in an explanation statement. The intuition behind this score is awarding only absolutely correct information about both the movie and the user.
5.2 Score Adding Points for Movie Tokens
The second score, s_item, gives a point for each token in an explanation statement that was mentioned in both the movie and the user meta-information, and half a point for each token in an explanation statement that appeared only in the movie meta-information. After summing up all the points, they are divided by the overall number of tokens in the explanation statement for normalization. It is calculated as follows:

s_{item} = \frac{|r \cap m \cap u| + \frac{1}{2}\,|(r \cap m) \setminus u|}{|r|}.   (2)

The intuition behind this metric is also giving points for information about the movie, because even if it has not appeared in the user profile, the explanation still remains accurate, as it only assumes things about the user.
5.3 Score with Penalties
The third score, s_fine, in its turn extends the second one. We now penalize the model by half a point for each token in an explanation statement that appeared in the user profile but was not included in the movie description. We also penalize by a point each generated token that is not included in either the movie or the user description. To avoid over-penalizing naturally generated text, we do not penalize stop words.
It is calculated as follows:

s_{fine} = s_{item} - \frac{\frac{1}{2}\,|(r \cap u) \setminus m| + |r \setminus (u \cup m)|}{|r|}.   (3)
This metric decreases when the explanation stops being accurate, which is indicated by the use of tokens not seen in the movie description. The movie description is precise, and if the explanation sentence states facts about it that are incorrect, this lowers its reliability.
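A compact sketch computing all three scores from token sets; the stop-word handling follows the description above, and the exact tokenization is left as an assumption.

```python
def explanation_scores(r, m, u, stop_words=frozenset()):
    """r, m, u: token sets of the explanation, the movie description and the
    user description; returns (s_same, s_item, s_fine) as defined in Eqs. (1)-(3)."""
    r, m, u = set(r), set(m), set(u)
    s_same = len(r & m & u) / max(len(r | (m & u)), 1)
    s_item = (len(r & m & u) + 0.5 * len((r & m) - u)) / max(len(r), 1)
    penalty = (0.5 * len((r & u) - m - stop_words)
               + len(r - (u | m) - stop_words)) / max(len(r), 1)
    return s_same, s_item, s_item - penalty
```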
6 Experiments

6.1 Fine-Tuning LLMs
The considered language models, GPT-2 355M, GPT-2 1.5B and GPT-J 6B, were fine-tuned on the synthetic dataset of the movies domain described in Sect. 3.1. The books and dishes datasets were utilized only for the evaluation of the language models on domains other than the training one. Training losses and BLEU scores of the proposed models are presented in Figs. 1, 2 and 3. The results of fine-tuning GPT-J 6B with and without prompting demonstrate how much prompting increases the quality of LLM training.
Fig. 1. Training scores (BLEU and NLL loss) of fine-tuning GPT-2 355M on the synthetic movies dataset.
Table 1 shows the metrics of the models fine-tuned on the movies dataset when inferring on each of the considered domains. Examples of generated explanations are presented in Table 2 of Appendix A.
Fig. 2. Training scores (BLEU and NLL loss) of fine-tuning GPT-2 1.5B on the synthetic movies dataset.
Fig. 3. Training scores (BLEU and NLL loss) of fine-tuning GPT-J 6B on the synthetic movies dataset.
6.2 Domain Transfer
Movies Domain. Figure 4 shows that the considered evaluation metrics are relatively stable as more refinements are added to the metric: from s_same to s_item, and from s_item to s_fine. The former demonstrates that most of the explanation tokens are present in both the movie and the user description, which makes the explanation
Table 1. Models trained on the movies dataset with prompting: the three considered metrics evaluated on different domains. The higher, the better. The metric s_same counts the share of explanation tokens appearing in both the user and movie descriptions. The metric s_item also gives half a point for each token appearing only in the movie description. The metric s_fine extends s_item but also penalizes by a point tokens present in neither the user nor the movie description, and by half a point tokens not present in the movie description.

Original model   Inferred on   s_same   s_item   s_fine
GPT-2 355M       Movies        0.35     0.59      0.45
GPT-2 355M       Books         0.01     0.1      −0.25
GPT-2 355M       Dishes        0.02     0.39      0.14
GPT-2 1.5B       Movies        0.47     0.69      0.61
GPT-2 1.5B       Books         0.05     0.42      0.13
GPT-2 1.5B       Dishes        0.08     0.53      0.47
GPT-J 6B         Movies        0.47     0.61      0.37
GPT-J 6B         Books         0.28     0.31     −0.15
GPT-J 6B         Dishes        0.10     0.38      0.04
relevant and, supposedly, accurate. The latter indicates that most of the stop words for movies were excluded from consideration. It also shows that only a minority of tokens were unidentified, i.e. not present in the descriptions.
Fig. 4. Bar charts of the distributions of the s_same, s_item and s_fine metrics (from top to bottom) for the movies, books and dishes evaluation subsets (from left to right), for GPT-J 6B fine-tuned on movies with prompting.
Other Domains. The corresponding parts of Fig. 4 show the distributions of the three considered evaluation scores on the books and dishes datasets. The larger number of negative s_fine values compared to the movies dataset reflects the fact that the words manually excluded from penalization are specific to the movie domain, while other domains (even a close one, the books domain) imply different words for explanations. s_same is lower than s_item. Part of the difference is due to the reason described above (s_same penalizes random tokens more than s_item). But the main
reason is that the model mostly uses item tags not present in the user profile for the explanation. This happens because there are not many common tokens, so the model has to come up with a dubious explanation.
7 Conclusion
In this work, we suggested a post-hoc method for creating natural language explanations for recommendation algorithms of any complexity. We explored several models and approaches for fine-tuning and evaluating them. We also implemented several quantitative metrics to reflect the relevance of the explanations given by a language model, and included generated explanation examples so that the reader can estimate the subjective quality of the fine-tuned models. Finally, we conducted evaluation experiments to estimate the domain transfer ability. The fine-tuned large language model with billions of parameters demonstrated the ability to generate explanations for a similar domain at almost the same quality as for the domain it was fine-tuned on. Future work includes constructing metrics for better quality estimation and increasing the quality of the model on other domains by training it on a dataset of several different domains and evaluating it on an even larger range of item types. The proposed approach could be further integrated into DeepPavlov Dream [12], a platform for building modular dialog systems.

Acknowledgement. This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.
Limitations. The proposed approach for creating a synthetic dataset requires a content-based recommendation model to highlight feature importance in an explanation. The inference-only mode of utilizing large language models on unseen domains implies that the LLM should be familiar with at least the general concept of the domain. The proposed work only explores a limited number of English-language domains. The recent increase in LLM quality allows them to successfully generate explanations based on meta-information with appropriate prompting. Still, utilization of LLMs requires significant computational resources even in inference mode, so task-specific fine-tuned language models can save resources for effective utilization in production.

Ethics Statement. (1) This material is the authors' own original work, which at this stage of project development has not been previously published elsewhere. (2) The paper is not currently being considered for publication elsewhere. (3) The paper reflects the authors' own research and analysis in a truthful and complete manner. (4) We acknowledge that the use of generative language models like GPT-J 6B and others in production might lead to potential harm to the end user experience. (5) The considered dataset for fine-tuning, MovieLens, is licensed under Apache 2.0. The considered LLMs
are licensed under MIT (GPT-2) and Apache 2.0 (GPT-J). Therefore, the proposed approach could be utilized for commercial use.
A Generation Examples
Table 2. Examples of explanations generation by the considered fine-tuned models GPT-2 355M, GPT-2 1.5B, GPT-J 6B. For each model, examples of movies and dishes domains are presented GPT-2 355M
Not fine-tuned
Fine-tuned without prompting
Fine-tuned with prompting
keywords: murder, train, betrayal, helicopter, disguise, fugitive; actors: Ving Rhames, Jon Voight, Kristin Scott Thomas, Vanessa Redgrave, Jean Reno, Tom Cruise; genres: mystery, adventure, action, thriller; production companies: Universal Pictures; decades: 1990’s movies
Mission: Impossible is a 1990’s action thriller based on a 1990’s movie called Mission: Impossible. Ving Rhames, Jon Voight, Kristin Scott Thomas, Vanessa Redgrave, Jean Reno, Tom Cruise, David Koepp, production companies: Paramount Pictures, 1990’s movies, keywords: murder, train, betrayal, helicopter, disguise, fugitive, fugitive, fugitive, If you like fish, you like Tom yum, you like some fish, some items, some keywords, some items, some keywords,
Check out Mission: Impossible. You would find its genres are adventure, action and thriller, it is a 1990’s movie, and it has Ving Rhames, Jon Voight, Kristin Scott Thomas, Vanessa Redgrave, Jean Reno and Tom Cruise as some of the main actors, keywords murder, train, betrayal, helicopter, disguise and fugitive, and it seems like you would enjoy it If you like Tom yum, you might like this dish.
Claire Bloom is a detective in the 1960s. She works with her partner Alan Napier to solve crimes that take place in New York City. The film is about a man who goes to the police after his car is stolen from a parking garage. The police are unable to find the car’s owner, but they discover that the car has been stolen from an old bank that is in the same building as the police station. Claire and Alan investigate the case, which leads them to uncover a conspiracy involving the bank’s owner, a wealthy businessman named Thomas Crown Tom yum is a salad that you can eat on a hot day. The soup is made of chicken, rice, and vegetables. It is very popular in Southeast Asia. Breakfast soups are similar to soup, but they are made with eggs instead of chicken. Dessert soups are similar to soup, but they are made with ice cream instead of rice
I would suggest The Thomas Crown Affair, a crime, thriller, drama, romance. It has Faye Dunaway as one of the main actors, keywords kiss, 1960s, investigation, cigar-smoking and bank-robbery, Walter Mirisch as producer, 1960’s movies, so you will probably like it
You like movies that deal with romance, crime, thriller, drama, romance, and genres of crime, thriller, drama, romance, and genres of crime. You like movies that are set in 1960’s movies, with actors Faye Dunaway and Harry Carey Jr. Also, you like movies that are set in New York, with keywords “kiss” and “investigation.” The Thomas Crown Affair is a crime thriller, and it is rated R for language and violence
I would recommend you watching this video. It has some things you seem to enjoy, like soup and it has keyword fish, so you should watch it
You like soups, so you like tom yum. The soup is made of lemongrass, lime leaves, galangal, lime juice, fish sauce, chili pepper, and Thai origin. It is a hot soup, so you like hot. The soup is from asian, so you like thai. The soup is hot, so you like hot. This dish is from thai, so you like thai.
All Dogs Go to Heaven 2 is a 1990’s movie. It is rated G for good-versus-evil, numbered-sequel, and second-part. Vincent Price, Boris Karloff, John Hurt, Ernest Borgnine, Wallace Shawn, Alan Mowbray, Jim Cummings, Herbert Marshall, Charlie Sheen, Dom DeLuise, Deborah Kerr, and Charlie Adler are in this movie. The title of the movie is All Dogs Go to Heaven 2 You are given a movie and a user. You need to explain why the user could like the given movie based on information about user and movie. I am having trouble figuring out how to approach this problem. Any help would be appreciated
All Dogs Go to Heaven 2 is a 1990’s movie
You might like All Dogs Go to Heaven 2 if you like adventure, animation, musical, fantasy, romance and 1990’s movies
Tom yum is a dish that the user could like. It has title, type, serving, keywords, and origin
I think you would like this dish because it is similar to your type of food. Tom yum is a soup, ingredients are lemongrass, lime leaves, galangal, lime juice and chili pepper
Tom yum is a dish made of noodles, fish sauce, and chili pepper. It is a combination of Thai and Asian flavors GPT-2 1.5B
GPT-J 6B
References 1. Baymurzina, D., et al.: Dream technical report for the Alexa prize 4. In: 4th Proceedings of the Alexa Prize (2021) 2. Bubeck, S., et al.: Sparks of artificial general intelligence: early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023) 3. Costa, F., Ouyang, S., Dolog, P., Lawlor, A.: Automatic generation of natural language explanations. In: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, pp. 1–2 (2018)
4. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TIIS) 5(4), 1–19 (2015) 5. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining collaborative filtering recommendations. In: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pp. 241–250 (2000) 6. Kuratov, Y., et al.: Dream technical report for the Alexa prize 2019. In: Alexa Prize Proceedings (2020) 7. Li, K., Hopkins, A.K., Bau, D., Viégas, F., Pfister, H., Wattenberg, M.: Emergent world representations: exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382 (2022) 8. Link, A.: Large language model: world models or surface statistics? https://thegradient.pub/othello 9. Sagyndyk, B., Baymurzina, D., Burtsev, M.: DeepPavlov topics: topic classification dataset for conversational domain in English. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. Studies in Computational Intelligence, vol. 1064, pp. 371–380. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19032-2_39 10. Shmaryahu, D., Shani, G., Shapira, B.: Post-hoc explanations for complex model recommendations using simple methods. In: IntRS@ RecSys, pp. 26–36 (2020) 11. Zhang, Y., Chen, X., et al.: Explainable recommendation: a survey and new perspectives. Found. Trends Inf. Retrieval 14(1), 1–101 (2020) 12. Zharikova, D., et al.: DeepPavlov dream: platform for building generative AI assistants. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (vol. 3: System Demonstrations), pp. 599–607 (2023) 13. Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web, pp. 22–32 (2005)
Analysis of Text Data Reliability Based on the Audience Reactions to the Message Source
Igor M. Artamonov and Yana N. Artamonova
Neurocorpus LLC, Moscow, Russia
[email protected]
Abstract. Classification of relevant and irrelevant data is one of the most important tasks in modern machine learning. This paper develops a semi-supervised classification method that is based on partially defined information about a text source. The method is based on the analysis of the author's engagement with the subject area and the combination of the author's assessment by readers with the quality of the published texts. To achieve this we used a combination of joint author and text analysis that allowed us to significantly reduce the workload for further data markup. The method uses a learning loop that balances class attribution probabilities for text data at the level of the most effective class separation for a given noise level in the data. It was found that different paths for estimating the relevance probability had to be used for relevant and irrelevant records. Both likelihood and plausibility approaches were used to achieve an acceptable level of classification. The method was developed in the process of analysis of a large number of texts from social networks. It showed high effectiveness on a large amount (>70 mln records, >8 mln unique authors) of confusingly similar text data.

Keywords: text mining · text data relevance · text data reliability · audience reactions · social listening analysis · data classification
1
·
Introduction
The evolution of the Internet has led to the emergence of a wide variety of information sources. At the same time, the number of channels for transmitting and disseminating information has been increasing rapidly. These channels include, but are not limited to, e-mail lists, forums, social networks and internet media. The positive side of this was a sharp rise in the number of authors across a wide variety of topics. Within a certain limit it can be said that everyone who had access to the Internet and posted at least one message can be considered a source of data. This content variety allowed analysts to go beyond the narrow professional circle and to use amateur content as a unique material for analysis. The reverse side of this was the rising problem of separating relevant and irrelevant information. Due to the significantly greater number of irrelevant data
with respect to the useful ones, they became a rather annoying problem. Their separation, filtering and classification have attracted the attention of researchers for a long time. One of the reasons for this is the intention to either make this process fully automated or to include a noticeable level of automation. It should be noted that irrelevant information is heterogeneous both in its structure and in its objective function. It includes such varieties as spam, phishing, substitution, farming, fakes, rumors and the recently appeared deep fakes and generated texts. In this article, we limit the area of interest to identifying irrelevant information in product reviews and feedback. This process was carried out within the framework of social listening analytics, in which data collected from various web sources is systematically processed to support business decisions. Social networks led to the emergence of a separate specific topic of detecting inaccurate feedback and reviews about a business, product or service [1]. One of the features of such reviews is that they are practically indistinguishable from good ones for a non-expert in the subject area. At the same time they significantly jeopardize the credibility of the entire decision-making practice based on recommendations and reviews on the Internet. In the work [2] three main classes of unreliable reviews were identified. These classes are (1) unreliable or false reviews, (2) brand reviews with comments related to the brand only but with no feedback about the quality of the product, and (3) "non-responses", which do not relate to the subject text or contain paid advertising. A fourth class can be added with outright scams that nevertheless mention relevant brands, products or keywords. As a result, filtering out the obvious scams does not lead to unambiguously determining the relevance of messages. All three classes of unreliable reviews contain information that is irrelevant but mimics relevant content. A study of publications describing approaches used to solve this task showed that almost all methods of machine learning and deep learning were in demand. Depending on the specific task of classification of relevant and irrelevant data and the skill of the researcher, the methods showed different degrees of accuracy. Many of them achieve satisfactory metric values and practical results in solving the task. Review papers of the studied methods [5] distinguish the following main methods of working with data: clustering, the naive Bayesian method, K-nearest neighbors, support vector machines, the XGBoost classifier, neural networks, decision trees, random forests, and deep learning. Some more sophisticated methods used emotional assessment and sentiment analysis. They were based on the hypothesis that, in order to simulate "live" authorship and attract the reader, it is effective to include additional emotionality in messages. To distinguish them, one has to use minor differences in characteristics from sincerely expressed emotions. "Sentiment analysis of text analytics, which determines the polarity and intensity of feelings conveyed in the text, is currently used in false news detection methods either as the basis of the system or as an additional component" [4]. Sentiment analysis (SA) is based on natural language processing (NLP) techniques used to extract the user's feelings and is
presented in a meta-analysis [5]. Methods of identifying specific emotionality show interpretable results on individual samples, and their development continues. Despite the use of virtually all available machine and deep learning methods to classify the data, it should be noted that none of the proposed methods has reached the stage of a universal solution. The search for local solutions remains essential. The analysis of a large number of reviews showed that the relevance of content has some connection with its author (the message source). The purpose of this publication is to describe and analyze a proven approach to separating irrelevant and relevant data in a set of confusingly similar large text data taken from social networks. We evaluate the applicability of authorship-based message filtering on real data and develop its relationship to the relevance of messages for the subject area.
2 Data Structure and Analysis
The initial task was to analyse data relevant to a given subject area on the basis of publications in social networks. The input was a monthly extraction of posts from social networks on a selected topic in *.xlsx format (1.5–2.5 mln messages per month). One post corresponded to one record in the table. The key field for analysis was the post text. From the customer's point of view, the attribution of data as either relevant or irrelevant could be positional. This means that the same record could be treated as relevant in one context and irrelevant in another. Records that were not relevant in any of the relevant contexts had to be excluded from further analysis. The most distinctive feature of the text data was the similarity, up to indistinguishability, of relevant and irrelevant messages. Different classes of messages could not be separated without professional expertise in the subject area. That required a thorough structural and contextual text analysis, which is described in [3]. In addition to the text field, each entry had other fields associated with the post author. The only mandatory field for the author was a unique identifier in the form of her/his page URL in the social network. Additional parameters could include the name of the author, a text description, the number of subscribers the author had, the number of posts created by the given moment, the number of thematic posts and the number of comments from subscribers in the feed. At the first stages of development, these fields did not participate in the filtering. It was carried out exclusively on the basis of the text field. After performing a preliminary analysis, it became obvious that the features of the author may be related to the relevance of his posts. Therefore we implemented a simple filter, proposed by the customer, to analyze the description of the author using regular expressions (a sketch is given below). In order to avoid deleting relevant entries, this filter covered only a small number of authors whose posts were subject to removal. At the same time, the other fields associated with the author remained unprocessed.
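A hedged illustration of such a description-based filter (pandas assumed); the column name and the patterns are placeholders, since the real pattern list was supplied by the customer.

```python
import pandas as pd

# Placeholder patterns; the production list was provided by the customer.
BLOCKED_DESCRIPTION_PATTERNS = ["online store", "promo code", "giveaway"]

def drop_blocked_authors(posts: pd.DataFrame) -> pd.DataFrame:
    """Remove posts whose author description matches any blocked pattern;
    authors with an empty description are kept."""
    pattern = "|".join(BLOCKED_DESCRIPTION_PATTERNS)
    desc = posts["author_description"].fillna("")
    return posts[~desc.str.contains(pattern, case=False, regex=True)]
```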
This led to the need to analyse these fields and combine them into useful features. The proposed author-related features were divided into three groups:

– metrics derived from the content of posts (content scores);
– metrics based on fields characterizing the author itself (impact scores);
– composite metrics that combine data from both groups (integral scores).

Content scores showed how relevant the content produced by the author was to the target area. By that moment the algorithm to select records with a controllable number of irrelevant messages had been implemented [3], and we assumed that it was possible to evaluate the quality of each author on the basis of the presence or absence of his records in the selection. If the author had a number of "good" (relevant) posts above a certain positive threshold, it was reasonable to consider him a relevant one and return his rejected posts back. Conversely, if this number was below a certain negative threshold, the program excluded his other messages from analysis. The task was to find these thresholds. The final content scores contained:

– the number of "bad" publications (discarded from the final data);
– the number of "good" publications (present in the final data);
– the content quality of the author, defined as the ratio of relevant to irrelevant publications;
– the total number of posts the author has in the current analysis cycle (usually a month).

Impact scores characterized the author's impact on the potential audience, with an emphasis on how popular and active the author is. They also included characteristics related to the attribution of the author to a group based on his description. The last point was crucial, since the impact of an author might differ within different groups. The only integral metric was engagement, which was calculated according to the following formula:

Engagement = SumReactions / (PostCount · Subscribers)
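A minimal sketch (ours, not code from the project) of how these author-level scores could be combined; the field names, threshold values and verdict labels are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuthorStats:
    good_posts: int       # author's posts kept in the final (relevant) data
    bad_posts: int        # author's posts discarded as irrelevant
    sum_reactions: float  # total audience reactions to the author's posts
    post_count: int       # posts in the current analysis cycle (usually a month)
    subscribers: int

def content_quality(a: AuthorStats) -> float:
    """Content score: ratio of relevant to irrelevant publications."""
    return a.good_posts / max(a.bad_posts, 1)

def engagement(a: AuthorStats) -> float:
    """Integral score: Engagement = SumReactions / (PostCount * Subscribers)."""
    denom = a.post_count * a.subscribers
    return a.sum_reactions / denom if denom else 0.0

def author_verdict(a: AuthorStats, pos_threshold: int = 5, neg_threshold: int = 5) -> str:
    """Threshold rule described in the text (threshold values are illustrative)."""
    if a.good_posts >= pos_threshold:
        return "relevant: return the author's rejected posts to the data set"
    if a.bad_posts >= neg_threshold:
        return "irrelevant: exclude the author's other messages from analysis"
    return "undecided: leave for expert review"
```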
Based on the proposed metrics, a joint analysis of both the content and its authors was carried out. It turned out that without additional expertise the automatically classified list of authors was of limited value. On the other hand, the amount of work performed by the expert was noticeably reduced by improving the preliminary segmentation and grouping of the authors with the highest impact scores. Displaying the calculated content scores and the preliminary attribution of the author and messages to content groups also accelerated the expert analysis. It is worth mentioning that there were several groups of both relevant and irrelevant messages. The latter include messages with plain spam, irrelevant content, content that is relevant in substance but irrelevant by other factors, etc. This allowed both fine-grained filtering and significantly simplified markup.
Feature engineering was conducted to find baskets into which the authors could be placed. Thresholds were calculated to classify the author and his posts as subject to inclusion in the analysis or exclusion from it. It was determined that unambiguous attribution to a certain group using only author-related data was not possible. However, this data could significantly improve the attribution. An example of such an analysis is shown in Fig. 1. The plot relates engagement to the number of subscribers for various thematic groups (the names of the groups were removed due to an NDA). The sample included 790,000 records from three months of source data.
Fig. 1. The dependence of engagement for various thematic groups of authors (Color figure online)
Two groups in Fig. 1 (orange and blue dots) are visually separated from the others noticeably well. This allowed us to assume that a fuzzy class assignment built on this ratio can serve as one of the filters. To check the hypothesis, a linear regression was applied and the average variance for each group was calculated. To assess the characteristics of the distribution, the mean was subtracted to bring it to a stationary one, and it was converted to lognormal to balance the heavy left ("few subscribers") and light right ("many subscribers/influencers") tails. The resulting distribution was tested by the Shapiro-Wilk test. None of the distributions turned out to be normal (alpha 100 in our case). An increase in the credibility of trust is directly related to a decrease in criticism of the opinion, which creates a dangerous situation for the entire data analysis system as a whole. In order to avoid degeneration of the system, it is required to set up some barriers and turn on some "elevators." In this project these functions were performed by the customer's expertise and by relatively wide filters that preserve as much relevant content as possible. In this case the author stays in the analysis until he gets very low relevance scores, and the expert can raise or lower the author's rating to include him in a certain rating group.

The method used combined supervised learning with unsupervised techniques. Their simultaneous application differs from the traditional one, where clustering is used primarily to get some understanding of the data or to form some primary features. We applied clustering on partially marked-up data to identify specific non-relevant groups that are poorly cut off by filters configured to extract relevant messages. Such irrelevance assessment based on the sources made it possible to significantly improve the classification of the relevance of information.

The allocation of a dynamic pool of reference authors with a high relevance rating, a continuous level of publications and a long enough presence in the analysis system allowed their posts to be integrated as part of the supervised learning subsystem. Since the data were replenished and updated on a monthly basis, a group of dynamic text standards appeared inside the system. Within every time period this group formed a set of messages on which supervised training took place. The results of this learning were used to assess the relevance of other messages.

Due to the dynamic assessment of likelihood and plausibility, areas of highlighted attention appear. It is possible to draw parallels between the mechanism proposed in the article and the attention mechanisms widely used in neural networks [10]. In our case, we have two areas of attention: relevant and irrelevant. Each area had its own confidence vectors formed on the basis of features. The final positioning of the message was a result of the joint application of these vectors and context filters.

A rapid increase in the popularity of large language models and the spread of deepfake technologies [11] make it possible to generate messages that have all the signs of relevance. These messages can be distinguished from the real ones only by complex features that are quite difficult to find.
It became obvious that the methods proposed in this article and in [3] have good prospects for successfully classifying such texts. A combination of origin and text analysis can help to estimate how fake a message is. On the other hand, to be applied in such tasks, the described methods for computing the relevance level require significant refinements and improvements.
5 Conclusion
This paper describes a solution that was tested on a real project with more than 70 mln records from social networks. The approach is based on the separation of relevant and irrelevant information, in the case of very similar data content, by means of message source analysis. We suppose that the described method can be successfully applied both to text and to other data that are dynamically updated and have multiple sources.
References

1. Crawford, M., Khoshgoftaar, T.M., Prusa, J.D., et al.: Survey of review spam detection using machine learning techniques. J. Big Data 2, 23 (2015). https://doi.org/10.1186/s40537-015-0029-9
2. Dixit, S., Avinash, J.A.: Survey on review spam detection. Int. J. Comput. Commun. Technol. ISSN (PRINT) 4, 975–7449 (2013)
3. Artamonov, I.M., Artamonova, Y.N.: Multilevel separation pipeline for similar structure data. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. Studies in Computational Intelligence, vol. 1064, pp. 455–465. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_47
4. Shubha, M., Piyush, S., Ratish, A.: Analyzing machine learning enabled fake news detection techniques for diversified datasets. Wirel. Commun. Mob. Comput. 2022, Article ID 1575365 (2022)
5. Kaur, G., Malik, K.: A comprehensive overview of sentiment analysis and fake review detection. In: Marriwala, N., Tripathi, C.C., Kumar, D., Jain, S. (eds.) Mobile Radio Communications and 5G Networks. LNNS, vol. 140, pp. 293–304. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-7130-5_22
6. Deng, J.: Email spam filtering methods: comparison and analysis. Highlights Sci. Eng. Technol. 38, 187–198 (2023)
7. Fraser, D.: The p-value function and statistical inference. Am. Stat. 73(1), 135–147 (2019)
8. Infanger, D., Arno, S.-T.: P value functions: an underused method to present research results and to promote quantitative reasoning. Stat. Med. 38(21), 4189–4197 (2019)
9. Mohan, V.: Preprocessing techniques for text mining - an overview. Int. J. Comput. Sci. Commun. Netw. 5(1), 7–16 (2015)
10. Zhaoyang, R., Scott, S.B., Martire, L.M., Sliwinski, M.J.: Daily social interactions related to daily performance on mobile cognitive tests among older adults. PLoS ONE 16(8), e0256583 (2021). https://doi.org/10.1371/journal.pone.0256583
11. Westerlund, M.: The emergence of deepfake technology: a review. Technol. Innov. Manage. Rev. 9, 40–53 (2019). https://doi.org/10.22215/timreview/1282
Adaptive Behavior and Evolutionary Simulation
Analysing Family of Pareto Front-Based Evolutionary Algorithms for PINNs: A Case Study of Solving the Laplace Equation with Discontinuous Boundary Conditions Tatiana Lazovskaya(B) , Dmitriy Tarkhov, Maria Chistyakova, Egor Razumov, Anna Sergeeva, and Veronika Palamarchuk Peter the Great St. Petersburg Polytechnic University, Polytechnicheskaya, 29, 195251 St. Petersburg, Russia [email protected] http://www.spbstu.ru
Abstract. An evolutionary algorithm based on the Pareto front is proposed for constructing a solution to an ill-posed multi-criteria problem. It incorporates information about the desired solution at different stages of training neural network models. The Laplace equation in the unit square with discontinuous boundary conditions is used as a case study. The algorithm is compared with the classical one, significant advantages are demonstrated, and the influence of hyperparameters on the results is studied. Keywords: Pareto front · physics-informed neural networks · discontinuous boundary conditions · Laplace equation · multi-criteria
1 Introduction
In the field of process and object modelling, ill-posed physical-mathematical problems can be effectively addressed using neural networks [1–3], which have become increasingly popular in recent years under the name of physics-informed neural networks (PINNs) [4]. Multi-criteria optimization tasks require special attention in terms of identifying the optimal solution. Determining which conditions to prioritize and how to incorporate them into the problem-solving algorithm can be challenging. Additionally, it may be unclear whether all relevant data is available from the outset. To address these issues, a cohort of solutions can be constructed to enable the selection of a solution that meets the present criteria and can be updated with new information as it becomes available. Evolutionary algorithms are quite successfully used for such problems [5–7].

This work was supported by the Russian Science Foundation under grant no. 22-21-20004, https://rscf.ru/project/22-21-20004/.
As a case study of a multi-criteria problem, the Laplace equation with a discontinuous boundary condition is taken into consideration. The described formulation leads to instability in the solutions obtained by the commonly used Fourier method, whereas solutions generated using artificial neural networks are robust to first-kind discontinuities and exhibit relatively low error. The analytical solutions produced by the physics-informed neural networks (PINNs) in this study are generated by adjusting the weights of preselected basis functions during network training using an evolutionary algorithm.

This study extends previous work [8], which utilized the Pareto front as a foundation for evolutionary algorithms to develop neural network solutions that are informed by physics. In the previous study, a set of penalty parameters was generated to train the network, with each parameter value corresponding to an optimal solution. This approach resulted in a set of analytical solutions that approximated the Pareto front, which was the focus of our selection process. Additionally, the previous work explored the use of an evolutionary algorithm to construct a new Pareto front based on the initial one. Our current study builds upon these findings and further expands on the idea of using the Pareto front as a basis for evolutionary algorithms by embedding it in a more general evolutionary scheme.
2 Materials and Methods

2.1 Problem Statement
Let us consider the Laplace equation given in the unit square

Δu(x) = 0,  x ∈ [0, 1]²,  (1)

with discontinuous boundary conditions (BC)

u(x, 0) = u(0, y) = 0,  u(x, 1) = u(1, y) = 1,  x, y ∈ (0, 1).  (2)
This task is ill-posed and does not have an accurate analytical solution. If we consider the solution v(x, y) found by the Fourier method (in the case of truncation at 100 terms), it has the following characteristics:

– the root-mean-square error of satisfying the boundary condition on the derivative of the function at 1000 trial points x_i on the upper boundary (y = 1)

$$\delta^{1000}_{d\,BC}(v) = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000}\Big(\frac{\partial}{\partial x}\, v(x_i, 1)\Big)^2} = 33.74; \qquad (3)$$

– the root-mean-square error of satisfying the boundary condition at 10 trial points x_i, y_i on each boundary

$$\delta^{40}_{BC}(v) = \sqrt{\frac{1}{40}\sum_{i=1}^{10}\Big( v^2(x^1_i, 0) + v^2(0, y^1_i) + \big(v(x^2_i, 1) - 1\big)^2 + \big(v(1, y^2_i) - 1\big)^2 \Big)} = 0.02; \qquad (4)$$
– the root-mean-square error of satisfying the Laplace equation at 1000 trial points x_i on the unit square

$$\delta^{1000}_{\Delta}(v) = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000}\big(\Delta v(x_i)\big)^2} = 0.00. \qquad (5)$$
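For illustration only (not code from the paper), the three criteria above can be computed for any candidate solution v(x, y) given as a vectorized function; the finite-difference derivatives and the way trial points are drawn are our assumptions.

```python
import numpy as np

def delta_d_bc(v, n=1000, h=1e-4):
    """Criterion (3): RMS of the x-derivative of v on the upper boundary y = 1."""
    x = np.linspace(0.0, 1.0, n)
    dv_dx = (v(x + h, np.ones(n)) - v(x - h, np.ones(n))) / (2.0 * h)
    return np.sqrt(np.mean(dv_dx ** 2))

def delta_bc(v, m=10):
    """Criterion (4): RMS boundary-condition error at m trial points per side."""
    t = np.linspace(0.05, 0.95, m)
    errs = np.concatenate([
        v(t, np.zeros(m)) ** 2,          # u(x, 0) = 0
        v(np.zeros(m), t) ** 2,          # u(0, y) = 0
        (v(t, np.ones(m)) - 1.0) ** 2,   # u(x, 1) = 1
        (v(np.ones(m), t) - 1.0) ** 2,   # u(1, y) = 1
    ])
    return np.sqrt(errs.sum() / (4 * m))

def delta_laplace(v, n=1000, h=1e-3):
    """Criterion (5): RMS residual of the Laplace equation at n interior points."""
    rng = np.random.default_rng(0)
    x, y = rng.uniform(h, 1.0 - h, size=(2, n))
    lap = (v(x + h, y) + v(x - h, y) + v(x, y + h) + v(x, y - h) - 4.0 * v(x, y)) / h ** 2
    return np.sqrt(np.mean(lap ** 2))
```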
The error $\delta^{1000}_{d\,BC}(v)$ is notably large, and additionally, the Gibbs effect is observed along the boundaries. Can this solution be considered optimal for problem (1)–(2)? If the boundary condition on the derivative of the solution is essential, the solution cannot be considered optimal. However, this function can still serve as an additional criterion for evaluating the quality of solutions obtained through alternative methods. Due to the requirement that the solution must simultaneously be equal to 0 and 1 at the corners of the square, the continuous solution we are constructing cannot satisfy these conditions, leading to expected significant errors near these points. However, it is desirable that the errors in other areas are minimal and that the solution is comparable to those obtained through fundamentally different approaches (such as the classical Fourier method in our case).

2.2 General Neural Networks Approach (PINNs)
To solve problem (1)–(2), a population of physics-informed neural networks is constructed, where each individual is represented in the form

$$u_{NN}(x, y, n, \mathbf{a}) = \sum_{i=1}^{n} a^1_i\, b(x, y, a^2_i),$$
where n is the number of neurons in the hidden layer and $b(x, y, a^2_i)$ is a basis function. The vector of parameters a is adjusted during network training, which involves multi-criteria optimization with respect to the mean-square errors corresponding to expressions (4) and (5) above. Note that using radial basis neural networks is optimal when solving differential equations in a domain of this type, because these networks allow for local approximation in a small neighbourhood of each point, without the need to determine values beyond the domain boundaries.
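A minimal sketch of one such individual with Gaussian radial basis functions; the paper fixes only the general form u_NN = Σ a¹_i b(x, y, a²_i), so the specific basis, parameterization and initialization below are our assumptions. Such an individual can be passed directly to the error functions sketched earlier.

```python
import numpy as np

class RBFIndividual:
    """u_NN(x, y) = sum_i a1[i] * exp(-||(x, y) - c_i||^2 / w_i^2) with n hidden neurons."""

    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        self.a1 = rng.normal(scale=0.5, size=n)       # outer weights a^1_i
        self.centers = rng.random((n, 2))             # centres, part of the inner parameters a^2_i
        self.widths = 0.2 + 0.3 * rng.random(n)       # widths, the rest of a^2_i

    def __call__(self, x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        d2 = (x[..., None] - self.centers[:, 0]) ** 2 + (y[..., None] - self.centers[:, 1]) ** 2
        return np.exp(-d2 / self.widths ** 2) @ self.a1
```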
2.3 Family of Evolutionary Algorithms for PINN Training
In this study, we consider evolutionary algorithms for training physics-informed neural networks based on the Pareto front, including schemes of mutation, crossing, and selection of individuals according to specified criteria. The presented approach involves varying the penalty factor in the loss function to reflect the multi-criteria nature of the optimization problem. By constructing and training separate solutions for different penalty factor values, an analogue
of the Pareto front is generated, and the best instances are selected using a chosen criterion. Furthermore, it is assumed that individuals for mutation can be selectively chosen from the initial population. Thus, this component of the evolutionary algorithm implements the principle of survival of the fittest individuals from the population. Figure 1 depicts the evolutionary interpretation of the procedure, which we refer to as the Pareto Mutation.
Fig. 1. Evolutionary interpretation of Pareto mutation. This diagram illustrates how an analogue of the Pareto front of solutions to the problem is constructed and how a new generation of solutions is selected through evolutionary processes.
Consider the general scheme of the evolutionary algorithm for the formation of a population of solutions. It is shown in Fig. 2. The concept of crossing involves reproducing a new generation with novel quantitative or qualitative characteristics, such as additional neurons or improved performance in a particular criterion. Methods for selecting individuals for crossing can also be included in the criteria block II. The idea is that by incorporating additional information, such as expert knowledge and measurements, it is possible to use different criteria that influence the final result at various stages of the algorithm. This enables the generation
Fig. 2. General scheme of the evolutionary algorithm for forming a population of solutions. The circles with numbers indicate points where different criteria corresponding to a specific problem can be introduced.
of solutions that satisfy specific conditions or account for the high importance of certain constraints. The next section provides an example and analysis of the results obtained using this approach.
3 Computational Experiments and Results
Here, we present one of the algorithms belonging to the aforementioned family and the corresponding results. Work [8] demonstrated that neural networks can provide solutions to problem (1)–(2) without exhibiting the Gibbs effect. To evaluate the results of the current algorithm, we employ a special criterion in addition to visual inspection. The analytical solution obtained using the Fourier method has the largest error for the derivative at the boundary of the square when y = 1, as indicated by the criterion of the form (3). This criterion is used to help determine the optimal solution for problem (1)–(2).
Pareto Mutation. Unlike [8], where copies of individuals selected from the initial population were used, the Pareto mutation in this study is more sophisticated. Firstly, n = 100 neural network individuals $u^i_{NN}(x, y, 1)$ with one neuron are randomly generated and the general initial penalty parameter is calculated as

$$\lambda_1 = \sum_{i=1}^{n} \delta^{50}_{\Delta}\big(u^i_{NN}(x, y, \mathbf{a}, 1)\big) \Big/ \Big( \sum_{i=1}^{n} \delta^{40}_{BC}\big(u^i_{NN}(x, y, \mathbf{a}, 1)\big)\, P_1 \Big).$$
Next, the values λ_i = λ_1 i, i = 1, …, P_1, are utilised as penalty multipliers for the loss function $\delta^{50}_{\Delta}(u_{NN}(x, y, \mathbf{a}, 1)) + \lambda_1 i\, \delta^{40}_{BC}(u_{NN}(x, y, \mathbf{a}, 1))$ and to select the best individuals from the initial population based on the minimisation criteria. The mutation of the selected individuals, i.e., the adjustment of parameters during network training to minimize the corresponding loss, occurs independently for a fixed number K_1 of training epochs. The second selection is not carried out, that is, P_1 = P_2. The second Pareto mutation in the general scheme is identical, apart from the final one.

Crossing. After training the current generation of networks, the neurons of the individuals with numbers 1 and P_1 are systematically added one by one to all networks. The external parameter a^1_i is then calculated using the principle of least squares. For each network, a new individual with the minimum error is selected from the population to be included in the next generation.

Termination Condition. The termination condition for the algorithm is based on achieving a predetermined number N of neurons in the final population. Additionally, the final Pareto mutation 2 trains the network for K_2 epochs and selects P_2 = 3 individuals from the trained population that meet the following criteria:
– $\dfrac{\delta^{50}_{\Delta}(u^j_{NN}(x,y,\mathbf{a},N))}{\max_j \delta^{50}_{\Delta}(u^j_{NN}(x,y,\mathbf{a},N))} + \dfrac{\delta^{40}_{BC}(u^j_{NN}(x,y,\mathbf{a},N))}{\max_j \delta^{40}_{BC}(u^j_{NN}(x,y,\mathbf{a},N))}$;
– $\min_j \delta^{50}_{\Delta}(u^j_{NN}(x, y, \mathbf{a}, N))$;
– $\min_j \delta^{40}_{BC}(u^j_{NN}(x, y, \mathbf{a}, N))$.
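A schematic outline of this Pareto mutation and the final selection, written against assumed callables delta_eq and delta_bc (the two error terms) and train (a routine that trains one individual for a given number of epochs); the structure follows the text, including the reconstructed normalisation of λ₁ by P₁, while all names and implementation details are ours.

```python
import numpy as np

def pareto_mutation(population, delta_eq, delta_bc, train, P1, K1):
    # general penalty scale lambda_1 (normalised by P1, as reconstructed above)
    lam1 = sum(delta_eq(u) for u in population) / (P1 * sum(delta_bc(u) for u in population))
    survivors = []
    for i in range(1, P1 + 1):                            # lambda_i = i * lambda_1
        loss = lambda u, lam=lam1 * i: delta_eq(u) + lam * delta_bc(u)
        best = min(population, key=loss)                  # selection by the scalarized loss
        survivors.append(train(best, loss, epochs=K1))    # independent mutation (training)
    return survivors

def final_selection(population, delta_eq, delta_bc):
    eq = np.array([delta_eq(u) for u in population])
    bc = np.array([delta_bc(u) for u in population])
    normed = eq / eq.max() + bc / bc.max()                # first criterion: sum of normalized errors
    picks = {int(normed.argmin()), int(eq.argmin()), int(bc.argmin())}
    return [population[i] for i in picks]                 # up to P2 = 3 individuals
```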
Experiments. During the study, we conducted 5 runs of the algorithm with the parameter configurations given in Table 1. For each experimental condition, we selected the best solutions based on the Pareto front and imposed constraints on the maximum error values. Our experimental results demonstrate that the number of training epochs for individuals in all algorithm iterations except the last one (K_1) significantly affects the quality of the solutions generated. Specifically, individuals with K_1 = 100 exhibit the best performance in terms of the Pareto front for all final populations. The corresponding scatter plot in Fig. 3 shows this for the additional criteria, such as the agreement with the solution obtained by the Fourier method and the vanishing of the derivative at the upper boundary of the square.
Table 1. Algorithm parameter configurations for series of experiments

P1: 9, 10
N: 10, 20, 30
K1: 50, 100
K2: 400, 800
Fig. 3. Scatter diagram of the errors of the obtained solutions with respect to the additional conditions: the derivative at the upper boundary and the agreement with the analytical Fourier solution. Colour markers represent different values of the number K1 of training epochs for neural network individuals (except for the final stage).
Regarding the final number of neurons, individuals with 10 neurons do not perform optimally, while there is no significant difference between individuals with 20 and 30 neurons. Next, let us compare these results with classically trained networks of 10, 20 and 30 neurons and a fixed penalty multiplier calculated at the initialization stage. Five such runs were conducted. None of the solutions obtained in this classical way, unlike those produced by the evolutionary approach, met the constraints we set for the error of satisfying the Laplace equation; therefore, among them we were unable to select a solution that met our criteria.
4 Conclusions and Discussion
This paper presents a general scheme for an evolutionary algorithm based on the Pareto front, with multiple points at which additional criteria can be introduced. We conducted a series of experiments on solving the Laplace equation in the unit square with discontinuous boundary conditions, using a representative of this algorithm family.
Our analysis of the results showed that increasing the total number of neurons in the physics-informed neural network from 10 to 20 resulted in a significant improvement in the results, while there was no significant difference between solutions with 20 and 30 hidden-layer neurons. The number of training epochs also had an impact on the results. Furthermore, we compared our proposed method with neural network solutions obtained by classical training, and our method showed an overwhelming advantage. Overall, the presented family of algorithms based on the Pareto front is promising and warrants further development.
References

1. Cai, S., Wang, Z., Wang, S., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks for heat transfer problems. J. Heat Transf. 143(6), 060801 (2021). https://doi.org/10.1115/1.4050542
2. Lazovskaya, T., Malykhina, G., Tarkhov, D.: Physics-based neural network methods for solving parameterized singular perturbation problem. Computation 9, 9 (2021). https://doi.org/10.3390/computation9090097
3. Nguyen, T.N.K., Dairay, T., Meunier, R., Mougeot, M.: Physics-informed neural networks for non-Newtonian fluid thermo-mechanical problems: an application to rubber calendering process. Eng. Appl. Artif. Intell. 114, 105176 (2022). https://doi.org/10.1016/j.engappai.2022.105176
4. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045
5. Abualigah, L., Diabat, A.: A novel hybrid antlion optimization algorithm for multiobjective task scheduling problems in cloud computing environments. Clust. Comput. 24(1), 205–223 (2020). https://doi.org/10.1007/s10586-020-03075-5
6. Coello, C.A.C., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-objective Problems. Springer, New York (2007). https://doi.org/10.1007/978-0-387-36797-2
7. Lu, B., Moya, C., Lin, G.: NSGA-PINN: a multi-objective optimization method for physics-informed neural network training. Algorithms 16, 194 (2023). https://doi.org/10.3390/a16040194
8. Lazovskaya, T., et al.: Investigation of Pareto front of neural network approximation of solution of Laplace equation in two statements: with discontinuous initial conditions or with measurement data. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 406–414. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_42
Unawareness as a Cause of Determinism Violation. A Metaphoric Model Vladimir B. Kotov and Zarema B. Sokhova(B) Federal State Institution “Scientific Research Institute for System Analysis of the Russian Academy of Sciences”, Moscow, Russia [email protected]
Abstract. The research models the functioning of an autonomous agent or group of such agents. The aim is to investigate the significance of unawareness as a main cause for the unpredictability of events. The model world looks like a cellular field (two-dimensional array). There are two types of subjects: agents and observers. Because of deficient perception the subjects have limited knowledge. Every agent has an energy reserve and certain information including the map of the passed way. The computer simulation is used to investigate the model. When dealing with an agent, several scenarios are possible: 1) the agent masters the working space fully, 2) the agent harnesses part of the space to meet his energy requirements, 3) the agent explores part of the space that cannot provide him with sufficient energy. To investigate the effect of the environment, we put agents in the working space of a complicated topology where cells make up two or more congruent two-dimensional arrays rather than one two-dimensional field. Additionally, we describe different mechanisms of the interaction of groups of agents and analyze possible results of such interactions. The study shows that even a closed system with deterministic laws can demonstrate unpredictability. Keywords: Unawareness · Determinism · Autonomous Agents
1 Introduction

Laplace's determinism [1] looks like a rather attractive idea due to its self-sufficiency. According to this concept, the current state of a system (given the necessary time derivatives) uniquely determines the future of the system (as well as its past). Determinism is an obvious thing for finite closed mechanical systems: it naturally follows from the equations of motion. The generalization of the determinism concept to more complex systems (systems that are infinite, unclosed, non-mechanical, social, etc.) is not always justified. Our everyday experience says that we cannot predict changes of the environment and the behavior of beings in this world. Humans are inclined to underestimate the importance of the major source of uncertainty – insufficient information about the world and themselves (unawareness). The effect of external factors (supernatural forces) and internal factors of a person (free will) are often referred to instead. The explanation lies with the
hope of a human to benefit by learning to control these factors: to negotiate with supernatural forces and to control the will (his own will or other people's will). Both factors seem mysterious on account of the incomprehension of their underlying mechanisms and reasons. All this is aggravated by various chaos theories [2–4] proving the unpredictability of complex systems. Referring to strict mathematical justification while being applied to real systems, these theories produce the illusion of a scientific approach and confuse the issue completely. The underestimation of the unawareness factor stems from a misunderstanding of the world-cognoscibility thesis [5]. Although in theory any effect can be explained and analyzed, in reality a being acts in conditions of insufficient information about the environment, which results from restricted sensory perception and storage capacity. Expectedly, it is unawareness (incomplete information) that is the main, and sometimes the only, cause of the unpredictability of events.
2 The Basic Model

To build and study the model, we use an agent-based approach [6–8]. There are two types of subjects in the world: agents and observers. Agents act according to predefined rules: they move across a certain area and get vital energy from energy sources located at fixed points. Observers watch and draw conclusions about the efficiency of agents. Both agents and observers can be ill-informed due to perception deficiency. We can regard agents as preprogrammed robots, and observers as programmers who evaluate the efficiency of agents to improve the supervisory program. An observer doesn't bring changes into the world of agents. He only places a robot at a particular point of the working space and turns it on. Next, the robot acts independently. Yet another type of subject, supervisors, is possible in theory. Unlike observers, supervisors have complete information about the agents' world and can change it (e.g., open and close paths; move, remove or add energy sources, etc.). A supervisor models the determinism violation caused by "supernatural forces". We don't use this type in our model.

Let us detail the model under consideration. The space and time are regarded as discrete quantities. The working space consists of adjacent cells (neighboring cells have a common boundary by definition). If cells form a regular structure (e.g., a rectangle in the two-dimensional case), it is convenient to use their (integer-valued) coordinates for indexing. Some cells have feeders: charging stations for robots and food (energy) sources for agents (Fig. 1). Energy E in a feeder is replenished by dE at each time step until it reaches the maximum E_m. The energy capacity e of an agent (robot) can't exceed e_m (given several agents, the index of the agent is added). If an agent is in a cell with a feeder, it takes as much energy as possible. In charging, the feeder spends more energy than the agent receives; the difference d_ec is needed to do the charging. If a cell with a robot has no feeder or the robot is already charged, the agent must spend energy d_em to move to a neighboring cell at the next time step. Given several neighboring cells, the next cell is chosen according to a determinate rule with account of the agent's knowledge and energy. An agent's knowledge includes travel records. When in a cell, an agent has information about the availability of a feeder, its energy reserve, and the location of neighboring
Fig. 1. An example of the rectangular world of agents.
cells. The knowledge grows as the agent gets to know the working space. An agent can draw simple conclusions from available information; e.g., he can predict the energy reserve of a distant feeder (the rate of energy accumulation is the same for all feeders and can be known to the agent). With the minimal knowledge (only about visited cells) an agent uses the simplest decision-making algorithm, which we describe below.

Given a sufficient energy reserve (e > e_th), an agent switches to exploratory behavior. He goes to the nearest unexplored cell by the shortest way. When there are several such ways, an additional criterion is used to choose one of them. The procedure of path selection can involve a quasi-random number sequence. This sort of sequence is not fully random: it is known in advance, so the procedure remains determinate. Having reached the unexplored cell, the agent continues to explore the neighboring cells, choosing the next path with the determinate rule. If there are no unexplored cells left, the agent keeps imitating the exploratory behavior, moving across the familiar area along the path chosen with the determinate procedure.

When the agent's energy drops to the threshold value (e < e_th), the feeding behavior is activated. The agent begins moving to a feeder with a sufficient energy reserve (known or predicted) by the shortest way. If such feeders are not known to him, he moves to a known feeder with the greatest energy reserve. If no feeder is known to him, the agent continues the exploratory behavior. If the energy e drops to the level that makes further movement impossible, the agent dies (gets deactivated).

Of course, it is possible to improve the algorithm of an agent's behavior to maximize his survivability. For instance, it is possible to provide an agent with the ability to stay put when the exploratory behavior becomes pointless (when the whole working space is already explored). This would allow energy saving – energy consumption is lower
at rest than in motion. It is possible to optimize the visitation of feeders by directing an agent to solve the optimization problem. However, these improvements are not relevant to our research. What is important for us is that an agent’s behavior complies with the determinate procedure and is predictable.
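A compact sketch of this deterministic behaviour rule; it keeps only the bookkeeping described above and, for brevity, replaces the shortest-path search with a greedy choice among neighbouring cells, so all data structures and parameter values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    pos: tuple                 # current cell (row, col)
    e: float                   # current energy reserve
    e_max: float = 10.0        # capacity e_m
    e_th: float = 4.0          # hunger threshold e_th
    d_em: float = 0.5          # energy spent per move (d_em)
    known: dict = field(default_factory=dict)   # visited cell -> last seen feeder energy
    alive: bool = True

def step(agent: Agent, feeders: dict, neighbours) -> None:
    """One time step: feed if possible, then explore or head towards known food."""
    if not agent.alive:
        return
    if agent.pos in feeders:                                   # take as much energy as possible
        take = min(feeders[agent.pos], agent.e_max - agent.e)
        feeders[agent.pos] -= take
        agent.e += take
    agent.known[agent.pos] = feeders.get(agent.pos, 0.0)
    options = neighbours(agent.pos)                            # deterministic ordering assumed
    if agent.e > agent.e_th:                                   # exploratory behaviour
        unexplored = [c for c in options if c not in agent.known]
        target = unexplored[0] if unexplored else options[0]
    else:                                                      # feeding behaviour
        candidates = [c for c in options if agent.known.get(c, 0.0) > 0] or options
        target = max(candidates, key=lambda c: agent.known.get(c, 0.0))
    agent.pos = target
    agent.e -= agent.d_em
    if agent.e <= 0:
        agent.alive = False                                    # the agent dies (is deactivated)
```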
3 A Single Agent

Let there be only one agent, landed in an unfamiliar area. The area is a two-dimensional array of cells (e.g., a rectangle). An observer can see the whole area and the agent's actions. At first the agent explores the area by visiting unknown cells. The agent finds feeders where he replenishes his energy. When hungry (e < e_th), the agent returns to one of the discovered feeders. Depending on the power characteristics of the agent and feeders, and the number and location of feeders, different outcomes of the agent's activity are possible.

With high power efficiency of the agent (i.e., with low energy consumption) and sufficient feed supplies (when there are enough feeders, and their energy replenishment rate is high and their arrangement is convenient), the agent will explore the whole area and start imitating the exploratory behavior, occasionally paying attention to feeding (Fig. 2a). The world, which was strange to the agent at first, eventually becomes fully known. Given intellectual resources, the agent can determine the pattern of his activity and predict future events. The determinism in this case is equivalent to full awareness.

With lower power efficiency of the agent and/or poorer feeding supplies, a situation arises when the agent has to move around a concentration of feeders, being unable to go away from it to explore distant regions of the working space – the rapid energy consumption makes him feel hunger and return to the feed sources. The agent's world is the explored portion of the working space, where all events are predictable (Fig. 2b). However, the uncertainty remains – there is a possibility of reaching an unexplored cell. In general, this possibility is predictable (given the intellectual abilities of the agent). The consequences are unpredictable. The agent can find a new feeder and continue explorations. Alternatively, he can die of hunger, not having time to return to a feeder.

Under less favorable conditions the agent is sure to die of hunger, having explored just a portion of the area (Fig. 2c). The agent faces high uncertainty – he can hope to find a feeder till the last moment. It is necessary to point out that the tragic outcome is highly dependent on the initial location of the agent. The observer, if he can set different initial positions of the agent, plays the part of "supernatural forces". However, it is pointless to negotiate with the observer – once the agent acts, the observer can't control the situation. For the observer the succession of events is predictable; he knows the agent's behavior beforehand. If there were a supervisor, he could not only predict, but also affect the agent's actions. In all probability the supervisor would get bored with determinism and want to change the rules of the game, which would surprise both the agent and the observer. The effect of the supervisor is not considered in our research.

Let us introduce unawareness by complicating the agent's working space. Let cells form two or more congruent two-dimensional arrays rather than a single two-dimensional array (Fig. 3). We can say that the agent's working space has several floors or levels. There are connections between the floors. Once at a connection point (cell) on one floor,
Fig. 2. The density of visits of cells when an agent has explored a) the whole area, b) a portion of the area enough to meet his energy needs, c) a portion of the area insufficient to meet his energy needs (the agent dies). The darker a cell is, the more frequently it is visited.
an agent gets to another (predetermined) floor. The locations of the inter-floor connections are predefined, but known to neither the agent nor the observer. An agent is unaware of passing to another floor and is unable to know the floor number – all floors look alike to him. He can be fully unaware of the existence of floors. In the agent's mind the world is a flat area. However, the placement of feeders is different on different floors. Getting to another floor, an agent discovers that the previously found feeder is gone; instead, a new feeder appears at a new location. It is necessary for him to explore the area all over again. Carrying out extensive exploration, an agent can find all feeders on all floors. Of course, the conditions are supposed to be favorable, so that the agent avoids death from hunger. Depending on the frequency of inter-floor transfers, an agent can have different impressions. With frequent inter-floor transfers, an agent hasn't much time to explore each floor. The floor maps merge into a single map with appearing and disappearing feeders. A longer exploration of the working space will allow the agent to work out the probabilities of feeder appearance. Agents can interpret the probabilistic behavior of feeders differently, because they have different abilities, knowledge and experience. In any case, the determinism of the world will be refuted.
Fig. 3. An example of a multi-storey working space.
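A sketch of such a multi-storey working space, in which a position is a triple (floor, row, col) and stepping onto a transfer cell silently switches the floor; the data layout is an assumption made for illustration.

```python
class MultiFloorWorld:
    def __init__(self, floors, rows, cols, transfers, feeders):
        # transfers: {(floor, row, col): target_floor}; feeders: {(floor, row, col): energy}
        self.floors, self.rows, self.cols = floors, rows, cols
        self.transfers = transfers
        self.feeders = feeders

    def move(self, pos, d_row, d_col):
        floor, r, c = pos
        r = min(max(r + d_row, 0), self.rows - 1)
        c = min(max(c + d_col, 0), self.cols - 1)
        # an inter-floor transfer changes the floor, but the agent perceives only (r, c)
        floor = self.transfers.get((floor, r, c), floor)
        return floor, r, c

    def agent_view(self, pos):
        floor, r, c = pos
        return (r, c), (floor, r, c) in self.feeders   # what the cell looks like to the agent

    def observer_view(self, pos):
        return pos[1], pos[2]                          # the observer sees all floors as one flat area
```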
With infrequent inter-floor transfers, an agent has more time to explore each floor and form its map. After coming over to another floor, the agent sees that the map is incorrect and starts composing the map again by exploring the area. If the agent has enough memory to store old maps rather than redraw them, he can find one of the maps suitable for use after another transfer. Besides, it becomes possible to identify floors (given a small number of floors). The uncertainty remains at the time of a transfer. This is related to the lack of knowledge of inter-floor transfer placement. Careful research with making and testing hypotheses is needed to overcome this uncertainty. This sort of work implies that the agent has an advanced intellect. The finding of transfer locations rehabilitates the determinism. Given a developed intellect of the agent (and favorable conditions), it is possible to surmount the unawareness and determine the real arrangement of the world in this case. Given a low intellectual level of the agent, the unawareness remains to an extent. The world is full of accidences of unknown origins. If the agent should learn to identify floors (with accidental inter-floor transfers), he may want to learn to identify them by indirect considerations. This is how superstitions – unfounded hypotheses called upon to lessen uncertainty – develop. As far as an observer is concerned, his unawareness stems from the fact that he sees all the floors as one. In the observer’s view, an agent moves over a single flat area which has feeders from all the floors. The observer sees that an agent not always replenishes his energy after entering a cell with a feeder. Moreover, the agent doesn’t go to where he should go according to the determined rule. The observer may draw a conclusion that the agent acts of his own free will, rather than follows a predetermined pattern. In reality, this is an illusion, and the observer can infer that the cause of it is his unawareness of the working space arrangement or the existence of hidden circumstances that influence the agent. Long observations can help discover the true cause of the unexplained behavior; yet the observer is unlikely to make considerable mental efforts to work it out. Note that in the case of a multi-floor working space, it is necessary to modify the algorithm that defines an agent’s behavior to allow for the conditional character of the exploration degree of feeding cells.
4 Several Agents

Modelling the behavior of several agents requires the introduction of an agent interaction mechanism [9–11]. Let us suppose that agents interact only when they are in one cell or in neighboring cells. In the simplest case agents don't interact at all, each moving independently of the others. It is necessary to define how agents share the energy when they are in one feeding cell at the same time. It is natural to consider that agents have equal rights to energy (food). Accordingly, the feeder energy reserve E, less the operation expenditure, is equally divided among the agents, provided the resultant energy e of each agent doesn't exceed e_m. Otherwise, the surplus energy is given to another agent (other agents). If the available energy reserve of the feeder is greater than the total demand of the agents for energy, the excess remains at the feeder (the charging station).

Even without interaction, the mutual influence of agents can be rather significant. For instance, if an observer puts two agents consecutively at the same point, the second
agent takes the same path as the first agent. The feeders the second agent visits turn out to be somewhat depleted by the first agent. Under certain conditions the second agent can die because of an energy shortage. From the agent's point of view, it is his "bad luck", which implies accidental effects. However, it is the determinism of the behavioral algorithm that makes the second agent experience the misfortune. Since an observer knows the placement of agents (and the working space structure), he can easily predict the agent's fate. He can even prevent the agent from having ill luck by placing him at a different starting point. In this event, the observer plays the part of supernatural forces. Although he determines only the starting point and time of the agent's activity, he influences not only his future but also the future of the other agents present in the working space.

In general, not knowing the placement of other agents is the primary cause of an agent's uncertainty. Even if he has explored the whole area and knows the locations of feeders (charging stations), he can't know the amount of energy they have. The agent can only estimate the energy stored in a feeder by assuming that after his last visit it hasn't been visited by other agents, or that there has been only one visit with a particular decrease of the energy reserve, or the like. The reality can be quite different from the agent's expectations. Unawareness lessens the survivability of the agent. The capacity of the working space (the number of agents able to subsist in it) proves considerably lower than the maximum value. The greatest capacity corresponds to the case when the reserve of each feeder keeps being replenished rather than remaining the same (at level E_m). It is a typical situation when agents use feeders that are located most conveniently and ignore other feeders. With a sufficiently large number of agents, the convenient feeders can't provide the necessary amount of energy, which results in the population of agents falling.

The capacity of the working space can be increased by introducing the interaction of agents. The interaction is possible only when agents meet. Let us leave aside such interactions as the prohibition for some agents to be in one cell: this sort of interaction only complicates the matter and doesn't contribute to the understanding of the problem. Interactions like antagonism or mutual aid are more interesting. An example of antagonistic behavior is the death of one agent when two agents meet in one cell. The strength of an agent is determined by his energy level e. A softer sort of antagonistic interaction is when a strong agent takes some energy from a weak one (which can also have fatal consequences for the weak agent, however). Though this kind of interaction leads to the death of some agents, it can be useful in terms of increasing the capacity of the working space because it results in a more even distribution of agents. From the point of view of an agent, the antagonistic behavior adds unpleasant surprises; now survival is not secured even over the period defined by the energy level. Life becomes fully unpredictable. Even the possibility to benefit from weak agents is a doubtful payment.

When two agents meet in one cell, reciprocal help proves more useful for both of them, or at least for the weak one. The strong agent can help the weak one by giving him part of his energy; this increases the chances of the weak agent to survive.
This sort of help seems natural from the human point of view, yet it doesn’t raise the capacity of the working space (in
most cases it becomes lower). This fact is related to both the decreased energy of the strong agent and increased local population of agents. Another sort of reciprocal help – an exchange of knowledge – benefits both agents. The agents share travel maps to help each other to explore the area as quick as possible. When two agents meet, they can also share the information about the meetings with other agents. Extra knowledge lessens uncertainty and increase the survivability of agents. Note that with the unrestricted exchange of information, when two agents meet, they can have the same knowledge. In this event the decision-making algorithm can make the agents travel together (until the hunger makes the weak agent go towards a feeder). This results in unwelcome concentration of agents. To avoid this, it is necessary to provide for an algorithm that compels them to part after the exchange of information. By informing the others of their intentions, agents can coordinate actions to avoid gathering at one feeder. It is possible because the knowledge influences the choice of an agent’s actions. The exchange of information implies the advanced intellect of the agent. The intellect is needed to integrate information from other agents and use the accumulated knowledge to control his behavior. A picture of the world formed in the agent’s mind turns out to be rather complex: it includes both reliable information (e.g., the locations of feeders) and unreliable (hypothetical) knowledge (e.g., the placement of other agents). The mutual assistance introduces ethical elements into agents’ behavior. Helping other agents, an agent can even discover the “meaning of life”. The added knowledge and good relationship with other agents make the world look more friendly to an agent (provided enough food). Though not all uncertainty is gone, and accidents will happen, the world becomes more understandable and complies with known rules (true, these rules allow exceptions). Things are somewhat different for an observer. It is almost impossible for him to monitor the exchange of information among agents. If there is more than one agent, the observer is not fully aware of agents’ knowledge and cannot foretell actions of every agent even though he knows the action-choosing algorithm. At first the observer’s unawareness grows with the increasing role of knowledge. When multiple exchanges of information lead to the leveling of agents’ knowledge and establishment of common “ethical” rules of agents’ behavior, actions of agents become more predictable for the observer. However, rigid determinism is out of question now. There are deviations from averaged “righteous” behavior. If the observer should watch agents’ behavior for a long time, he can draw conclusions about the habits and behavioral characteristics of each agent. After that, he can foretell only the usual actions of agents. One can’t help remembering “free will”. We should point out that the formation of uncommon behavior calls for a high information capacity and advanced intellect of agents. It is the uncommonness of an agent (due to the effect of knowledge on behavior) that doesn’t allow an observer to trace the initial determinism. If a supervisor took over from an observer, he would probably not see the difference. From the point of view of an observer, the violation of determinism for a primitive agent deprived of information can result from his unawareness of the working space topology (given several floors and inter-floor transfers). 
When watching agents, an observer finds out that in some cases agents interact (e.g., one of agents is liquidated), and in other cases they don’t seem to notice each other (walking by peaceably). The observer doesn’t realize that agents can be on different floors. He can think that
an agent can distinguish other agents and treat them differently. Or probably agents’ behavior is dependent on some hidden factors, including the “free will” of agents. The unawareness of the observer leads to wrong conclusions which then do not allow him to understand the situation. As we see, with insufficient information, an observer is also given to superstitions.
5 Conclusion

The model world we have considered seems rather primitive. Nevertheless, it demonstrates that many violations of determinism can be explained by the unawareness of the subjects. Neither the interference of external forces (including irrational ones), nor intricate phenomena like quantum uncertainty or quantum chaos are needed. Even a closed system with determinate laws can exhibit unpredictability. Both agents, which belong to the system, and outside observers, which don't belong to the system, are subject to unexpectedness. Being a result of restricted sensory abilities, insufficient information is a prerequisite for unpredictability. Subjects make hypotheses to explain unintelligible events. Some of the hypotheses are correct; many more of them are unsubstantiated. A hypothesis can be considered satisfactory even if it doesn't have enough evidence. Anyway, the gap between reality and supposition (perception) should be filled in. This is how wrong concepts and various superstitions are born.

The question arises: isn't unawareness a primary (if not the only) cause of the violation of determinism in real social systems? Of course, we can't exclude the importance of external factors. After all, real systems are not closed. By way of example, we can consider the effect of the weather and climate factor on a community of, say, bees or ants. But most external factors can be monitored and investigated. It is often easy to allow for this sort of factors. It is much harder to deal with inexplicable factors, some of which can't be investigated at all. It makes sense to take a closer look at real systems and try to exclude at least some of these factors.

Let us note the role of an observer. In our model his influence on the system is restricted to the introduction of agents (robots) into the system. The reverse influence is possible: the system can cause the observer to form false ideas which affect his mentality. In any case the observer has more information about the system than agents. In view of real social systems, it looks useful (from the standpoint of agents) to secure the passing of information from the observer to agents. In practice it can be done by appointing one (or several) agents as observers and providing them with the necessary technical means and livelihood (food, energy, etc.). Examples of such appointments are easy to find in the real world.

As for a supervisor, who is missing in our model, his efforts to control a rather complicated system require huge information resources. Neither an agent, nor an observer can cope with supervising functions. Of course, an agent can control environment changes (beyond the framework of the model). Yet an agent can't foretell the result of his actions for sure. The same is also true for an observer if he doesn't have complete information. It is possible to provide him with the ability to change the working space. Yet it will take time to carry out the necessary observations and understand whether the change gives the desired results. In general, the idea of a supervisor having complete information about
the system seems fantastic for real large systems. The complexity of a supervisor is higher than that of the system because the supervisor not only models the whole system, but also has arbitrary access to all the numerous variables that describe the state of the system. Is there any point in such redundant and expensive duplication? It would be more reasonable to restrict the awareness of the supervisor, allowing him to influence the system parameters. At the same time, the full predictability of the results of the supervisor’s activity can’t be secured. In most cases an efficient observer can substitute the imperfect supervisor like that. Funding. The work is financially supported by State Program of SRISA RAS No. FNEF-20220003.
References

1. Laplace, P.S.: Essai philosophique sur les probabilités. Courcier, Paris (1814)
2. Ott, E.: Chaos in Dynamical Systems. Cambridge University Press, Cambridge (2002)
3. Wiggins, S.: Introduction to Applied Dynamical Systems and Chaos. Springer, New York (2003)
4. Kiel, L.D., Elliott, E.W.: Chaos Theory in the Social Sciences. Perseus Publishing, New York (1997)
5. Russell, B.: Human Knowledge: Its Scope and Limits. Routledge (2009)
6. Bonabeau, E.: Agent-based modeling: methods and techniques for simulating human systems. In: Proceedings of the National Academy of Sciences of the United States of America, vol. 99, pp. 7280–7287 (2002)
7. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2008)
8. Gilbert, N.: Agent-Based Models. Sage Publications, Inc. (2007)
9. Genesereth, M., Ginsberg, M., Rosenschein, J.: Cooperation without communication. In: Proceedings of AAAI 1986, pp. 51–57. AAAI Press (1986)
10. Axelrod, R.: The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, Princeton (1997)
11. Red'ko, V.G., Sokhova, Z.B.: Model of collective behavior of investors and producers in decentralized economic system. Procedia Comput. Sci. 123, 380–385 (2018)
The Variable Resistor Under a High-Frequency Signal

Galina A. Beskhlebnova(B) and Vladimir B. Kotov
Federal State Institution “Scientific Research Institute for System Analysis of the Russian Academy of Sciences”, Moscow, Russia [email protected]
Abstract. Variable resistors (memristors) show promise as major building elements in the development of an artificial brain. Suitable means of controlling the resistor state can help overcome many problems in making practicable variable resistors. The use of high-frequency signals for controlling the resistor state is considered in the paper. An equation for the period-averaged state of a model resistor (simple resistor element) is found and studied. Under certain conditions the application of an alternating current or voltage is shown to cause the resistor to change its state, hence act as a memory cell. Stationary resistor states induced by the action of a strictly periodic signal are found. Though the use of a periodic signal gives a smaller range of resistor conductivity than the use of a unipolar signal, the conductivity proves to vary with the signal amplitude more smoothly, thus making the behavior of the resistor more predictable. In particular, it is generally possible to avoid the bistability which makes the recording process depend on the initial conditions. Keywords: Variable Resistor · AC Signal · Simple Resistor Element · Stationary States
1 Introduction Today there is an understanding of the principles of building artificial human-level intelligence [1]. Nevertheless, the practical implementation of such neuromorphic systems is held back by the unavailability of components needed to manufacture a large number of multiple-synapse neurons. The emulation of neuromorphic calculations with different types of processors doesn't solve the problem – a highly intellectual system turns out to be too expensive to manufacture and operate. Being simple, compact and energy efficient, variable resistors (memristors) stand out among the few candidates for building blocks of future artificial intelligence [2–4]. Resistors of this kind can change their resistivity during operation, basically under the action of the current flowing through them. Resistive elements can be regarded as carriers of information that is recorded and changed by electric signals. Though the quality of today's variable resistors is not sufficient for practical use, it makes sense to believe that existing difficulties can be overcome. In data recording, constant-polarity currents and voltages are usually used. This is connected with the behavior of the most interesting types of variable resistors: a positive current
(the direction of the current agrees with the resistor direction) tends to increase the resistor conductivity, while a negative current decreases it. However, it is important to carefully select the parameters of the recording system (especially when the range of resistor conductivities is very wide). The spread of parameters of variable resistors and the way these parameters change in operation have a strong effect on the results of recording. Moreover, unipolar signals are not very convenient to transmit - direct ohmic contact is needed. It looks reasonable to use high-frequency alternating signals. Note that this recording technique is not applicable to conventional memristors [5] – the changes in the memristor resistivity depend on the charge that has flowed through the resistor. This charge approaches zero as the signal frequency increases. Fortunately, real variable resistors are much different from the Chua memristor. The paper considers the behavior of the variable resistor under the influence of a high-frequency alternating signal. The signal frequency is thought to be high enough to exclude considerable changes of the resistor conductivity over one period of oscillation. We also assume that variable resistors are unidirectional and well described by the simple resistor element model [6, 7].
2 Equations Resistance R of a simple resistor element is expressed through the only state variable x: R = R(x). We assume that the state variable varies from 0 to 1. State x = 0 (ground state) is the state of highest resistance, and state x = 1 corresponds to the state of lowest resistance. The variation of the state variable complies with the equation:
dx/dt = F(x, I), (1)
where I is the current through the resistor. Let’s use the following representation [7] of function F: F(x, I ) = F0 (x) + Fx+ (x)FI+ (I ) + Fx− (x)FI− (I ).
(2)
Functions F0 (x), Fx± (x), FI± (I ) along with function R(x) are called characteristic functions of the resistor. The first (negative) term in the right side of (2) is responsible for spontaneous relaxation to ground state x = 0. The second term (it is positive when I > 0 and zero when I ≤ 0) corresponds to the increasing trend of variable x when the direction of the current matches the direction of the resistor. The third term (it is negative when I < 0 and zero when I ≥ 0) describes rapid relaxation to the ground state when the direction of the current is opposite to the direction of the resistor. The expression (2) can be regarded as main terms of the expansion of function F(x, I) with regard to current I. Herewith it is natural to assume that functions FI± (I ) are the following power functions: FI+ (I ) = A+ θ (I )I α+ , FI− (I ) = −A− θ (−I )(−I )α−
(3)
with positive coefficients A± and exponents α±. The unanalyticity of the expansion (2) results from the difference of the processes at positive and negative currents. The exponents in (3) are usually taken equal to unity, yet other values are also possible.
Characteristic functions F0 (x), Fx± (x) determine recording/erasing rates depending on the current state of the resistor. The knowledge of these functions allows us to draw conclusions about the details and the very possibility of data recording with an alternating signal. Functions F0 (x) = −f0 , Fx+ (x) = 1, Fx− (x) = 1
(4)
that are constant within the interval [0, 1] (f0 is a positive constant) can be taken as a very rough approximation. At the same time, such functions cannot describe changes in the resistor state in enough detail. It is convenient to use an approximation that explicitly takes into account the behavior of these functions near the boundary points x = 0 and x = 1. For F0(x) let us write the expression [7]: F0(x) = −f0(x)(x1 − x)^γ1 (x − x0)^γ0,
(5)
given x 0 ≤ 0, x 1 ≥ 1, f 0 (x) is a positive slowly changing function, γ 0 , γ 1 are nonnegative exponents. Function F 0 (x) is assumed to have no zeros within the interval (0, 1). We will use (5) with f 0 = const, x 0 = 0, x 1 = 1, γ 0 ≥ 1. F 0 (x) is zero when x = 0, and probably zero when x = 1 (if γ 1 > 0). For functions Fx± (x) we use the expressions: Fx+ (x) = (1 − x)β+ , Fx− (x) = xβ−
(6)
with β+ ≥ 1, β− ≥ 1. Note that the two last conditions together with the condition γ0 ≥ 1 secure that the representing point x does not leave the interval (0, 1).
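For illustration, the model of this section can be assembled into a short numerical sketch (Python) that integrates Eq. (1) with the representation (2) and the characteristic functions (3), (5), (6). All parameter values and the linear form of R(x) used below are our own illustrative assumptions, not values taken from the paper.

# Illustrative sketch of the simple resistor element (all parameters assumed)
f0, gamma0, gamma1 = 0.05, 1.0, 1.0               # relaxation term (5) with x0 = 0, x1 = 1
A_p, A_m, alpha_p, alpha_m = 1.0, 1.0, 1.0, 1.0   # current terms (3)
beta_p, beta_m = 1.0, 1.0                         # state factors (6)

def F(x, I):
    """Right side of Eq. (1) built from the representation (2)."""
    F0 = -f0 * x**gamma0 * (1.0 - x)**gamma1          # spontaneous relaxation, (5)
    FIp = A_p * I**alpha_p if I > 0 else 0.0          # FI+(I), (3)
    FIm = -A_m * (-I)**alpha_m if I < 0 else 0.0      # FI-(I), (3)
    return F0 + (1.0 - x)**beta_p * FIp + x**beta_m * FIm

# Euler integration of dx/dt = F(x, I) for a constant positive current
x, dt = 0.0, 1.0e-3
for _ in range(20000):
    x = min(max(x + dt * F(x, 0.2), 0.0), 1.0)
print("state after a constant positive current:", round(x, 3))

A negative current drives x back towards the ground state in the same way.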
3 Periodic Current Through the Resistor Let a periodic current flow through the resistor. By assumption the period is small enough to ensure insignificant changes in the resistor state. In this event function x(t) can be presented as
x(t) = x̄(t) + x̃(t), (7)
where x̄(t) is a slowly (as compared to the current oscillations) varying function, and x̃(t) is a small oscillatory addition (Fig. 1). As this addition approaches zero with the increasing frequency of oscillation, it can be neglected. We can get the equation for x̄(t) by averaging Eq. (1) over the oscillation period. It has the form:
dx̄/dt = F0(x̄) + M+ Fx+(x̄) + M− Fx−(x̄), (8)
where M+ = ⟨FI+(I)⟩, M− = ⟨FI−(I)⟩, and the angle brackets denote averaging over the oscillation period. For brevity, and in view of the smallness of the oscillatory part, we omit the bar above x in what follows.
Fig. 1. The change of the resistor state under the influence of high-frequency harmonic current
Given power characteristic functions (3), for a harmonic current I(t) = I0 sin(ωt + ϕ0) (I0 is the amplitude, ω is the circular frequency, ϕ0 is the initial phase) we get:
M+ = (κ(α+)/(2π)) A+ I0^α+, M− = −(κ(α−)/(2π)) A− I0^α−, (9)
where κ(α) = ∫0^π dϕ (sin ϕ)^α is a number of the order of unity for real values of the quantity α. In particular, κ(1) = 2, κ(2) = π/2, κ(3) = 4/3. Similar expressions can be written for a periodic signal of arbitrary waveform; in this event the numerical coefficients depend on the signal shape. According to (9), when α+ = α−, the ratio M+/M− is determined only by the ratio of the coefficients A+/A−. The sign of the right side of (8), which we denote G(x), defines the direction of change of the state variable x: with G > 0 variable x grows with time, with G < 0 it decreases. State x monotonously tends towards the stationary state (stationary point). The internal stationary points are roots of the equation
G(x) ≡ F0(x) + Fx+(x)M+ + Fx−(x)M− = 0. (10)
The stationary point is stable if to the left of it lies the area of growth of variable x (G > 0), and to the right of it lies the area of decline of variable x (G < 0). Otherwise, the internal stationary point is unstable. Besides internal stationary points, there can be boundary stationary points. Point x = 0 is a stable stationary point if to the right of it (with x > 0) G < 0. Boundary point x = 1 is a stable stationary point if to the left of it (with x < 1) G > 0.
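These statements are easy to check numerically. The sketch below is a rough illustration with assumed characteristic functions and parameter values: it evaluates M± from (9) for a harmonic current and finds the internal stationary point xst as the root of (10) by bisection.

import math

f0, gamma0, gamma1 = 0.05, 1.0, 1.0               # assumed relaxation term, as in (5)
A_p, A_m, alpha_p, alpha_m = 1.0, 1.0, 1.0, 2.0   # assumed current terms, as in (3)
beta_p, beta_m = 1.0, 1.0                         # assumed state factors, as in (6)
I0 = 0.3                                          # current amplitude

def kappa(alpha, n=2000):
    """kappa(alpha) = integral over (0, pi) of (sin phi)^alpha d phi (midpoint rule)."""
    h = math.pi / n
    return sum(math.sin(h * (k + 0.5))**alpha for k in range(n)) * h

M_p = kappa(alpha_p) / (2 * math.pi) * A_p * I0**alpha_p       # Eq. (9)
M_m = -kappa(alpha_m) / (2 * math.pi) * A_m * I0**alpha_m

def G(x):
    """Right side of (8); its roots are the internal stationary points, Eq. (10)."""
    return (-f0 * x**gamma0 * (1 - x)**gamma1
            + (1 - x)**beta_p * M_p + x**beta_m * M_m)

# G > 0 near x = 0 and G < 0 near x = 1, so the root is bracketed and found by bisection
lo, hi = 1e-6, 1 - 1e-6
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if G(mid) > 0 else (lo, mid)
print("stationary state x_st =", round(0.5 * (lo + hi), 4))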
Given constant characteristic functions (4), G is independent of x. The representing point moves at constant velocity G. When M+ > f0 − M− (the recording processes are more intensive than the erasing processes) the representing point moves to the right and comes to the boundary stationary point x = 1 over a finite time. When M+ < f0 − M− (the recording processes are less intensive than the erasing processes) the representing point moves to the left and arrives at the boundary stationary point x = 0 over a finite time. In the exceptional case of M+ = f0 − M− the representing point stands still: all points within the interval [0, 1] are indifferent equilibrium points. The direction of motion of the representing point in the three above-mentioned cases can be presented as the diagrams 0 ← 1, 0 → 1, 0—1. In the case of the more realistic characteristic functions (5), (6), we get G(0 + 0) > 0 and G(1 − 0) < 0 (assuming that −M− > 0, M+ > 0). The boundary points can't be stable stationary points. On the other hand, there is an internal stationary point. If (10) has a unique solution in the interval (0, 1), the corresponding stationary point is stable (the diagram 0 → xst ← 1). Can Eq. (10) have several roots? Though it is generally possible, cases of multiple equation roots are too exotic and not worth considering without good reason. In practically important cases, we can consider that there is only one (stable) stationary point xst, to which the representing point x tends asymptotically. Clearly, with I0 → 0, xst → 0. When the amplitude of the current is great, we can ignore the relaxation term in (10). As a result, we get the equation for the stationary point:
Fx−(x)/Fx+(x) = M+/(−M−). (11)
The left side of (11) changes from 0 to +∞ with x changing from 0 to 1. The right side is positive and independent of x. The greater the right side of (11), the closer to the boundary point x = 1 the stationary point x st . When α + = α – , the right side is independent of the current amplitude: x st stops depending on I 0 at high current. Typically, the limiting value of x st at A+ ≈ A− is located somewhere in the middle of the range (0, 1). If α + > α – , the right side of (11) increases with the amplitude of the current, and the stationary point approaches the boundary x = 1 with the growing I 0 . In the opposite case of α + < α – the right side of (11) tends to zero for large amplitudes I 0 . That is why with I 0 increasing from zero, x st first grows from the boundary value x = 0, reaches the maximum, and then decreases, approaching x = 0. Examples of dependencies x st (I 0 ) are shown in Fig. 2. In this section the current flowing through the resistor is considered to be given. In reality, the variable resistor is part of an electrical circuit, and it is signals at the input (poles) of the circuit that are given. Rather than currents, voltages are usually used as input signals.
Fig. 2. Functions xst(I0) for different ratios between the constants α+ and α−
4 Periodic Voltage Across the Resistor Let a variable resistor be coupled to a periodic voltage source u(t) through a series-connected fixed resistor with resistance r (connection resistor). The fixed resistor is used for limiting and measuring the input current. The current through the variable resistor is expressed in terms of the given voltage using Ohm's law:
I = u/(R(x) + r). (12)
With a periodic voltage u(t) the dependence I(t) is not strictly periodic because of the changes of resistance R(x). However, these changes are supposed to be slow enough to use the formulae for a periodic current. Unlike the previous section, the multipliers M± in the right side of (8) depend on the state variable x. Fortunately, given power characteristic functions (3), this dependence can be factorized. As a result, we get the following equation:
dx/dt = F0(x) + m+ Fx+(x) (ρ/(R(x) + r))^α+ + m− Fx−(x) (ρ/(R(x) + r))^α−, (13)
where m+ = A+ θ(u)(u/ρ)^α+, m− = −A− θ(−u)(−u/ρ)^α−, and ρ is a characteristic resistance which can be taken as ρ = R(0) + r. Resistance ρ is introduced to match dimensions. For m+, m− we can write formulae similar to (9); in particular, m+ ∼ u0^α+, m− ∼ −u0^α−, where u0 is the voltage amplitude. Equation (13) is like Eq. (8), yet (13) holds additional R(x)-dependent multipliers. These multipliers, which would be unity at R = const, become rather considerable when the change of the resistance is large: R(0) ≫ R(1). Equation (13), just like Eq. (8), describes the monotonous motion of the representing point x towards the stationary point xst. The internal stationary points are derived from the equation: G(x) = 0,
(14)
where G(x) is the right side of (13). The presence of the additional multipliers can result in additional stationary points appearing. Given constant characteristic functions (4) with α+ = α− = α, the right side of (13) takes the form:
G(x) = −f0 + (m+ + m−)(ρ/(R(x) + r))^α. (15)
The velocity of the representing point depends on x. When
((R(1) + r)/ρ)^α < (m+ + m−)/f0 < 1, (16)
Equation (14) has a solution in the interval (0, 1) giving the stationary point xst. Conditions (16) can be satisfied only when m+ > −m−, which is usually equivalent to the fulfillment of the inequality A+ > A− (the recording process is more effective than the erasing process). The internal stationary point is a point of unstable equilibrium: it separates the attraction regions of the boundary stable stationary points x = 0 and x = 1. The motion diagram of the representing point looks like 0 ← xst → 1. As long as inequalities (16) are met, xst decreases from 1 to 0 with the growing voltage amplitude, the attraction area of stationary point x = 1 widening. If the left inequality in (16) is violated (when the amplitude is small, or the erasing process prevails), there is only one stationary (stable) point x = 0. The relevant diagram looks like 0 ← 1. If the right inequality in (16) is violated (when the amplitude u0 is large), the only stationary (stable) point is x = 1, the pertinent diagram being 0 → 1. The presence of two stable stationary points in a certain amplitude interval allows us to speak about bistability. The possibility of bistability distinguishes the case of a given voltage from the corresponding case of a given current. In the case of the more realistic characteristic functions (5), (6) bistability is also possible. With m+ > 0, Eq. (14) always has a root in the interval (0, 1) because G(0) > 0, G(1) < 0. If this root is the only one, the corresponding stationary point is stable, and we have the diagram 0 → xst ← 1. With u0 → 0, xst → 0; at large amplitudes the dependence xst(u0) relies on the ratio of the quantities α+/α−, just as in the case of a given current (Fig. 2). If α+ > α−, the stationary point approaches the boundary x = 1 with growing u0. In the opposite case α+ < α−, the departure of the amplitude u0 from zero is accompanied by xst first growing from zero, reaching a maximum and then going down to zero. If α+ = α−, xst(u0) is a monotonously increasing dependence with xst tending to a limit value less than unity (given m− < 0). If there are two roots, there is another, boundary stationary point. The extreme stationary points are stable, while the middle point is not. The easiest way to get three stationary points is when m− = 0. This case corresponds to the case of a positive voltage, whose bistability is well known [7]. Here
G(x) = F0(x) + Aκ u0^α Fx+(x)/(R(x) + r)^α (17)
(symbol “ +” for the constants is omitted for brevity). Equation (17) can be easily solved with respect to u0 , which gives dependence u0 (x), the inverse of dependence
xst(u0):
u0(x) = (f0/(Aκ))^(1/α) x^(β0/α) (1 − x)^((β1−γ)/α) (R(x) + r), (18)
(the expressions for the characteristic functions are taken into account). If u0(x) is a monotonous function, xst(u0) is a single-valued function, so there is only one stationary (stable) point whatever the amplitude u0 is. If (18) is not a monotonous function, there are several stationary points (usually two or three) within a certain amplitude range. When β1 > γ, we have u0(0) = u0(1) = 0, and u0(x) > 0 for 0 < x < 1 (i.e., function u0(x) has a maximum). When 0 < u0 < um (um is the maximum of function (18)), there are two internal stationary points. Besides, boundary point x = 1 is also a (stable) stationary point. The motion diagram is 0 → xst1 ← xst2 → 1. When u0 > um, we have only one stationary point x = 1 (the diagram is 0 → 1). The opposite inequality β1 < γ is the more realistic case. In this event, the non-monotonousness of (18) occurs only when the full resistance R(x) + r falls with the growing x abruptly enough (otherwise, (18) is a monotonously increasing function). We leave step dependences R(x) securing non-monotonousness of (18) unexamined. We restrict ourselves to the linear dependence of the resistance on the state variable. Function (18) has a local maximum and minimum if γ < β1 + α and the resistance undergoes a considerable change: (R(1) + r)/(R(0) + r) ≪ 1 (Fig. 3). When umin < u0 < umax (umin, umax are the minimum and maximum values), we have three stationary (internal) points with the diagram 0 → xst1 ← xst2 → xst3 ← 1. When u0 < umin or u0 > umax, there is only one stationary point (the corresponding diagram is 0 → xst ← 1). When u0 → ∞, xst → 1.
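The non-monotonous case can be reproduced numerically. The sketch below evaluates (18) for a linear R(x) with a strong resistance change; the exponents and resistance values are illustrative assumptions chosen so that β1 < γ < β1 + α.

f0, A, kappa_a = 0.05, 1.0, 2.0                 # kappa_a stands for kappa(alpha); values assumed
alpha, beta0, beta1, gamma = 1.0, 1.0, 1.0, 1.5
R_0, R_1, r = 1.0e5, 1.0e2, 1.0e3               # strong change of resistance, R(0) >> R(1)

def R(x):                                        # linear dependence of resistance on the state
    return R_0 + (R_1 - R_0) * x

def u0(x):                                       # dependence (18)
    return ((f0 / (A * kappa_a)) ** (1 / alpha)
            * x ** (beta0 / alpha)
            * (1 - x) ** ((beta1 - gamma) / alpha)
            * (R(x) + r))

xs = [i / 1000 for i in range(1, 1000)]
us = [u0(x) for x in xs]
slopes = [b - a for a, b in zip(us, us[1:])]
extrema = sum(1 for s1, s2 in zip(slopes, slopes[1:]) if s1 * s2 < 0)
print("number of extrema of u0(x):", extrema)    # 2 -> a local maximum and a local minimum

Two extrema mean that any amplitude between the local minimum and the local maximum values of u0(x) corresponds to three internal stationary points, i.e., to the bistable regime discussed above.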
Fig. 3. Dependences u0 (x) for α + β1 < γ and α + β1 > γ (β1 < γ )
Quantity m– not being zero makes it difficult to get a few stationary points because the corresponding term (the third term in the right side of (13)) moves G(x) in the vicinity of boundary x = 1 into the negative range, making the most favorable range inaccessible to stationary points. The bistability is possible, but in rather exotic conditions. For
example, we can get more than one stationary point if we take a dependence R(x) with an abrupt local drop (a step function). Another possibility is the case of α+ > α− (the inequality should be satisfied with a solid margin), i.e., when the importance of the m−-containing term falls rapidly with amplitude, provided the contribution of the relaxation term remains significant. Typically (when the characteristic functions are of the forms (5) and (6)) there is a single stationary point for any amplitude value. Given α+ ≥ α−, xst(u0) is a monotonously growing function. With growing amplitude, xst tends to unity if α+ > α−, and to a limit value less than unity if α+ = α−. When α+ < α−, the rise of the amplitude u0 from zero is accompanied by xst(u0) first going up from zero to a maximum and then down to zero.
5 Conclusions So, it is quite possible to record information by changing the conductivity of a variable resistor using an alternating high-frequency signal. Recording can be done both by bringing the resistor to a stationary state, which depends on the amplitude of the alternating signal, and by small changes in the current state towards a stationary state. In the first method, the final state is independent of the initial state of the resistor and is determined only by the amplitude of the control signal (a current or voltage). In the second method, the final state depends on the initial state and on the control signal amplitude and duration (exposure). It should be noted that recording results in many cases may not be as considerable as in unipolar signal-based recording. For example, let xst ≈ 0.5. If the resistance is linearly dependent on the state variable and able to vary widely (R(0) ≫ R(1)), the recording results in the resistance changing by about a factor of two relative to the ground state. The change may seem insignificant against the greatest possible change. On the other hand, we achieve increased accuracy (and convenience) of resistor state control, which at least partly compensates for the reduced dynamic range. In general, it does not make much sense to seek a very wide conductivity range of a variable resistor if it comes together with unpredictability of its behavior. Precise control allows high recording capacity (a large number of levels) to be achieved with a small change in conductivity. The practical absence of bistability, which makes the recording result dependent on the initial state, speaks in favor of using a high-frequency signal to control a variable resistor. With a unipolar signal, bistability is often the case (given sufficiently great changes in conductivity). It can bring about the unpredictability of the behavior of a resistor. The use of an alternating control signal allows us to get rid of this drawback. In principle, we can also achieve the stability of properties and predictability of behavior using a unipolar signal for recording, if we confine ourselves to using resistors with a small conductivity range. However, working with an alternating signal gives us more opportunity – the inclusion of capacitors in the circuit. Capacitors integrate well with variable resistors. Moreover, the capacitor is initially present in the variable resistor – the metal electrodes surrounding the dielectric layer can be regarded as a capacitor connected in parallel with the variable resistor. It is technologically easy to fabricate a series-connected capacitor as well. The presence of capacitors makes the
result of the action of an alternating signal dependent on the signal frequency. This offers us additional, very accurate means of controlling the resistor state, which is very helpful when we deal with very large arrays of variable resistors. Funding. The work is financially supported by State Program of SRISA RAS No. FNEF-2022-0003.
References 1. Kotov, V.B.: Brain Building for Amateurs. Center for Innovative Technologies, Ltd., Moscow (2015). (in Russian) 2. Adamatzky, A., Chua, L.: Memristor Networks. Springer, New York (2014) 3. Vaidyanathan, S., Volos, C. (eds.): Advances in Memristors, Memristive Devices and Systems. Springer, New York (2017) 4. Ju, K.S., Kim, S., Jang, H.W.: Competing memristors for brain-inspired computing. iScience 24, 101889 (2021) 5. Chua, L.: Memristor-the missing circuit element. IEEE Trans. Circuit Theory 18, 507–519 (1971) 6. Kotov, V.B., Palagushkin, A.N., Yudkin, F.A.: Metaphorical modeling of resistor elements. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2019. SCI, vol. 856, pp. 326–334. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-30425-6_38 7. Kotov, V.B., Yudkin, F.A.: Modeling and characterization of resistor elements for neuromorphic systems. Optic. Memory Neural Netw. (Inf. Optics) 28(4), 271–282 (2019)
Modeling of Natural Needs of Autonomous Agents Zarema B. Sokhova(B) and Vladimir G. Red’ko Federal State Institution “Scientific Research Institute for System Analysis of the Russian Academy of Sciences”, Moscow, Russia [email protected]
Abstract. In this paper, a model of autonomous agents with basic biological needs is constructed and investigated. The population of agents functions in a cellular environment. Each agent has four needs: 1) safety, 2) food, 3) reproduction, and 4) exploration. The intensities of the needs form the agent's genotype and are expressed as values from the interval [0, 1]. The model was analyzed by computer simulation. It is shown that agents with motivations have advantages as compared with agents without motivations. The experiments also confirmed that the needs for food and reproduction are the most important for the survival of the population. Keywords: Autonomous Agents · Biological Needs · Biological Motivations
1 Introduction Biological needs are the basic needs of a living organism, necessary for its survival and normal functioning. A well-known physiological classification of needs was proposed by P.V. Simonov [1]. Simonov identified three groups of needs: 1) vital ("essential"), 2) zoosocial ("intraspecific interaction"), 3) self-development ("directed to the future"). At any given time, a living organism can have several needs active simultaneously. In this case, one of the needs can become the leading one. Then a motivation arises in the living organism – "an incentive, an impetus to purposeful behavior" [2]. This motivation directs the living organism to the satisfaction of the leading need. In this paper, we will build a model of autonomous agents with basic biological needs. Agent-based modeling, which is used in this work, is one of the methods of studying complex dynamic systems. Agents in such models interact with the environment, monitor the changes that occur in it and make decisions about how to act in a given situation [3, 4]. Agents can have the following properties: heterogeneity, limited rationality, location in space, the ability to learn, anthropomorphism, purposefulness, reactivity, sociality, availability of resource (energy) and memory [5–8]. Let us note some works in which modeling of autonomous agents with biological needs was carried out. In [9, 10], the authors proposed a model of an agent having four biological needs (hunger, thirst, urination, sleep) and four motivations corresponding to each of these
needs. The agent functions in an environment with an unlimited source of resources (food, water). The main goal of the authors is to develop an autonomous agent that has mechanisms that allow it to have its own needs and interests. Based on these needs, the agent dynamically selects and generates targets, which makes plausible behavior possible. There were also interesting works in this direction [11, 12]. The work [13] investigates the behavior of autonomous agents with food, reproduction, and safety needs and motivations that correspond to these needs. The agents in the model are trained using the reinforcement learning method. The paper demonstrates the formation of cycles of behavior in which the needs of food, safety and reproduction are consistently satisfied. Despite the fact that there is currently a significant amount of work on adaptive autonomous agents, there are few research models that explore basic biological needs. Questions about how a living organism will behave when it encounters different stimuli at the same time, whether the hierarchy of needs is the same in different individuals, how the personal experience of an animal affects the hierarchy of needs, remain open [14]. In our work, an attempt is made to answer the question – which hierarchies of needs are most stable in the community of artificial agents? For this purpose, a model of autonomous agents with “vital” biological needs, such as safety, food, reproduction and exploration, has been constructed and investigated.
2 Model Description Consider a population of autonomous agents functioning in a cellular environment. The number of cells is Nx·Ny, and the world is closed: if an agent moves one cell to the right from a cell with coordinates {Nx, y}, then it gets into the cell with coordinates {1, y}, and similarly for the other "borders" of the world. There can be only one autonomous agent in each cell of the environment. The time t is discrete: t = 1, 2, …. There are NF portions of food in the world. At the initial moment of time, the portions of food are randomly distributed among the cells of the world. If there is a portion of food in the cell at the same time as the agent, then the agent can eat. When feeding, the agent eats the entire portion of food. At the end of each time step, new portions of food appear in random cells to replace those eaten by agents, so the total number of food portions in the world is constant and equal to NF. Any cell can contain a predator. Moreover, if a predator is in a cell with an agent, then this agent "dies". The number of predators is fixed and equal to NP; the value of NP is a model parameter. Predators can move around the world. The rules for moving predators will be described in Sect. 2.2. Each predator has a resource Rp, which increases when the predator eats an agent and decreases when it performs other actions. If the resource of a predator becomes less than zero and there are agents in the world, then this predator dies and a new hungry predator appears in a random place in the world. If there are no agents in the world (the population of agents has died out), then the predator does not appear, and gradually the population of predators disappears. The agent sees the situation in its cell and in the four adjacent cells (right, left, top, bottom), namely, the agent sees whether there is food, another agent or a predator in these cells.
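The closed geometry of the world and the constant amount of food can be expressed compactly, as in the following Python sketch; the helper names and the coordinate convention are our own assumptions.

import random

NX, NY, N_F = 30, 30, 200     # world size and number of food portions (values from Sect. 3)

def wrap(cell):
    """Closed world: stepping over a border re-enters from the opposite side."""
    x, y = cell
    return ((x - 1) % NX + 1, (y - 1) % NY + 1)    # cells are numbered 1..NX, 1..NY

def replenish_food(food_cells, eaten_cells):
    """Portions eaten during a time step reappear in random cells,
    so the total number of portions stays equal to N_F."""
    food_cells -= eaten_cells
    while len(food_cells) < N_F:
        food_cells.add((random.randint(1, NX), random.randint(1, NY)))
    return food_cells

print(wrap((NX + 1, 5)))      # an agent moving right from {NX, 5} gets into {1, 5}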
Each agent has its own energy resource R, which is replenished when eating and lost when performing actions. An agent has four needs: 1) safety, 2) food (feeding), 3) reproduction, 4) exploration.
We will assume that the intensity of these needs for each agent is determined by the values PS , PE , PR , PI . These parameters make up the agent’s genotype G = {PS , PE , PR , PI }. Genotypes of agents are set at the beginning of the life of the population (at the beginning of the calculation) randomly and their intensities are uniformly distributed in the interval [0, 1]. The agent’s genotype determines the hierarchy of the agent’s needs. Further, we will denote the needs as follows: “1” – safety, “2” – food, “3” – reproduction, “4” – exploration. In total, the agent can have 4! = 24 different hierarchies, for example, such: 1,2,3,4 or 4,2,1,3, in which the 1st or 4th needs are the most priority, respectively. It is possible that because of evolution, agents with an incomplete hierarchy may appear in the population, that is, agents who lack one or more needs. In this case, the missing need will be denoted in the genotype by “0”. For example, 1,2,3,0 means that the agent has exploration need disabled, and 3,2,0,0 means that the agent has safety need and exploration need disabled. One of the objectives of this study is to find out which genotypes (hierarchies) will be more stable in a changing environment. 2.1 Motivations of Agents Each need corresponds to a certain motivation, which stimulates the action corresponding to the need. The success of an agent depends on the hierarchy of its needs. Let us take a closer look at how motivations work for agents. Food motivation M E is active if the energy resource level R ≤ R0 , where R0 is the optimal value of the energy resource. The level of active motivation M E is equal to the value of PE (Fig. 1a). Similarly, for the motivation of reproduction M R . At R ≥ R1 , the motivation for reproduction is active and equals PR (Fig. 1b), where R1 is the level of the energy resource required for reproduction. The motivation value corresponding to the M S safety need is 0 if the agent does not see a predator in its own and in the nearest 4 cells, and is equal to PS if the agent sees a predator in one of these cells (Fig. 1c). We assume that an autonomous agent immediately dies (is eaten by a predator) if it is in the same cell with a predator. Based on the “uncertainty reduction” hypothesis, which assumes that many living organisms are constantly motivated to explore the environment [15], we will assume that the motivation for exploring M I is active at any time step and is equal to PI (Fig. 1d). Thus, every step of time, the states of the agent’s internal and external environment are analyzed and certain motivations are activated. Then the agent identifies two motivations: 1) “dominant” [16] – motivation that corresponds to the leading need, and 2) motivation that follows the dominant. That is, the agent has two active motivations every step of time, let us denote them mf and ms, which determine his behavior, where
mf corresponds to the dominant, and ms follows the dominant. The dominant mf corresponds to the maximal motivation. The second ms motivation corresponds to the next largest motivation. For example, if an agent has a hierarchy of motivations of the type {M R , M E , M S , M I }, then the dominant mf corresponds to the motivation of reproduction M R , and ms corresponds to the motivation of food M E . If the agent cannot perform an action that satisfies the dominant motivation, then it will try to perform an action that satisfies the second motivation. If the second action cannot be performed, the agent will choose the “rest” action.
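A possible implementation of this choice of the dominant motivation mf and the second motivation ms is sketched below in Python; the function names and the tie-breaking rule are our assumptions.

R0, R1 = 10.0, 5.0    # thresholds for the food and reproduction needs (values from Sect. 3)

def motivations(genotype, resource, predator_nearby):
    """Current motivation levels M_S, M_E, M_R, M_I according to Fig. 1."""
    PS, PE, PR, PI = genotype
    return {
        "safety":       PS if predator_nearby else 0.0,
        "food":         PE if resource <= R0 else 0.0,
        "reproduction": PR if resource >= R1 else 0.0,
        "exploration":  PI,                    # active at every time step
    }

def dominant_pair(genotype, resource, predator_nearby):
    """Dominant motivation mf and the next-largest motivation ms."""
    m = motivations(genotype, resource, predator_nearby)
    ranked = sorted(m.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[1][0]

# Example: a hungry agent (R = 3) that sees a predator
print(dominant_pair((0.9, 0.7, 0.4, 0.2), resource=3.0, predator_nearby=True))
# -> ('safety', 'food') for this genotype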
Fig. 1. The time dependencies of the M E food motivation and the M R reproduction motivation on the agent resource R (a, b); the time dependence of the M S safety motivation on the presence of predators in neighboring cells (0 and 1 indicate whether there are predators in neighboring cells at a given time step) (c) and the time dependence of the M I explore motivation (d).
2.2 Actions of Agents Let us take a closer look at the possible actions of agents and predators. Predator Behavior. A predator can be in two states: 1) “hungry” or 2) “full”. The predator’s state is determined by the value of its internal resource: if Rp < R2 , the predator switches to the “hungry” state, if the resource is Rp > R3 , then the predator switches
to the "full" state. At the initial moment of time, predators are randomly distributed among the cells of the world. The initial resources of predators are evenly distributed in the interval [0, Pp]. If the predator's initial resource is less than R3, then its state is "hungry", otherwise it starts functioning from the "full" state. Predator behavior is described by the following rules. If a predator (hungry or full) sees an agent in its cell, it eats this agent. In this case, the predator's resource increases by the value of Rp. If the predator is hungry and there is no agent in its cell, but an agent is in one of the neighboring cells, then the predator moves into the cell with the agent and eats this agent (the predator's resource increases by the value of Rp). If the predator is full and there is no agent in its cell, then the predator does nothing and loses a small amount of the resource k. Runaway from a Predator. When an agent sees a predator in a neighboring cell, its safety motivation M S is activated. If this motivation becomes dominant, then the agent runs away from the predator into one of the nearest free cells (where there is no predator and no other agent). If there are no such free cells, then the agent may die if the predator moves into its cell in the next time step. It is most preferable to run away in the direction opposite to this predator. If there are no free cells, then an attempt is made to perform an action corresponding to the motivation ms, otherwise the agent does nothing, while its resource decreases by a small amount L. The value of L is a parameter of the model. Eating Food by an Agent. If an agent is in a cell with a portion of food and the dominant motivation is food, then it eats this portion completely. At the same time, its resource increases by the value of R. The value R is a parameter of the model. If there is no food in the cell, then the agent looks at the following ms motivation: it can be safety, reproduction or exploration. If it turns out that ms = 4 (exploration) and there is food in one of the neighboring cells (at the same time, the presence of a predator or another agent in the neighboring cells is checked), then the agent moves to the cell with food and eats. If there is no food in the neighboring cells, then an action corresponding to ms is performed, if this action can be carried out, otherwise the agent does nothing and loses a small part of the resource L. Reproduction. To reproduce an agent, we use replication – creating a copy of the agent. If the agent has a leading need for reproduction, then it replicates. In this case, a part of the agent's resource equal to qR is transferred to the descendant, where 0 < q ≤ 0.5. The genotype of the descendant G = {PS, PE, PR, PI} is equal to the genotype of the parent up to small mutations (each of the values PS, PE, PR, PI varies slightly: a random variable uniformly distributed in the interval [−PM, +PM] is added to it). The PM value is also a parameter of the model. In principle, it is possible to introduce different mutation intensities for different parameters, but in this version of the model we assume that these mutation intensities are the same for all parameters. The descendant agent is placed in a random free cell among the nearest 4 cells (if there is no other agent in them). If there is no such free cell, then a new agent is not born and there is a transition to the next largest ms motivation.
If it is possible to perform an action corresponding to the motivation of ms, then the agent performs it, otherwise the agent does nothing.
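Replication with mutation of the genotype, as described above, might look as follows in Python; clipping the mutated intensities back to the interval [0, 1] is our assumption.

import random

P_M, q = 0.05, 0.5     # mutation intensity and the resource share given to the descendant

def replicate(genotype, resource):
    """Copy the parent genotype with uniform mutations in [-P_M, +P_M]
    and transfer the share q of the parent's resource to the descendant."""
    child_genotype = tuple(
        min(1.0, max(0.0, p + random.uniform(-P_M, P_M)))   # assumed clipping to [0, 1]
        for p in genotype
    )
    child_resource = q * resource
    return child_genotype, child_resource, resource - child_resource

print(replicate((0.3, 0.8, 0.6, 0.1), resource=6.0))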
Exploratory Behavior. If the exploratory motivation has become the leading one, then the agent moves to one of the nearest four free cells (if there is no other agent or predator in it) in a random direction. If there is no such free cell in the immediate environment, then there is a transition to the next ms motivation. If exploration is impossible, then there are no free cells nearby, and it will not be possible to satisfy the motivations of safety and reproduction either. Only the food motivation remains. If the agent's next motivation after the leading one is the food motivation and there is food in the cell, then the agent eats the food. If the next motivation is safety or reproduction, then the agent does nothing. Note that two agents can select the same cell to move to in one time step; in this case, the order in which agents are chosen to act is important. In the described variant of the model, the agents were chosen to act according to the order of their appearance in the population; agents with smaller numbers had the advantage. Resource Consumption per Action. After performing any action, the agent's resource decreases by L. The Death of an Agent. The agent dies if a predator eats it, or if its resource becomes less than 0.
3 Results of Computer Simulation Let us describe the results of computer experiments obtained for the model described above. The main parameters of the simulation were as follows: the initial number of agents in the world N A = 100; the number of time steps N times = 3·103 ; the intensity of mutations for agents PM = 0.05; the number of cells horizontally in the world N x = 30; the number of cells vertically in the world N y = 30; the number of cells with food in the world N F = 200; the number of predators in the world N P = 100 or 110; the increase in the agent’s resource at eating R = 1.0; the amount of decrease of agent’s resource at performing any actions L = 0.01; the threshold for nutritional requirement R0 = 10; the threshold for reproduction needs R1 = 5; the coefficient characterizing the part of the resource that the parent agent gives to the descendant agent q = 0.5; the thresholds that determine the switching of the predator between states “hungry” and “full”: R2 = 4 and R3 = 5, respectively; the amount of decrease of predator’s resource when performing movement or rest k = 0.1. All experimental results presented in this section are averaged over 1000 different calculations. The size of the world is 900 cells. First, let us look at how motivations affect the behavior of agents. We will conduct experiments for two populations: 1) for a population of agents with motivations and 2) for a population of agents without motivations. When choosing an action, an agent with motivations is guided by what hierarchy of motivations it has for the current situation, and agents without motivations choose an action randomly. Note that if an agent’s random action without motivation is reproduction, but at the same time its resource is less than the threshold R1 , then the agent rests, since there is not enough resource for multiplication. Figure 2 presents the dynamics of the number of agents in models with and without motivations. In this case, the number of portions of food in the world is small. It can
be seen that agents with motivations function successfully in contrast to agents without motivations.
Fig. 2. Dynamics of the number of agents in the population with and without motivations, N F = 200, N P = 100.
Below, we study the results of functioning of agents with motivations. The following experiment demonstrates the impact of turning off one of the needs on the population dynamics. Figure 3 shows the simulation results 1) for agents with all needs, 2) for agents with the safety need turned off, and 3) for agents with the exploration need turned off.
Fig. 3. Dynamics of the number of agents with motivations for three cases: 1) agents with all needs, 2) agents without safety needs, 3) agents without exploration needs, N F = 200, N P = 100.
The number of agents in the model with the safety need turned off is significantly lower than in the model with all needs, and in the model with the exploration need turned
off, agents do not survive. When the need for food or reproduction is turned off, agents also do not survive. The effect of the needs was also tested for a higher number of food portions in the world (N F = 800). As a result, populations in which the safety or exploration needs are disabled survive, while populations in which agents have no food or reproduction needs die out. It can be concluded that the needs for food and reproduction are the most important for the survival of the population. Figure 4 shows how the average level of needs in the population of agents with motivations changes over time. At initial time moments, agents can move freely, since many cells are not occupied; in this case, there are more agents whose leading need is safety. Then the agents reproduce, and there are fewer free cells, the need for safety is more difficult to satisfy, and reproduction and feeding come to the fore (Fig. 4(a)). At the same time, more agents appear with the leading needs for food and reproduction. It seems somewhat controversial that the need for reproduction increases with a small number of free cells. Let us look at a simple example to explain this. Let there be a group of agents which occupy several adjacent cells. Some agents have the leading need "reproduction", and the other part "exploration". If one of the cells is released (the agent in this cell dies), then the agent with the leading need for exploration can move into it, and the freed cell will then be occupied by the descendant of the agent that has the leading need for reproduction. Thus, there will be more agents with a need for reproduction in this area, while an agent with a leading need for exploration may die without leaving a descendant. In Fig. 4(b) the number of predators is N P = 110, and the average level of the need for safety is greater than the levels of the other needs. Figure 5 shows how the ratio of hungry and full predators changes. In this experiment, full predators dominate. With an increase in the number of predators, a different picture is observed; in this case, the number of hungry predators is greater than the number of full ones. Figure 6 shows the distribution of agents by possible genotypes in the community for N P = 100 in the initial and final (at t = 3000) populations. The total average number of agents in the population in this case is 789. In this case, 170 agents have the genotype "3214", and 131 agents have the genotype "3124", that is, agents with these genotypes replicated more often. There are also agents that have safety or feeding first. Note that agents have appeared in the population that have no need for exploration, for example, "3120" and "3210". With an increase in the number of predators (N P = 110), the total number of agents in the population is small: 58 agents. Most of them have the genotypes "1234" and "1324", that is, safety is preferable, while feeding, reproduction and exploration also take place.
Fig. 4. Dynamics of the average level of needs for feeding, exploration, reproduction and safety for a community of agents with motivations. For two cases (a) N P = 100 and (b) N P = 110. N F = 200.
Fig. 5. The ratio of hungry and full predators. N F = 200, N P = 100.
Fig. 6. The distribution of agents by genotypes in the community of agents with motivations, N F = 200, N P = 100.
4 Conclusion Thus, the paper presents a model of autonomous agents with basic biological needs and motivations. The results of this study show that the needs for food and reproduction are the most important for the population. The significant role of the exploration and safety needs is also demonstrated. According to the authors, it is possible to improve the model by adding the ability for autonomous agents to build a map of the area during exploratory behavior and to form a knowledge base for later use. In this work, agents obeyed their biological needs, but in nature, living organisms can also learn from their own experience and interact with each other. Adding the possibility of learning and social interaction (for example, to warn about predators or to share a resource) can significantly increase the success of agents. Funding. The work is financially supported by State Program of SRISA RAS No. FNEF-2022-0003.
References 1. Simonov, P.V.: A motivated brain. Higher nervous activity and natural science foundations of general psychology. Piter, St. Petersburg (2023). (In Russian) 2. Lakomkin, A.I., Myagkov, I.F.: Biological needs and motivations. Voronezh, VSU (1980). (In Russian) 3. Red’ko, V.G.: Modeling of cognitive evolution: toward the theory of evolutionary origin of human thinking. URSS, Moscow (2018) 4. Nepomnyashchikh, V.A.: Animats as a model of animal behavior. IV All-Russian Scientific and Technical Conference “NEUROINFORMATICS-2002”. Materials of the discussion “Problems of intelligent control − system-wide, evolutionary and neural network aspects”. MEPhI, Moscow (2003). (In Russian) 5. Gilbert, N.: Agent-Based Models. Sage Publications, Thousand Oaks (2007) 6. Hayes-Roth, B.: An architecture for adaptive intelligent systems. Artif. Intell.: Spec. Issue Agents Interact. 72, 329–365 (1995) 7. Wooldridge, M., Jennings, N.: Intelligent agent: theory and practice. Knowl. Eng. Rev. 10(2), 115–152 (1995) 8. Maes, P.: Artificial Life meets entertainment: life like autonomous agents. Commun. ACM 38(11), 108–114 (1995) 9. Avradinis, N., Panayiotopoulos, T., Anastassakis, G.: Modelling basic needs as agent motivations. Int. J. Comput. Intell. Stud. 2(1), 52–75 (2013) 10. Avradinis, N., Panayiotopoulos, T., Anastassakis, G.: Behavior believability in virtual worlds: agents acting when they need to. Springerplus 2, 246 (2013) 11. Sutton, A.K., Krashes, M.J.: Integrating hunger with rival motivations. Trends Endocrinol Metab 31(7), 495–507 (2020) 12. Burnett, C.J., Funderburk, S.C., Navarrete, J., et al.: Need-based prioritization of behavior. Elife 8, e44527 (2019) 13. Koval, A.G., Red’ko, V.G.: Behavior of model organisms with natural needs and motivations. Math. Biol. Bioinf. 7(1), 266–273 (2012). (In Russian) 14. Barajas-Azpeleta, R., Tastekin, I., Ribeiro, C.: Neuroscience: how the brain prioritizes behaviors. Curr. Biol. 31(19), R1125–R1127 (2021)
15. Nepomnyashchikh, V.A.: Relationship between autonomous and adaptive behavior in artificial agents and animals. In: Red’ko, V.G., Collection of Scientific Papers: Approaches to Modeling Thinking, LENAND, Moscow (2019). (In Russian) 16. Ukhtomsky, A.A.: Dominant. Articles of different years 1887–1939. St. Petersburg (2002). (In Russian)
Study of Modifications of Gender Genetic Algorithm Gavriil Kupriyanov1(B) , Igor Isaev2,3
, and Sergey Dolenko2
1 Faculty of Physics, M.V. Lomonosov Moscow State University, Moscow, Russia
[email protected]
2 D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State University,
Moscow, Russia [email protected], [email protected] 3 Kotelnikov Institute of Radio Engineering and Electronics, Russian Academy of Sciences, Moscow, Russia
Abstract. This study compares several modifications of a gender genetic algorithm (GGA). Aside from the difference between the genders in the probability of mutation, we introduce two additional modifications: different implementations of selection and different laws of dependence of the probability of mutation on the gene number within a chromosome. We use four test optimization problems in spaces of various dimensions to compare conventional GA, conventional GGA, and GGA with the additional modifications implemented separately or together. It is demonstrated that the proposed additional modifications outperform conventional GA and conventional GGA in the achieved value of the fitness function, especially in high-dimensional spaces. With an increase in the problem dimension, they degrade more slowly. Also, the new modifications prevent premature convergence of the algorithm. Keywords: Genetic Algorithm · Gender Genetic Algorithm · Evolution Modeling · Optimization
1 Introduction Optimization problems are a specific class of problems, which are computationally expensive and depend much on the properties of the fitness function (FF) used to assess the quality of the solution. With increasing dimension of the optimization space (increase in the number of adjusted parameters), all optimization methods degrade with respect to the quality and computational cost of the solution. There are several classes of optimization methods differing by their properties. Local evaluation methods impose no limitations on the calculation of the FF, but they have low computational efficiency and highly depend on the starting point. Local gradient methods are highly efficient, but they are prone to getting stuck in local minima, and they also highly depend on the starting point. Exhaustive search methods ensure the finding of the global extremum, but their computational cost is usually unacceptable, especially in high-dimensional spaces.
A possible compromise lies in the use of population-based methods. In these methods, a number of possible solutions of the problem are considered in parallel, thus providing a significant reduction in the risk of premature convergence into a local extremum and an efficient search of better areas of the search space. One of the most popular subclasses of population-based methods are genetic algorithms (GA) [1], based on the same fundamental principles as natural evolution according to Darwin's theory: survival of the fittest (natural selection) and inheritance of properties. The basic components of GA are the procedure of selection and the genetic operators of crossover and mutation. In conventional GA, all individuals are treated equally during selection and application of genetic operators. However, there is some contradiction between the necessity of preserving the achievements of the population (and searching in the vicinity of the already found extrema) and the need to explore new areas in the search for possibly better solutions. A higher mutation rate ensures better exploration capabilities, but contradicts the strategy of preserving the achievements. The strategy of elitism (passing one or several best individuals to the next generation without applying any genetic operators) does not fully compensate for the negative effect of a high mutation rate on the ability of the population to converge. In nature, in higher mammals this problem is solved by dividing the individuals into two genders differing by their attitude to selection and to reproduction. The crossing is performed between individuals of different genders, and selection is performed within each gender separately. The typical mutation rate for males is usually severalfold higher than for females, thus making females responsible for preservation of the achievements, and males responsible for exploration. Gender genetic algorithms (GGA) model this paradigm in the evolution of the solutions of optimization problems. The whole population is divided into two groups (genders), one of which is intended to preserve the acquired useful features (female individuals), while the other one is aimed at exploring new solution areas (male individuals). Primarily, the desired effect is achieved by setting the probability of mutation significantly different for different genders (much higher for the male gender). Another significant feature that may distinguish the genders from each other is the procedure (operator) of selection. The variety of modifications of the selection operator in the literature is quite large. Selection operators may consider the age of an individual [1]; the fitness values of the parents of the individual [1–3]; the relationship between individuals [4]; the proximity of the crossing individuals in the space of the optimized parameters [3]; and other properties of the parents [5]. The use of different selection procedures for male and female individuals has been reported in [6, 7]. Various other realizations of GGA have also been suggested and investigated [8–10]. Diploid variations of GGA should be noted separately; they were intended to model in the algorithm the biological evolution of diploid organisms [11, 12]. In this study, besides the initial difference between the genders in the mutation rate, we test two additional modifications of GGA, broadening the difference between the genders. The modifications are tested on four well-known multi-extremum test functions in spaces of increasing dimension.
2 Modifications of Gender Genetic Algorithm 2.1 Conventional Genetic Algorithm The version of the conventional GA used in this study provided sequential application of selection, crossover, and mutation operators, and also included the elitism strategy. We consider GA with continuous chromosomes, with N-bit binary encoding. This means that the range between minimum and maximum allowed values of each parameter is encoded by binary representation of 2N equally spaced values; the search is performed in the space of these values. The selection operator used in conventional GA for all individuals was the roulette wheel operator, in which the probability of selection is proportional to the value of the FF of the individual (the better is the FF, the higher is the probability of the individual to be selected). The crossover operator used was the single-point crossover, in which the corresponding chromosomes of the two selected parents are divided into two parts in the same randomly selected point; the higher part of one parent and the lower part of the other one are combined to form the chromosome of the offspring. Each chromosome of an individual (corresponding to one of the optimized parameters) is crossed independently. The mutation operator was the single-bit inversion applied with some small probability to one of the chromosomes of the offspring. The elitism strategy provided passing of a small fraction of the individuals with the best FF values in the population to the next generation unchanged. 2.2 Gender Genetic Algorithm The basic version of the GGA used in this study included the division of population half by half into two genders with different mutation probabilities. Female individuals implementing the preservation strategy had a low probability of mutation, while the male individuals implementing the exploration strategy had a high probability of mutation (Table 1). The mutation operator used for GGA was the same single-bit inversion as for conventional GA. The selection was performed separately for each gender. Opposite to our preceding studies [13, 14], no limitation was imposed on the number of selections of any individual within one generation. The crossover was always performed between individuals of different genders. Each crossover produced two offspring (one male and one female chosen at random between the two). The elitism strategy was implemented in the same way as in conventional GA separately for each gender. 2.3 Selection Operator Modification In this study we consider the following modification of the selection operator. As described above, the roulette wheel (RW) selection operator consists in selecting an individual with a probability proportional to its FF value (Fig. 1). This contributes to
the preservation of the found good solutions and to the fast convergence of the algorithm; however, it negatively affects the diversity of the population, which leads to getting stuck in local extrema more often. The tournament selection operator consists in randomly taking a certain number of individuals and choosing the one with the best FF value among them (Fig. 1). This contributes to increasing the diversity of the population, increasing the probability for individuals with low FF values to move into the next generation, but slows down the convergence of the algorithm. Thus, in the modified version of selection, for females implementing the preservation strategy, the RW selection operator was used, and for males implementing the exploration strategy, the tournament selection operator was used (Table 1). 2.4 Mutation Operator Modification The modification of the mutation operator was based on the fact that for binary chromosomes, a mutation in a higher bit changes the solution more strongly than a mutation in a lower bit. Thus, for female individuals implementing the preservation strategy, the probability of mutation in the lower bits was set higher than in the higher bits, and vice versa for males. In order to maintain a uniform distribution of the mutation probability for the population as a whole, in our study we used linearly decreasing and linearly increasing mutation probabilities with increasing bit number for female and male individuals, respectively (Fig. 1) (Table 1).
Fig. 1. Selection and mutation operators.
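To make the gender-specific operators concrete, the following sketch shows one possible implementation of roulette wheel selection (used for females and in conventional GA), tournament selection (used for males in GGA-MS and GGA-MMS), and single-bit inversion with a bit-position-dependent choice of the mutated bit (GGA-MM and GGA-MMS). This is an illustrative reconstruction rather than the authors' code; the tournament size, the chromosome representation (a list of 0/1 values), and the exact form of the linear bit-selection profile are our assumptions.

```python
import random

def roulette_wheel(individuals, fitness):
    # Fitness-proportional (RW) selection: the better the FF,
    # the higher the probability of being selected.
    total = sum(fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for ind, f in zip(individuals, fitness):
        acc += f
        if acc >= r:
            return ind
    return individuals[-1]

def tournament(individuals, fitness, size=3):
    # Tournament selection: take `size` random individuals and keep the best.
    # The tournament size is an assumption (not reported in the paper).
    contenders = random.sample(range(len(individuals)), size)
    return individuals[max(contenders, key=lambda i: fitness[i])]

def mutate(chromosome, p_mut, male):
    # Single-bit inversion applied with probability p_mut. For the modified
    # mutation operator the position of the inverted bit is drawn from a
    # linear distribution over bit numbers: increasing towards the high
    # (most significant) bits for males and decreasing for females.
    if random.random() >= p_mut:
        return chromosome
    n = len(chromosome)                      # e.g. 25 bits per parameter
    weights = list(range(1, n + 1))          # index 0 = least significant bit
    if not male:
        weights.reverse()
    pos = random.choices(range(n), weights=weights, k=1)[0]
    flipped = list(chromosome)
    flipped[pos] ^= 1
    return flipped
```

In GGA the two genders then combine these operators with the gender-specific mutation probabilities listed in Table 1.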
3 Computational Experiment 3.1 Test Optimization Problems As test problems, we used the Ackley and Griewank functions, which are characterized by a large number of distinct local extrema, as well as the Bohachevsky and Rosenbrock functions, which are characterized by a large number of local extrema with close values of coordinates and amplitudes [15]. All the functions were normalized so that the best FF value was equal to 1. To test all the considered modifications of GA for tasks of varying complexity, these functions were considered with dimensions of 2, 4, 8, 16, and 32.
3.2 Evaluation Metrics The following statements of an optimization problem are possible:
1. Find the very best solution in a nearly unlimited time;
2. Find a satisfactory solution in the shortest time;
3. Find a reasonably good solution in a reasonable time.
Thus, we used the following metrics to evaluate the work of the algorithms:
• The number of generations processed by the algorithm up to the generation of convergence, when the stopping criterion was fulfilled.
• The best value of the FF obtained at the generation of convergence.
• The general dynamics of the FF during the specified number of generations.
3.3 Statement of the Computational Experiment During the computational experiment, the following variants of genetic algorithms were compared:
• GA – conventional genetic algorithm
• GGA – ordinary gender genetic algorithm
• GGA-MS – gender genetic algorithm with modified selection operator
• GGA-MM – gender genetic algorithm with modified mutation operator
• GGA-MMS – gender genetic algorithm with both modified selection operator and modified mutation operator
The parameters of the genetic algorithms are shown in Table 1. The results of each experiment were averaged over 100 independent runs, which were also used to calculate the standard deviation.
Table 1. The parameters of the considered variants of genetic algorithms.
• Population size: 100 individuals (all variants)
• Elite individuals: 6% (all variants)
• Stop criterion: less than 10^-6 increase in the FF maximum value over the last 20 generations (all variants)
• Chromosomes: continuous chromosomes with binary encoding, 25 bits (all variants)
• Crossover operator: single-point, two offspring (all variants)
• Selection operator: RW (GA, GGA, GGA-MM); males: tournament, females: RW (GGA-MS, GGA-MMS)
• Mutation operator: single-bit inversion (all variants)
• Mutation probability: 1% (GA); males: 5%, females: 0.1% (GGA, GGA-MS, GGA-MM, GGA-MMS)
• Mutation probability over bits: uniform for all bits (GA, GGA, GGA-MS); linearly increasing for males (decreasing for females) with bit number (GGA-MM, GGA-MMS)
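The stop criterion from Table 1 can be expressed as a small helper applied to the history of the best FF value per generation; this is a sketch of one possible reading of the criterion, not the authors' code.

```python
def has_converged(best_ff_history, eps=1e-6, window=20):
    # True if the maximum FF value has grown by less than eps
    # over the last `window` generations.
    if len(best_ff_history) <= window:
        return False
    return best_ff_history[-1] - best_ff_history[-1 - window] < eps
```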
4 Results Figure 2 displays the dynamics of the values of the normalized FF for two of the test functions, Ackley and Griewank, in their 32D implementation. Vertical lines mark the generation of convergence (when the stop criterion is achieved). However, to observe the full dependence of the FF on the number of generations, evolution was continued after convergence up to the same total number of generations, until the level at which the FF values saturate became clear. Light colored corridors around the lines show the standard deviation over the 100 runs of each algorithm. Both functions are characterized by a large number of narrow peaks situated quite close to each other. For the Ackley function, the area of the global extremum is also narrow, while for the Griewank function the shape of the envelope passing through the set of neighboring extrema around the global one is quite flat. (Note that in this respect the two other test functions also lack a narrow area of global extremum.) It can be seen that the dynamics of the FF is different for different types of functions. For the Ackley function, with its narrow area of global extremum, we observe premature convergence of all the variants of GA. This is evidence that for such functions to achieve the goal in the third statement (find a reasonably good solution in a reasonable time), the stopping criterion should be less strict (e.g. no increase in the FF maximum value at all during a larger number of generations).
Fig. 2. Dynamics of the normalized fitness for 32D Ackley and Griewank functions. Vertical lines mark the generation of convergence.
Conventional GA converges first; if evolution is continued, improvement in the achieved FF value happens most slowly, and the curve tends to the lowest value, farthest from the global extremum. The fastest FF growth is demonstrated by GGA-MMS, suggesting the use of this strategy for the second problem statement (find a satisfactory solution in the shortest time). The best dynamics is observed for GGA-MS: FF growth is quite fast, and the modified selection efficiently prevents degeneration of the population, causing the latest convergence and the highest achieved level of FF. Conventional GGA degenerates quickly; however, in the case of a soft convergence criterion and long evolution it allows achieving high FF values. For the Griewank function and other functions with a wide area of global extremum the situation is to some extent the opposite. No premature convergence is observed for any variant of the algorithm; the stopping criterion turns out to trigger in due time, when the detected maximum FF value stops growing. Conventional GA performs worst again: it converges last and at the lowest FF level. The best results (fastest convergence to the highest FF levels) are demonstrated by both algorithm variants with modified mutation. Now, let us compare the behavior of the GA variants when the complexity of the optimization problem increases, that is, in spaces of growing dimension (Fig. 3), using the Ackley function as an example. We can see that with increasing complexity of the problem (the dimension of the optimization space) all the variants of GA degrade, taking a greater number of generations to converge and finishing evolution with FF values further from those of the global extremum. However, the various GA variants behave in different ways. The best FF values are achieved with the variants using modified mutation (GGA-MM and GGA-MMS). In the most complex 32D space this requires the largest number of generations. In the spaces of lower dimension, GGA-MMS, employing both selection and mutation modifications, is one of the leaders with respect to the number of generations required to converge.
Fig. 3. Generation of convergence (left) and normalized fitness value achieved (right) in the spaces of growing dimension. Standard deviations are calculated over 100 runs of each algorithm.
It is interesting to display both parameters (normalized fitness and convergence generation) as one parametric diagram, with the dimension of the problem space as the parameter. This is done in Fig. 4 for all four functions considered; the direction of growth of the task dimension is indicated by an arrow. Such diagrams are very clear: the curves for better performing variants of GA go higher, achieving better FF values in the same number of generations, and further to the right, demonstrating slower degradation of the population and/or slower convergence. The parametric diagrams demonstrate an obvious leader (GGA-MM), with the highest values of FF achieved, and an obvious outsider (conventional GA), with the lowest FF values and either quick degradation or very slow convergence. On the whole, modified mutation strengthens the effect of specialization of the genders, allowing the population to achieve better FF values. Modified selection accelerates algorithm convergence, possibly at the expense of a reduction in the FF value. Therefore, the recommended use of the suggested modifications of GGA depends on the target statement of the optimization problem. If the aim is the quality of the solution, the recommended variant is GGA-MM. If an acceptable solution needs to be obtained at the lowest computational expense, the recommended variant is GGA-MS. Finally, a reasonable compromise may be achieved by using GGA-MMS.
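A parametric diagram of this kind could be produced, for instance, as follows; the `results` structure (one list of (convergence generation, achieved FF) pairs per GA variant, ordered by problem dimension) is a hypothetical format introduced only for illustration.

```python
import matplotlib.pyplot as plt

def parametric_diagram(results):
    # results: {variant_name: [(convergence_generation, best_ff), ...]},
    # with one point per problem dimension (2, 4, 8, 16, 32).
    for variant, points in results.items():
        gens, ffs = zip(*points)
        plt.plot(gens, ffs, marker="o", label=variant)
    plt.xlabel("generation of convergence")
    plt.ylabel("normalized fitness achieved")
    plt.legend()
    plt.show()
```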
Fig. 4. Parametric dependence between generation of convergence and achieved normalized FF value using problem space dimension as the parameter.
5 Conclusion Gender genetic algorithms are one of the classes of population-based optimization methods, which introduce specialization of the individuals by the concept of two genders. The female gender is responsible for preserving the obtained achievements, while the male gender is responsible for diversity and exploration of the search space. In this study, we have checked three modifications of gender genetic algorithms, developing and strengthening this idea. One is modified selection, when the type of selection operator depends on the gender of the selected individual (fitness proportional roulette wheel selection for females and tournament selection for males). The other one is modified mutation of single-bit inversion type, in which the probability of mutation of a bit depends on its digit for continuous chromosomes with binary encoding. Finally, the third modification consists in applying both described modifications together. All modifications were compared with each other, with unmodified GGA and with conventional genetic algorithm with no gender at the examples of four well-known artificial multi-extremum functions in the spaces of increasing dimension.
It has been demonstrated that all the three modified variants of GGA improve the results of optimization in comparison with conventional genderless GA and with conventional unmodified GGA. Modified mutation improves the quality of solution, converging to solutions with better values of fitness function. Modified selection speeds up convergence of the algorithm while reducing the risk of degeneration of the population. Simultaneous use of both modifications results in obtaining a solution of reasonable quality at a reasonable computational expense. Further studies should include testing the GGA modifications and the main conclusions made in this paper on a wide set of artificial and real-life optimization problems.
References 1. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning, 13th edn. Addison-Wesley, Boston (1989) 2. Zhang, M., Zhao, S., Wang, X.: A hybrid self-adaptive genetic algorithm based on sexual reproduction and baldwin effect for global optimization. In: 2009 IEEE Congress on Evolutionary Computation, pp. 3087–3094. IEEE (2009). https://doi.org/10.1109/CEC.2009.498 3334 3. Huang, F.L.: Towards the harmonious mating for genetic algorithms. Adv. Mater. Res. 255, 2013–2017 (2011). https://doi.org/10.4028/www.scientific.net/amr.255-260.2013 4. Ramezani, F., Lotfi, S.: IAMGA: intimate-based assortative mating genetic algorithm. In: Panigrahi, B.K., Suganthan, P.N., Das, S., Satapathy, S.C. (eds.) SEMCCO 2011, LNCS, vol. 7076, pp. 240–247. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-271724_30 5. Drezner, Z., Drezner, T.D.: Biologically inspired parent selection in genetic algorithms. Ann. Oper. Res. 287(1), 161–183 (2019). https://doi.org/10.1007/s10479-019-03343-7 6. Wagner, S., Affenzeller, M.: SexualGA: gender-specific selection for genetic algorithms. In: Proceedings of the 9th World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI), vol. 4, pp. 76–81 (2005) 7. Ansótegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm for the automatic configuration of algorithms. In: Gent, I.P. (ed.) CP 2009. LNCS, vol. 5732, pp. 142–157. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04244-7_14 8. Shukla, N., Tiwari, M.K., Ceglarek, D.: Genetic-algorithms-based algorithm portfolio for inventory routing problem with stochastic demand. Int. J. Prod. Res. 51(1), 118–137 (2013). https://doi.org/10.1080/00207543.2011.653010 9. Holzinger, A. et al.: Darwin, Lamarck, or Baldwin: Applying evolutionary algorithms to machine learning techniques. In: 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 2, pp. 449–453. IEEE (2014). https://doi.org/10.1109/WI-IAT.2014.132 10. Kowalczuk, Z., Białaszewski, T.: Gender approaches to evolutionary multi-objective optimization using pre-selection of criteria. Eng. Optim. 50(1), 120–144 (2018). https://doi.org/ 10.1080/0305215X.2017.1305374 11. Sizov, R., Simovici, D.A.: Type-Based Genetic Algorithms. In: Kotenko, I., Badica, C., Desnitsky, V., El Baz, D., Ivanovic, M. (eds.) IDC 2019, vol. 868, pp. 170–176. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-32258-8_19 12. Drezner, T., Drezner, Z.: Gender-specific genetic algorithms. INFOR: Inf. Syst. Oper. Res. 44(2), 117–127 (2006). https://doi.org/10.1080/03155986.2006.11732744
13. Kupriyanov, G., Isaev, I., Dolenko, S.: A gender genetic algorithm and its comparison with conventional genetic algorithm. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research VI, NEUROINFORMATICS 2022, vol. 1064. Springer, Cham (2023). https://doi. org/10.1007/978-3-031-19032-2_16 14. Kupriyanov, G.A., Isaev, I.V., Plastinin, I.V., Dolenko, T.A., Dolenko, S.A.: Decomposition of spectral contour into Gaussian bands using gender genetic algorithm. Proc. Sci. 429, 009 (2022). https://doi.org/10.22323/1.429.0009 15. DEAP 1.3.3 Documentation – Library Reference – Benchmarks, https://deap.readthedocs.io/ en/master/api/benchmarks.html. Accessed 19 June 2023
Modern Methods and Technologies in Neurobiology
Mean-Field Model of Brain Rhythms Controlled by Glial Cells Sergey V. Stasenko(B) and Tatiana A. Levanova(B) Lobachevsky State University of Nizhny Novgorod, Nizhny Novgorod 603022, Russia [email protected], [email protected]
Abstract. We propose a new mean-field model of brain rhythms controlled by glial cells. This theoretical framework describes how glial cells can regulate neuronal activity and contribute to the generation of brain rhythms. In the proposed model, glial cells modulate the excitability of neurons by releasing gliotransmitters. The model takes into account the collective behavior of a large population of neurons. This approach allows us to describe how the interactions between neurons and glial cells can give rise to different patterns of synchronized activity, such as oscillations and waves. We show that such modulation can lead to a change in the period and amplitude of oscillations in the population activity of neurons.
1
Introduction
Brain rhythms, also known as neural oscillations or brain waves, are the electrical patterns of activity in the brain. They are believed to play a crucial role in the information processing within the brain, and have been linked to a variety of cognitive and behavioral functions [1]. Different types of brain rhythms are associated with different cognitive processes. For example, the theta rhythm (4–7 Hz) has been linked to working memory and spatial navigation, while the gamma rhythm (30–80 Hz) has been associated with attention and perception [2]. There are several different ways in which brain rhythms can be regulated: neuronal feedback [3], neuromodulation [4], environmental factors [6] and glial cell regulation [5]. The last factor is the least studied one. The generation and regulation of brain rhythms is a complex process that involves the interactions between many different types of cells in the brain, including neurons and glial cells. Glial cells, which include astrocytes, oligodendrocytes, and microglia, were previously thought to be merely supportive cells for neurons. However, recent research has revealed that glial cells play a much more active role in brain functioning than it was previously thought [7]. Astrocytes make the greatest contribution to synaptic transmission regulation among glial cells [8–11]. They form bidirectional connections with neurons, using which astrocytes can affect the pre- and postsynaptic compartments of the synapse by releasing gliotransmitters (such as glutamate) in a calciumdependent manner [9,10,12]. This framework is known as a tripartite synapse. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 293–299, 2023. https://doi.org/10.1007/978-3-031-44865-2_31
When part of the neurotransmitter released from the presynaptic terminal binds to corresponding receptors on the astrocyte membrane, a cascade of biochemical reactions occurs [13]. As a result, gliotransmitters are released into the synaptic cleft and extrasynaptic space. Described process allows to modulate synaptic transmission. The functional role of astrocytes in neuronal dynamics has been extensively studied using different types of mathematical models. One such concept is the “dressed neuron,” which describes how changes in neural excitability mediated by astrocytes can impact neuronal functions [14,15]. Astrocytes have been proposed to act as frequency-selective “gate keepers” and presynaptic regulators. Corresponding gliotransmitters can effectively modulate presynaptic facilitation and depression [16,17]. The tripartite synapse model has been used to demonstrate how astrocytes participate in the coordination of neuronal signaling, particularly in spike-timing-dependent plasticity (STDP) and learning (see [18–25]). Both biophysically detailed models and mean-field models have also been used to study the astrocytic modulation of neuronal activity, revealing that functional gliotransmission is a complex phenomenon that depends on the nature of structural and functional coupling between astrocytic and synaptic elements [26–34]. Overall, brain rhythms are regulated by a complex interplay of neural and non-neural factors, and further research is needed to fully understand these mechanisms. In order to shed a light into some of these mechanisms we propose a new mean-field model that takes into account the collective behavior of a large population of neurons. The proposed model describes how glial cells modulate the excitability of neurons through the release of gliotransmitters, which can either enhance or inhibit neuronal activity. We show that such modulation can lead to a change in the period and amplitude of oscillations in the population activity of neurons.
2
The Model
Based on the results of the experimental studies mentioned in the previous section, we developed a new phenomenological two-population rate model consisting of an excitatory and an inhibitory population. This model implements crucial features of large populations of neurons in the brain, which allow E-I-based gamma oscillations to be reproduced: reciprocal connections between the excitatory and inhibitory populations, E-E self-connectivity within the excitatory population, and external stimuli applied to each population (I_E and I_I). The schematic representation of the biological mechanism underlying E-I-based gamma oscillations is presented in Fig. 1. The proposed model is based on the Wilson-Cowan formalism, which is described by a 2-dimensional system of ordinary differential equations (ODEs). The distinguishing feature of our model is an additional description of the dynamics of neurotransmitters for excitatory (glutamate) and inhibitory (GABAergic) synapses, which are set as constant values in the original Wilson-Cowan model. We also introduce the dynamics of gliotransmitter (glutamate),
which regulates the fraction of the released neurotransmitter in excitatory synapses.

Fig. 1. Overall scheme representing the phenomenon of interaction of excitatory and inhibitory populations.

Mathematically the proposed model can be described using the following 5-dimensional system of ODEs:

$$\begin{cases}
\tau_s^E \dfrac{ds_E}{dt} = -s_E + g(Y)\,\gamma_E r_E (1 - s_E) + s_0^E,\\
\tau_s^I \dfrac{ds_I}{dt} = -s_I + \gamma_I r_I (1 - s_I) + s_0^I,\\
\tau_r^E \dfrac{dr_E}{dt} = -r_E + f_E(I_E + w_{EE} s_E - w_{IE} s_I),\\
\tau_r^I \dfrac{dr_I}{dt} = -r_I + f_I(I_I + w_{EI} s_E - w_{II} s_I),\\
\dfrac{dY}{dt} = -\dfrac{Y}{\tau_Y} + \beta H_Y(s_E).
\end{cases} \quad (1)$$
Note that the proposed model is a mean-field model and describes synchronous activity of neuronal populations, not single neurons. Each population has a firing rate variable r_a(t) and a synaptic variable s_a(t), where a = E, I. The synaptic activity depends on the population firing rate, which is scaled by γ_a and saturates at 1, with a decay time constant τ_s^a and a background drive s_0^a. The synaptic rise time constant is τ_s^a/γ_a. The rate is a function of synaptic weights w_ab and a sigmoidal function f. Self- and cross-population targets have no difference in synaptic activation dynamics, so s_I is similar for inhibitory inputs to excitatory and inhibitory cells, and only the weights differ. The variable Y(t) describes the dynamics of gliotransmitter release, with relaxation time τ_Y = 1 s and an activation function H_Y(s_E):

$$H_Y(s_E) = \frac{1}{1 + e^{-(s_E - s_E^{thr})/k_Y}}. \quad (2)$$
Astrocyte activity leads to the release of gliotransmitter, which binds to presynaptic neuron receptors and changes the neurotransmitter release. The model introduces astrocytic influence on glutamate release through the function g(Y ), which includes the coefficient of astrocyte influence on synaptic connection γY and an activation threshold Ythr :
$$g(Y) = 1 + \frac{\gamma_Y}{1 + e^{-Y + Y_{thr}}}. \quad (3)$$
In our study the parameter γ_Y was chosen as the control parameter. The values of the remaining parameters were fixed as follows: s_0^E = 0.15, s_0^I = 0.1, τ_r^E = 2 ms, τ_r^I = 6 ms, τ_s^E = 3 ms, τ_s^I = 10 ms, I_E = 0.9, I_I = 0, γ_E = 4, γ_I = 8, w_EE = 3.5, w_EI = 5, w_II = 3, w_IE = 5, γ_Y = 0.305, β_Y = 10 ms^-1, τ_Y = 10 ms. The parameters of the model are chosen in such a way that, without astrocytic influence, the system (1) can demonstrate gamma oscillations (which corresponds to an oscillation frequency above 30 Hz). Most of the parameters in the model are dimensionless, as in the Wilson-Cowan model, with the exception of the time constants.
2.1 Results
To investigate the impact of astrocytes on gamma oscillations arising from the interaction of excitatory and inhibitory neuron populations, we examine different scenarios with varying levels of astrocytic influence on neurotransmitter release γ_Y. As can be clearly seen from Fig. 2, for a wide range of γ_Y values, synchronization and gamma oscillations emerge from the interplay between the excitatory and inhibitory populations. As γ_Y increases, astrocytic influence on neuronal activity intensifies, disrupting the balance of excitation/inhibition and causing a decrease in oscillation frequency and amplitude, finally leading to oscillation death. The obtained results correspond to experimental data on the occurrence of excitotoxicity.
Fig. 2. The dependence of firing rate of neuron populations (red curve – excitatory population of neurons, blue curve – inhibitory population of neurons) on various values of γY .
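A minimal numerical sketch of how the population dynamics in Fig. 2 could be reproduced is given below: system (1) with the modulation functions (2) and (3) is integrated by simple Euler stepping, with time measured in milliseconds. The parameter values follow the text of Sect. 2 where they are given; the rate functions f_E, f_I (taken here as logistic sigmoids), the constants s_E^thr, k_Y and Y_thr, and the integration scheme itself are illustrative assumptions.

```python
import numpy as np

def simulate(gamma_Y, T=200.0, dt=0.01):
    # Euler integration of system (1)-(3); time and time constants are in ms.
    p = dict(sE0=0.15, sI0=0.1, tau_rE=2.0, tau_rI=6.0, tau_sE=3.0,
             tau_sI=10.0, IE=0.9, II=0.0, gE=4.0, gI=8.0,
             wEE=3.5, wEI=5.0, wII=3.0, wIE=5.0,
             beta=10.0, tau_Y=10.0,              # beta in 1/ms, tau_Y in ms
             sE_thr=0.3, kY=0.05, Y_thr=0.5)     # illustrative values
    f = lambda x: 1.0 / (1.0 + np.exp(-x))       # sigmoidal rate function (assumption)
    HY = lambda sE: 1.0 / (1.0 + np.exp(-(sE - p["sE_thr"]) / p["kY"]))
    g = lambda Y: 1.0 + gamma_Y / (1.0 + np.exp(-Y + p["Y_thr"]))
    sE = sI = rE = rI = Y = 0.0
    rates = []
    for _ in range(int(T / dt)):
        dsE = (-sE + g(Y) * p["gE"] * rE * (1.0 - sE) + p["sE0"]) / p["tau_sE"]
        dsI = (-sI + p["gI"] * rI * (1.0 - sI) + p["sI0"]) / p["tau_sI"]
        drE = (-rE + f(p["IE"] + p["wEE"] * sE - p["wIE"] * sI)) / p["tau_rE"]
        drI = (-rI + f(p["II"] + p["wEI"] * sE - p["wII"] * sI)) / p["tau_rI"]
        dY = -Y / p["tau_Y"] + p["beta"] * HY(sE)
        sE, sI, rE, rI, Y = (sE + dt * dsE, sI + dt * dsI,
                             rE + dt * drE, rI + dt * drI, Y + dt * dY)
        rates.append((rE, rI))
    return np.array(rates)

# Scanning gamma_Y (the control parameter) illustrates how increasing astrocytic
# influence changes the frequency and amplitude of the E/I oscillations.
traces = {gY: simulate(gY) for gY in (0.0, 0.305, 1.0)}
```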
The observed results are determined by the existence of the feedback loop between neurons and glial cells and are independent of the complexity of the local dynamics of neurons and glial cells, of the specific characteristics of the neuron-glial interaction, and of the particular architecture of the neural network.
3
Conclusions
In this study we have proposed a new mean-field model of brain rhythms controlled by glial cells. The proposed model provides a theoretical framework for understanding how glial cells can regulate neuronal activity and contribute to the generation of brain rhythms. The novelty of our study is the extension of a classical Wilson-Cowan model that allows to take into account the impact of glial cells, particularly astrocytes, which can modulate the excitability of neurons by releasing chemical messengers called gliotransmitters. They can either enhance or inhibit the activity of neurons, depending on the specific type of glial cell and the conditions in which they are released. Overall, the mean-field model of brain rhythm controlled by glial cells provides a promising avenue for understanding the role of glial cells in the regulation of neuronal activity and brain rhythms. However, further research is needed to fully understand the complex interactions between neurons and glial cells in the brain, and to determine more precisely the specific mechanisms by which glial cells contribute to the generation and modulation of different brain rhythms. Funding. The study was supported by the Russian Science Foundation grant # 1972-10128.
References 1. Buzsaki, G., Draguhn, A.: Neuronal oscillations in cortical networks. Science 304, 1926–1929 (2004) 2. Jensen, O., Mazaheri, A.: Shaping functional architecture by oscillatory alpha activity: gating by inhibition. Front. Hum. Neurosci. 4, 186 (2010) 3. Buzs´ aki, G., Wang, X.: Mechanisms of gamma oscillations. Annu. Rev. Neurosci. 35, 203–225 (2012) 4. Hasselmo, M., Stern, C.: Theta rhythm and the encoding and retrieval of space and time. Neuroimage 85, 656–666 (2014) 5. Poskanzer, K., Yuste, R.: Astrocytes regulate cortical state switching in vivo. Proc. Natl. Acad. Sci. 113, E2675–E2684 (2016) 6. Buzs´ aki, G., Watson, B.: Brain rhythms and neural syntax: implications for efficient coding of cognitive content and neuropsychiatric disease. Dialogues Clin. Neurosci. 14(4), 345–367 (2022) 7. Verkhratsky, A., Nedergaard, M.: Physiology of astroglia. Physiol. Rev. 98, 239– 389 (2018) 8. Araque, A., Parpura, V., Sanzgiri, R.P., Haydon, P.G.: Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. Eur. J. Neurosci. 10(6), 208–215 (1998)
9. Araque, A., Parpura, V., Sanzgiri, R.P., Haydon, P.G.: Tripartite synapses: glia, the unacknowledged partner. Trends Neurosci. 22(5), 1484–1491 (1999) 10. Wittenberg, G.M., Sullivan, M.R., Tsien, J.Z.: Synaptic reentry reinforcement based network model for long-term memory consolidation. Hippocampus 12(5), 637–647 (2002) 11. Wang, X.J.: Synaptic basis of cortical persistent activity: the importance of NMDA receptors to working memory. J. Neurosci. 19(21), 9587–9603 (1999) 12. Haydon, P.G.: GLIA: listening and talking to the synapse. Nat. Rev. Neurosci. 2(3), 185–193 (2001) 13. Perea, G., Navarrete, M., Araque, A.: Tripartite synapses: astrocytes process and control synaptic information. Trends Neurosci. 32(8), 421–431 (2009) 14. Nadkarni, S., Jung, P.: Dressed neurons: modeling neural-glial interactions. Phys. Biol. 1, 35–41 (2004) 15. Nadkarni, S., Jung, P.: Modeling synaptic transmission of the tripartite synapse. Phys. Biol. 4, 1–9 (2007) 16. Volman, V., Ben-Jacob, E., Levine, H.: The astrocyte as a gatekeeper of synaptic information transfer. Neural Comput. 19, 303–326 (2006) 17. De Pitt´ a, M., Volman, V., Berry, H., Ben-Jacob, E.: A tale of two stories: astrocyte regulation of synaptic depression and facilitation. PLoS Comput. Biol. 7, e1002293 (2011). https://doi.org/10.1371/journal.pcbi.1002293 18. Postnov, D.E., Ryazanova, L.S., Sosnovtseva, O.V.: Functional modeling of neuralglial interaction. Biosystems 89, 84–91 (2007) 19. Amiri, M., Bahrami, F., Janahmadi, M.: Functional contributions of astrocytes in synchronization of a neuronal network model. J. Theor. Biol. 292C, 60–70 (2011) 20. Wade, J.J., McDaid, L.J., Harkin, J., Crunelli, V., Kelso, J.A.S.: Bidirectional coupling between astrocytes and neurons mediates learning and dynamic coordination in the brain: a multiple modeling approach. PLoS ONE 6, e29445 (2011). https:// doi.org/10.1371/journal.pone.0029445 21. Amiri, M., Hosseinmardi, N., Bahrami, F., Janahmadi, M.: Astrocyte-neuron interaction as a mechanism responsible for generation of neural synchrony: a study based on modeling and experiments. J. Comput. Neurosci. 34, 489–504 (2013). https:// doi.org/10.1007/s10827-012-0432-6 22. Pankratova, E., Kalyakulina, A., Stasenko, S., Gordleeva, S., Lazarevich, I., Kazantsev, V.: Neuronal synchronization enhanced by neuron-astrocyte interaction. Nonlinear Dyn. 97, 647–662 (2019). https://doi.org/10.1007/s11071-01905004-7 23. Stasenko, S., Hramov, A., Kazantsev, V.: Loss of neuron network coherence induced by virus-infected astrocytes: a model study. Sci. Rep. 13, 1–11 (2023) 24. Stasenko, S., Kazantsev, V.: Dynamic image representation in a spiking neural network supplied by astrocytes. Mathematics 11, 561 (2023) 25. Stasenko, S., Kazantsev, V.: Astrocytes enhance image representation encoded in spiking neural network. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 200–206. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19032-2 20 26. Gordleeva, S., Stasenko, S., Semyanov, A., Dityatev, A., Kazantsev, V.: Bidirectional astrocytic regulation of neuronal activity within a network. Front. Comput. Neurosci. 6, 92 (2012) 27. Pitt` a, M.: Gliotransmitter exocytosis and its consequences on synaptic transmission. In: De Pitt` a, M., Berry, H. (eds.) Computational Glioscience. SSCN, pp. 245–287. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-00817-8 10
28. Lenk, K., Satuvuori, E., Lallouette, J., Guevara, A., Berry, H., Hyttinen, J.: A computational model of interactions between neuronal and astrocytic networks: the role of astrocytes in the stability of the neuronal firing rate. Front. Comput. Neurosci. 13, 92 (2020) 29. Lazarevich, I., Stasenko, S., Kazantsev, V.: Synaptic multistability and network synchronization induced by the neuron-glial interaction in the brain. JETP Lett. 105, 210–213 (2017). https://doi.org/10.1134/S0021364017030092 30. Stasenko, S., Lazarevich, I., Kazantsev, V.: Quasi-synchronous neuronal activity of the network induced by astrocytes. Procedia Comput. Sci. 169, 704–709 (2020) 31. Barabash, N., Levanova, T., Stasenko, S.: STSP model with neuron-glial interaction produced bursting activity. In: 2021 Third International Conference Neurotechnologies and Neurointerfaces (CNN), pp. 12–15 (2021) 32. Stasenko, S., Kazantsev, V.: 3D model of bursting activity generation. In: 2022 Fourth International Conference Neurotechnologies and Neurointerfaces (CNN), pp. 176–179 (2022) 33. Barabash, N., Levanova, T., Stasenko, S.: Rhythmogenesis in the mean field model of the neuron-glial network. Eur. Phys. J. Spec. Top. 232, 529–534 (2023). https:// doi.org/10.1140/epjs/s11734-023-00778-9 34. Olenin, S., Levanova, T., Stasenko, S.: Dynamics in the reduced mean-field model of neuron-glial interaction. Mathematics 11, 2143 (2023)
Modeling Neuron-Like Agents with a Network Internal Structure Liudmila Zhilyakova(B) V. A. Trapeznikov Institute of Control Sciences, Russian Academy of Sciences, 65, Profsoyuznaya Street, 117997 Moscow, Russia [email protected]
Abstract. In the paper, we describe the model of heterogeneous agents with an internal structure and a set of parameters. These agents can generate an endogenous activity in definite time intervals. During the activation phase, the agent emits a specific mediator with an assigned color into a common medium. In case agents correspond to the biological neurons, this medium is an extracellular space, and mediators are neurotransmitters. Every agent has a set of colored receptors which are sensitive to a mediator of the same color. We consider three types of receptors and, correspondingly, three types of effects on agents. The first two types, activation and inhibition, are direct. They speed up or slow down the oscillatory regime of an agent. The third type is modulatory. This type of receptors, being activated, starts the chain of changes in the internal parameters of an agent, which, in turn, causes a change in the weights of receptors of the two first types. We show that using this representation of agents, it is possible to force an agent to generate the activity with any given time interval. Keywords: Heterogeneous Networks · Neural Circuits · Complex Agents · Half-center Oscillators
1 Introduction The aim of the paper is to introduce the model of complex neuron-like agent with an internal structure, which is able to generate the rhythmic activity with a predetermined time interval. The small sets of such agents can self-organize into small ensembles that behave like central pattern generators in simple nervous systems [1, 2]. The agents interact through modulators of different colors. All signals in the agent system are broadcast. If one agent, when activated, generates a signal of a certain color, then all agents that have receptors of this color receive this signal. In this model, we assume that “everyone hears everyone”, that is, we neglect the distance. It is believed that the mediator is sufficient for all agents. This assumption is biologically plausible for the small neural circuits. The diversity of neurotransmitters and neuron types plays an important role in nervous systems [3, 4]. In the activity of neural circuits, an important role is played not only by anatomical connections, but also by the chemical composition of the extracellular environment [5]. Changing the rhythm, switching from one rhythm to another, © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 300–307, 2023. https://doi.org/10.1007/978-3-031-44865-2_32
becomes possible for small neural circuits due to the effect of neuromodulation [1, 5, 6]. Neuromodulation can significantly vary such rhythm parameters as the duration of the activity and silence phases, and even transfer the neuron into the rhythmic mode from the non-rhythmic [7]. The paper is a continuation of research on building a model of heterochemical interactions of biological neurons [8]. On the other hand, with minor modifications, this model can be applied to simulate some aspects of the behavior of social network users [9–11]. Here we present a model of an agent that can itself generate endogenous rhythmic activity with a given time interval. Such memorization is characteristic of some biological neurons [12].
2 The Concept of a Heterogeneous Network of Complex Agents with an Internal Network Structure The article describes a conceptual basis for the heterogeneous network of complex agents with an internal network structure. Such agents are capable of endogenous activity caused by internal processes without the external influences. Agents have two types of sensitivity to external signals: a) direct, when the activity of neighbors can change the agent’s readiness to become active; b) indirect, when, under the influence of the environment, the agent changes its internal parameters, which include the rate of internal processes, the activation threshold, and the weight of receptors that perceive external signals. Networks of such agents with some modifications can serve as tools for simulating and studying complex interactions in various subject areas – from neurobiology (interaction of biological neurons) to social sciences (interaction of users of social networks). 2.1 Complex Agents: Their Structure and Interactions Let N = {1, …, N} be a set of agents with an internal structure. At each discrete time step t, agents can be passive or active according to one of m types. The activity type takes a value from the set C = {c1 , …, cm }. In the model, the type is specified by a color. We assume that the type of activity is a type of mediator (chemical or informational, depending on the semantics of a model). Agents exchange messages in discrete time T = {0, 1, …, t,…}. If at time t agent i was active by type cj , then at time t + 1 its activity will affect all its neighbors susceptible to this type of activity. An agent is susceptible to color cj if it has a receptor of the same color [8]. Inputs, that is, the ability to perceive information from the environment, and outputs, i.e., the type of activity and its intensity (Fig. 1), are called the external parameters of the agent. External Parameters. Agent i can have three types of receptors: excitatory, inhibitory, and modulatory receptors. Receptors of each type are characterized by color cj from set C = {c1 , …, cm } and weight wij , i.e., the strength of the influence on the agent. The excitatory and inhibitory receptors of agent i affect the internal parameter, the potential U i , which is responsible for the agent’s readiness for activation, increasing and decreasing it, respectively. If this parameter exceeds the threshold value Thi , the agent
becomes active. The modulatory receptor affects the activity of the agent indirectly: when a receptor of this type is activated by its color cj, the agent changes its external and internal parameters and begins to respond to external activity in a different way. One more external parameter, d_ij, is the intensity of activation of agent i, in other words, the quantity of the mediator emitted by the agent (Fig. 1).
Fig. 1. Agent i and its external parameters
Internal Parameters. The internal parameters of agent i, in accordance with the models [8–10], are:
– potential, or readiness for activation, U_i(t), which changes both under the influence of internal processes and under external influences;
– activation threshold Th_i.
In the present study, the internal processes of an agent are specified not by a linear law [8], but by a network of three nodes. Two of them form a constantly active half-center oscillator (HCO), i.e., two nodes with excitatory/inhibitory connections that are active in antiphase [13–15]. The HCO acts as an internal pacemaker. The more active the pacemaker is, the more often the agent becomes active. This oscillator acts on the third element, in which the potential U_i(t) is accumulated with the help of the memory mechanism. The third element is characterized by the memory depth, the threshold value, and the weights of the two receptors (Fig. 2).

Fig. 2. a. Agent i and its internal structure; b. the inner network of agent

When the accumulated memory (potential) exceeds the threshold value, the third element signals the activation of the agent. The formula for the potential U_i(t) of the third element with memory is as follows:

$$U_i(t) = \sum_{\theta=0}^{\Theta-1} \mu_{i\theta} \sum_{j=1}^{N} w_{ij} I_j(t-\theta),$$

where μ_{iθ} are non-negative discount coefficients satisfying the condition: $1 = \mu_{i0} > \mu_{i1} \ge \mu_{i2} \ge \ldots \ge \mu_{i\Theta}$,
μiθ ≥ 0, θ = 1, Θ. Θ is the maximal depth of memory. Number of zero elements μiθ can vary from 0 to Θ – 1 for different agents. Indicator I j (t – θ ) equals to 1 if one of the neighbors of agent i was active by type cj at time t – θ . If agent i activates at time t, then Ui (t + 1) resets to zero. When activated, the agent sends a signal of one of the types belonging to the set C. The choice of type depends on the semantics of the model. For the sake of simplicity, in this paper, we assume that each agent can be active by only one type. All signals coming to the agent’s receptors affect its internal network. Signals coming to excitatory and inhibitory receptors affect the potential of the third element (Fig. 2b), causing it to increase or decrease, respectively. This mechanism makes it possible to activate the agent under excitatory influences and avoid activation under inhibitory influences of sufficient strength. In this case, as soon as the signal coming to the excitatory and inhibitory receptors ends, the external effect on the potential stops. Thus, the agent changes behavior without long-term changes of structure. With an external impact on the modulatory receptors, the agent is able to change its internal parameters: the activity of the HCO, the weights of the receptors of the third element, the memory depth, and the threshold.
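As a minimal sketch, the accumulating third element could be implemented as follows. The names, the reset-by-clearing-the-memory rule, and the convention that inhibitory receptors enter with negative weights are our own modelling assumptions; the paper itself only specifies that U_i(t + 1) is reset to zero after activation.

```python
from collections import deque

class ThirdElement:
    # Accumulating element of the agent's inner network: discounted memory of
    # recent inputs, threshold activation, reset after firing.
    def __init__(self, weights, mu, threshold):
        self.w = weights                        # receptor weights w_ij
        self.mu = mu                            # discount coefficients, mu[0] = 1
        self.thr = threshold
        self.history = deque(maxlen=len(mu))    # most recent indicators first

    def step(self, indicators):
        # indicators[j] = 1 if input j is active at the current step, else 0
        self.history.appendleft(list(indicators))
        U = sum(m * sum(w * I for w, I in zip(self.w, inds))
                for m, inds in zip(self.mu, self.history))
        if U > self.thr:
            self.history.clear()                # firing resets the accumulated memory
            return 1                            # the agent becomes active
        return 0
```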
Fig. 3. Changes in the internal characteristics of the agent when exposed to receptors
Figure 3 shows how modulating actions can change the agent's parameters. If the nodes of the oscillator are affected (modulation of type 1), their sensitivity to each other changes and the agent begins to obey a new internal rhythm. With modulatory influences of type 2, a long-term change in the properties of the third node can occur: the weights of its receptors, the threshold, and the depth of memory might be changed. Agents with the indicated properties model small ensembles of biological neurons, in which the restructuring of rhythms is achieved without changing the structural connections. In addition, such agents are capable of remembering time intervals (i.e., generating activity at certain time intervals), as is typical of some biological neurons [12] (see Subsect. 2.2). 2.2 Generation of Rhythmic Activity and Memorization of Time Intervals To study the functioning of the agent's internal network, an ensemble of three nodes connected only by excitatory connections was simulated (Fig. 2b). With different memory depths and impact strengths, the third node, which transfers control outside the ensemble, demonstrates different activity patterns, transferring them, thereby, to the controlled agent. External states are activity or passivity of nodes; internal states are the values of their potentials (Fig. 4).
(Panels of Fig. 4: type of activity; U1(t), U2(t), U3(t).)
Fig. 4. External and internal states of the nodes of the inner network without memory
Figure 4 shows the results for a model without memory, in which the state of the third node is determined only by the influence of the HCO on the previous time step. Node 3 is activated every second step, in phase with node 2. By increasing the threshold for node 3 and adding memory to it, we can make it generate activity at any given intervals. An example with an interval of three steps is shown in Fig. 5. Here, we assume that w13 = 0, w23 > 0, so that only one of the two neurons of the HCO affects the third neuron. The histograms show the internal states. It can be seen that on every second of the three time steps, the potential decreases due to the introduced discount factor that simulates forgetting.
(Panels of Fig. 5: type of activity; U1(t), U2(t), U3(t).)
Fig. 5. Three-beat interval. Node 3 has depth of memory Θ = 3 and threshold Th = 0.5
The second series of experiments was carried out for a network with two positive weights of the receptors of the third element: w13 > 0, w23 > 0. A further increase in the memory threshold and depth leads to an increase in the intervals (Fig. 6). Histograms in Fig. 6 clearly demonstrate the operation of the memory of the third element. In the initial state, the activity is determined. All three elements are active: the first and third according to the second type, the second according to the first (top chart). Then the internal states start changing. The figure shows the case when a separate memory is formed for each type of activity and they do not mix.
Despite the fact that two half-center neurons were active at the same time (t = 0), they soon become synchronized so that they begin to activate in antiphase (t = 2). The potential of the first type (blue) accumulates in the third element; the potential of the second type (orange) gradually goes out. This happens because the second element of the half-center "beat" the first one, and they both began to generate activity of the first type (blue).
(Panels of Fig. 6: type of activity; U1(t), U2(t), U3(t).)
Fig. 6. Five beat interval. Node 3 has depth of memory Θ > 5 and threshold Th = 1.1
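The dependence of the firing interval on the memory depth and threshold can be illustrated with a toy version of the three-node ensemble. Here the two HCO nodes are simply forced to alternate (in the model their antiphase activity emerges from the coupling), and node 3 accumulates discounted input from them and fires when its potential exceeds the threshold. All weights, discount coefficients and thresholds below are illustrative, so the exact spike patterns of Figs. 4-6 are not claimed to be reproduced.

```python
def firing_times(steps=30, w13=0.0, w23=1.0,
                 mu=(1.0, 0.4, 0.2), threshold=0.5):
    # Nodes 1 and 2 alternate (antiphase HCO); node 3 accumulates discounted
    # input over the last len(mu) steps and fires above the threshold.
    history, spikes = [], []
    for t in range(steps):
        i1, i2 = (1, 0) if t % 2 == 0 else (0, 1)
        history.insert(0, (i1, i2))
        del history[len(mu):]                       # keep only Theta recent steps
        U3 = sum(m * (w13 * a + w23 * b) for m, (a, b) in zip(mu, history))
        if U3 > threshold:
            spikes.append(t)
            history.clear()                         # firing resets the memory
    return spikes

# Raising the threshold (and memory depth) stretches the interval between firings:
print(firing_times(threshold=0.5))   # node 3 fires every other step
print(firing_times(threshold=1.1))   # node 3 fires at longer intervals
```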
3 Conclusion The conceptual model of agents with internal networks is introduced. The internal networks of all agents are of the same type. They consist of half-center oscillator and the third neuron that accumulates the impact of HCO and excites the agent with some time intervals. The excitation frequency depends on the properties of the HCO and of the third element (memory depth and threshold value). So, by increasing and decreasing memory and threshold, we can get an agent with different properties (more or less excitable). Thus, the proposed model reflects the effect of modulation, a long-term change in properties as a result of external influences. Besides, for agents with an internal network structure, when acting on modulating receptors, it is possible to achieve such a value of
internal parameters that the agent will reproduce the rhythm with a given time interval. This is a key property not only for remembering time intervals, but also for the functional restructuring of networks and activity patterns in ensembles without costly changes in structural relationships.
References 1. Harris-Warrick, R.M.: Neuromodulation and flexibility in central pattern generator networks. Curr. Opin. Neurobiol. 21(5), 685–692 (2011) 2. Marder, E., Bucher, D.: Central pattern generators and the control of rhythmic movements. Curr. Biol. 11(23), 986–996 (2001) 3. Dyakonova, T.L., Sultanakhmetov, G.S., Mezheritskiy, M.I., et al.: Storage and erasure of behavioural experiences at the single neuron level. Sci. Rep. 9(1), 14733 (2019) 4. Aonuma, H., Mezheritskiy, M., Boldyshev, B., et al.: The role of serotonin in the influence of intense locomotion on the behavior under uncertainty in the mollusk lymnaea stagnalis. Front. Physiol. 11, N Art. 221 (2020) 5. Bargmann, C.I.: Beyond the connectome: how neuromodulators shape neural circuits. BioEssays 34, 458–465 (2012) 6. Marder, E., Weimann, J.M.: Modulatory Control of Multiple Task Processing in the Stomatogastric Nervous System: Neurobiology of Motor Programme Selection: Pergamon, pp. 3–19 (1992) 7. Turrigiano, G., LeMasson, G., Marder, E.: Selective regulation of current densities underlies spontaneous changes in the activity of cultured neurons. J. Neurosci. 15(5), 3640–3652 (1995) 8. Bazenkov, N.I., Boldyshev, B.A., Dyakonova, V., et al.: Simulating small neural circuits with a discrete computational model. Biol. Cybern. 114, 349–362 (2020) 9. Zhilyakova, L.: Model of heterogeneous interactions between complex agents. From a neural to a social network. In: Samsonovich, A., Klimov, V. (eds.) Biologically Inspired Cognitive Architectures (BICA) for Young Scientists. BICA 2017. AISC, vol. 636, pp. 213–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-63940-6_30 10. Zhilyakova, L.Y.: Modeling the structure of MIMO-agents and their interactions. In: Kuznetsov, S., Panov, A. (eds.) Artificial Intelligence. RCAI 2019. CCIS, vol. 1093, pp. 3–16. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30763-9_1 11. Zhilyakova, L., Gubanov, D.: Double-threshold model of the activity spreading in a social network. The case of two types of opposite activities. In: Proceedings of the 11th IEEE International Conference on Application of Information and Communication Technologies AICT2017, vol. 2, pp. 267–270 (2017) 12. Narain, D., Remington, E.D., Zeeuw, C.I.D., et al.: A cerebellar mechanism for learning prior distributions of time intervals. Nat. Commun. 9, 469 (2018) 13. Emelin, A., Korotkov, A., Levanova, T., Osipov, G.: Motif of two coupled phase equations with inhibitory couplings as a simple model of the half-center oscillator. In: Balandin, D., Barkalov, K., Meyerov, I. (eds.) Mathematical Modeling and Supercomputer Technologies. MMST 2022. CCIS, vol. 1750, pp. 82–94. Springer, Cham (2022). https://doi.org/10.1007/ 978-3-031-24145-1_7 14. Ausborn, J., Snyder, A.C., Shevtsova, N.A., Rybak, I.A., Rubin, J.E.: State-dependent rhythmogenesis and frequency control in a half-center locomotor CPG. J. Neurophysiol. 119(1), 96–117 (2018) 15. Boldyshev, B.A., Zhilyakova, L.Y.: Neuromodulation as a control tool for neuronal ensembles. Control Sci. 2, 60–67 (2021)
Cognitive Functions of Cerebellum and Educational Neuroscience Vladislav Dorofeev(B) SRISA, Moscow, Russia [email protected]
Abstract. Traditionally, the cerebellum was associated with motor functions, but with the development of methods of psychological and neurophysiological research, evidence of the influence of cerebellar dysfunction on cognitive functions has been revealed. The facts confirming the participation of the cerebellum in the development and consolidation of new cognitive skills and creative processes were also revealed. Anatomical, genetic, morphological and neurophysiological studies of the connections, evolution and interaction of the cerebellum with cognitive areas of the cerebral cortex confirm the involvement of the cerebellum in cognitive functions. Based on these studies, various models of the involvement of the cerebellum in cognitive functions are being developed. At the same time, the area of application of neuroscientific knowledge about the brain and neurotechnologies in education, educational neuroscience, is actively developing. So far, it largely consolidates the results of behavioral research, but with the development of available neurotechnologies, it begins to find direct application in the educational process. On the one hand, the results of studying the cognitive functions of the cerebellum can be useful for education, on the other hand, the study of education processes using modern neurotechnologies can be useful for studying the cerebellum. The article provides a brief overview of modern research in these areas and considers the possibilities of their joint development. Keywords: cerebellum
· educational neuroscience · cognitive functions
1
Cerebellum in Cognitive Functions
Since the 1980s, quite a lot of facts have accumulated about the connection between disorders of the cerebellum and autistic spectrum disorders. In [1], a post-mortem study of the brain of a patient with autism and a patient who died of another cause was carried out. In the brain of a patient with autism, a significant deficiency of cerebellar cells was revealed. In [2], the functions of the cerebellum were blocked in mice by a chemogenetic method; mice with blocked functions demonstrated autistic impairments in activity. A modern review of the relationship between cerebellar disorders and autism is given in [3]. Also, the facts of the relationship of disorders of the cerebellum with schizophrenia were studied and the mechanisms of these disorders were revealed to the molecular c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 308–313, 2023. https://doi.org/10.1007/978-3-031-44865-2_33
level [4]. In [5], the case of a patient, a 49-year-old lawyer, who suffered a stroke in the right side of the cerebellum is considered. The violations of the following nonmotor functions were revealed: practice-related learning, error detection. In [6], in tests for the specific use of working memory with a delay in stimulus presentation, problems with working memory were noted in mice that were influenced by optogenetic methods on cerebellar Purkinje cells. In [7], linguistic tests were carried out on patients with cerebellar disorders and the control group. The Oral Sentence Production Test showed no difference. Test of Language CompetenceExpanded revealed the following disorders in patients in relation to the control group: (1) violation of the automatic adaptation of grammatical and semantic abilities to the linguistic context when constructing sentences (2) violation of automatic adaptation to the linguistic context when interpreting sentences (3) impairment of cognitive processes necessary for such linguistic skills as analysis and sequential logical thinking In 2015 [9], an fMRI study of a drawing person was carried out on an MRIsafe tablet. An increased activity of the cerebellum was revealed when solving creative problems. At the same time, the hypothesis about the participation of the cerebellum by V. L. Dunin-Barkovsky was expressed at a lecture at MEPhI in 2010 [8]. At Massachusetts General Hospital, the study and treatment of cognitive impairment in cerebellar injuries has been carried out at the Ataxia Center under the direction of Jeremy Schmachmann since 1994. For the treatment of disorders, mental exercises and electromagnetic stimulation of the cerebellum are used. In clinical practice, the concept of Cognitive cerebellar affective syndrome, also called Schmachmann’s syndrome, is used with methodology to measure it [10]. The first hypothesis about the participation of the cerebellum in cognitive functions was proposed by the Leiner spouses together with the well-known researcher in the field of cerebellar neurophysiology Robert Dow [11]. The Leiners came to neurophysiology from developing computers in DARPA projects, they proposed an analogy between the computer’s central processing unit and the cerebellum. Paying attention to the fact that the human cerebellum has grown significantly compared to simpler mammals, as well as to the anatomical connections of the cerebellum with the prefrontal cortex (through the thalamus), they suggested that the cerebellum also operates information processes in the brain as motor functions. Their hypothesis was criticized, but it was supported by Masao Ito, who since the early 60s, together with Nobel laureate John Eccles, has been studying the hypothesis that the cerebellum is the brain’s CPU [12]. With the development of methods for studying the brain, confirmation of the anatomical and neurophysiological connection of the cerebellum with the prefrontal cortex appeared [13,14]. In [15], a comprehensive overview of them as of 2013 is given.
In 2008, Masao Ito published an article with a hypothesis about how the cerebellum, performing the functions of direct and reverse modeling of mental models, is involved in general cognitive activity, presenting his model of human thought [16]. Jeremy Schmachmann, based on the ideas of Leiner, Ito and clinical practice, formulated two theories: the theory of universal transformation of the cerebellum and the theory of thinking dysmetria. The theory of universal transformation of the cerebellum states that the cerebellum performs the same transformation function for all signals entering it, motor and non-motor. And the theory of thinking dysmetria states that when the parts of the cerebellum associated with cognitive functions are disturbed, there are violations in thinking, similar to violations of motor functions [10]. The work [17] presents an ANN-model of the participation of the cerebellum in motor and cognitive activity. In it, the cerebellum learns a real-world reward model and then, using this model, produces an intermediate reward for the agent. This helps solve the credit assignment problem in reinforcement learning. Disabling this feature results in the motor and cognitive impairments that are common in patients with cerebellar disorders. The neural network architecture of Decoupling Neural Interfaces used in [17] was taken from the earlier work of the DeepMind research group [18]. In [19] the same team introduced more complex ANN-model of task acquisition, switching and consolidation with the cerebellum.
2 Educational Neuroscience: Current State and Problems
The idea of using brain sciences in the educational process has been known for a long time and has already found practical application, for example, in one of the most popular Coursera courses, "Learning how to learn" [20]. Since the mid-1960s, the results of cognitive psychology have been actively used in education. Since the early 2000s, publications began to appear about the possibilities of using the results of neuroscience in education. The debate between neuroscientists and psychologists about the usefulness of neuroscience for education continues: neuroscientists argue that the use of neuroscience increases the scientific validity of recommendations, while psychologists argue that no really useful recommendations from neuroscience have yet been obtained, whereas behavioral analysis is successfully implemented in educational instructions [21–23]. Nevertheless, neurotechnologies are being actively introduced into educational practice [24]. For example, in China, systems are being introduced that are capable of monitoring the attention of students during lessons based on EEG monitoring; procedures to encourage attention are then envisaged on the basis of the data collected in this way [25].
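As a hedged illustration of how an EEG-based attention score of the kind mentioned above might be computed, the sketch below uses the beta/(alpha+theta) "engagement index", a common heuristic from the EEG literature; it is not the specific method of the systems described in [25], and the band limits, sampling rate and synthetic signal are assumptions.

```python
import numpy as np
from scipy.signal import welch

def engagement_index(eeg: np.ndarray, fs: float = 256.0) -> float:
    """Return beta power / (alpha + theta power) for a 1-D EEG segment."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs))
    df = freqs[1] - freqs[0]
    def band_power(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return psd[mask].sum() * df
    theta, alpha, beta = band_power(4, 8), band_power(8, 13), band_power(13, 30)
    return beta / (alpha + theta)

eeg = np.random.default_rng(3).normal(size=10 * 256)   # 10 s of synthetic "EEG" at 256 Hz
print(round(engagement_index(eeg), 3))
```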
3 Applications of Cognitive Cerebellum Research for Educational Neuroscience
The results of studies of cerebellar cognitive functions reviewed above (the impact of cerebellar disorders on autism spectrum disorders [3], schizophrenia [4], working memory [6], linguistic skills [7], creative processes [9], and the acquisition and consolidation of cognitive skills [19]) open up new opportunities for improving the educational process. So far, research on the cognitive functions of the cerebellum has more to do with clinical practice than with developmental psychology and education, although research on the relationship of dyslexia [26] and of dysgraphia and procedural learning [27] with the cerebellum has already been applied in educational practice [28]. In diagnostics, in addition to traditional psychological methods for detecting cognitive impairment, the Schmahmann scale can be used to detect cognitive impairment associated with the cerebellum [10].
Sometimes general recommendations on the educational process are possible, for example, the harmonious development of motor and cognitive functions to improve the quality of cerebellar functioning. In 1982, Dr. Belgau developed the Learning Breakthrough Program, a set of balancing exercises using the Belgau board [29]. This program has found application in speech therapy practice and, according to some researchers, also helps to improve cognitive abilities [30]. More often, however, applying research results in the educational process requires mass-available neurotechnological tools for monitoring the state of the cerebellum and improving the quality of its work, as in the above case of attention monitoring [25]. Such tools are still under research and development. The articles [31] and [32] provide examples of the use of EEG to analyze the effect of transcranial magnetic stimulation of the cerebellum on the functioning of the cortico-cerebellar network.
4 Conclusion
Using the results of research on the cognitive functions of the cerebellum in education can be useful both for education and for basic research on the mysteries of the cerebellum: education receives scientifically grounded tools to improve the quality of learning, while the neuroscience of the cerebellum receives an extensive experimental base for development. At the same time, instead of disputes between psychologists and neuroscientists about priority in this area, it is advisable to move toward a unified systemic science of the brain, which implies the coordinated development of methods and tools for collecting and processing information about the real world (the implementation of neurotechnologies in the educational process) and of multi-scale models of the brain, from the level of higher mental functions down to the molecular level.
Acknowledgements. The work is financially supported by State Program of SRISA RAS No. FNEF-2022-0003.
References
1. Bauman, M., Kemper, T.: Histoanatomic observations of the brain in early infantile autism. Neurology 35(6), 866–866 (1985). https://doi.org/10.1212/WNL.35.6.866
2. Stoodley, C.J., et al.: Altered cerebellar connectivity in autism and cerebellar-mediated rescue of autism-related behaviors in mice. Nat. Neurosci. 20(12), 1744–1751 (2017). https://doi.org/10.1038/s41593-017-0004-1
3. Bloomer, B.F., Morales, J.J., Bolbecker, A.R., Kim, D.J., Hetrick, W.P.: Cerebellar structure and function in autism spectrum disorder. J. Psychiatry Brain Sci. 7, e220003 (2022). https://doi.org/10.20900/jpbs.20220003
4. Yeganeh-Doost, P., Gruber, O., Falkai, P., Schmitt, A.: The role of the cerebellum in schizophrenia: from cognition to molecular pathways. Clinics 66(Supplement 1), 71–77 (2011). https://doi.org/10.1590/S1807-59322011001300009
5. Fiez, J.A., Petersen, S.E., Cheney, M.K., Raichle, M.E.: Impaired non-motor learning and error detection associated with cerebellar damage: a single case study. Brain 115(1), 155–178 (1992). https://doi.org/10.1093/brain/115.1.155
6. Deverett, B., Kislin, M., Tank, D.W., Wang, S.S.H.: Cerebellar disruption impairs working memory during evidence accumulation. Nat. Commun. 10(1), 3128 (2019). https://doi.org/10.1038/s41467-019-11050-x
7. Guell, X., Hoche, F., Schmahmann, J.D.: Metalinguistic deficits in patients with cerebellar dysfunction: empirical support for the dysmetria of thought theory. The Cerebellum 14(1), 50–58 (2014). https://doi.org/10.1007/s12311-014-0630-z
8. Dunin-Barkowski, W.L.: Theory of cerebellum. ResearchGate (2010). https://www.researchgate.net/publication/278785563_Theory_of_cerebellum
9. Saggar, M., et al.: Pictionary-based fMRI paradigm to study the neural correlates of spontaneous improvisation and figural creativity. Sci. Rep. 5(1), 1–11 (2015). https://doi.org/10.1038/srep10894
10. Hoche, F., Guell, X., Vangel, M.G., Sherman, J.C., Schmahmann, J.D.: The cerebellar cognitive affective/Schmahmann syndrome scale. Brain 141, 248–270 (2018). https://doi.org/10.1093/brain/awx317
11. Leiner, H.C., Leiner, A.L., Dow, R.S.: Does the cerebellum contribute to mental skills? Behav. Neurosci. 100(4), 443 (1986). https://doi.org/10.1037/0735-7044.100.4.443
12. Eccles, J.C., Ito, M., Szentágothai, J.: The Cerebellum as a Neuronal Machine. Springer, New York (1967). https://doi.org/10.1007/978-3-662-13147-3
13. Leiner, H.C., Leiner, A.L., Dow, R.S.: Cognitive and language functions of the human cerebellum. Trends Neurosci. 16(11), 444–447 (1993). https://doi.org/10.1016/0166-2236(93)90072-T
14. Middleton, F.A., Strick, P.L.: Anatomical evidence for cerebellar and basal ganglia involvement in higher cognitive function. Science 266(5184), 458–461 (1994). https://doi.org/10.1126/science.7939688
15. Buckner, R.: The cerebellum and cognitive function: 25 years of insight from anatomy and neuroimaging. Neuron 80(3), 807–815 (2013). https://doi.org/10.1016/j.neuron.2013.10.044
16. Ito, M.: Control of mental activities by internal models in the cerebellum. Nat. Rev. Neurosci. 9(4), 304–313 (2008). https://doi.org/10.1038/nrn2332
17. Boven, E., Pemberton, J., Chadderton, P., Apps, R., Costa, R.P.: Cerebro-cerebellar networks facilitate learning through feedback decoupling. Nat. Commun. 14(1), 51 (2023). https://doi.org/10.1038/s41467-022-35658-8
18. Jaderberg, M., et al.: Decoupled neural interfaces using synthetic gradients. In: International Conference on Machine Learning, pp. 1627–1635. PMLR (2017). https://proceedings.mlr.press/v70/jaderberg17a.html
19. Pemberton, J., Chadderton, P., Costa, R.P.: Cerebellar-driven cortical dynamics enable task acquisition, switching and consolidation. bioRxiv (2022). https://doi.org/10.1101/2022.11.14.516257
20. Oakley, B., Sejnowski, T., McConville, A.: Learning How to Learn: How to Succeed in School Without Spending All Your Time Studying; A Guide for Kids and Teens. Penguin (2018)
21. Bowers, J.S.: The practical and principled problems with educational neuroscience. Psychol. Rev. 123(5), 600–612 (2016). https://doi.org/10.1037/rev0000025
22. Howard-Jones, P.A., et al.: The principles and practices of educational neuroscience: commentary on Bowers (2016). Psychol. Rev. 123(5), 620–627 (2016). https://doi.org/10.1037/rev0000036
23. Farmakopoulou, I., Theodoratou, M., Gkintoni, E.: Neuroscience as a component in educational setting. An interpretive overview. Technium Educ. Humanit. 4, 1–7 (2023). https://doi.org/10.47577/teh.v4i.8236
24. Alibigloo, H.P., Alipoor, J.: A critical study on the researches about the application of neurothecnology in education. Authorea Preprints (2023). https://doi.org/10.22541/au.167416698.81849768/v1
25. Tang, H., Dai, M., Du, X., Hung, J.L., Li, H.: An EEG study on college students' attention levels in a blended computer science class. Innov. Educ. Teach. Int. (2023). https://doi.org/10.1080/14703297.2023.2166562
26. Nicolson, R., Fawcett, A.J., Dean, P.: Dyslexia, development and the cerebellum. Trends Neurosci. 24(9), 515–516 (2001). https://doi.org/10.1016/S0166-2236(00)01923-8
27. Nicolson, R.I., Fawcett, A.J.: Dyslexia, dysgraphia, procedural learning and the cerebellum. Cortex 47(1), 117–127 (2011). https://doi.org/10.1016/j.cortex.2009.08.016
28. Borges, M., et al.: Deviations from a typical development of the cerebellum in youth are associated with psychopathology, executive functions and educational outcomes. Psychol. Med. 1–11 (2022). https://doi.org/10.1017/S0033291722002926
29. Belgau, F., Belgau, B.V.: Learning Breakthrough Program. Balametrics, Port Angeles (1982)
30. Reynolds, D., Nicolson, R.I., Hambly, H.: Evaluation of an exercise-based treatment for children with reading difficulties. Dyslexia 9(1), 48–71 (2003). https://doi.org/10.1002/dys.235
31. Gaßmann, L., Gordon, P.C., Ziemann, U.: P 58 EEG responses from direct cerebellar activation with Transcranial Magnetic Stimulation. Clin. Neurophysiol. 137, e1–e69 (2022). https://doi.org/10.1016/j.clinph.2022.01.089
32. Sasaki, R., Hand, B.J., Liao, W.Y., et al.: Utilising TMS-EEG to assess the response to cerebellar-brain inhibition. Cerebellum 22, 544–558 (2022). https://doi.org/10.1007/s12311-022-01419-y
The Role of Pulvinar Nucleus as a Synchronizer of Cortical Activity for Visual Target Detection is Caused by Its Functioning as Superior Colliculus–Cortex Intermediary
I. A. Smirnitskaya(B)
Scientific Research Institute for System Analysis of the Russian Academy of Sciences, 117218 Moscow, Russia
[email protected]
Abstract. The brain appeared for movement control; evolution has turned its function into behavior control. In the course of evolution, the brain has become an extremely complex, hierarchically organized system. To create a simplified model that algorithmically describes its structure, it was proposed to construct two parallel sequences: a sequence of models of brain structure representing the stages of its phylogeny, and a sequence of models of animal behavior at the corresponding stages of phylogeny. As an illustration of the usefulness of such an approach for creating an operational model of a specific brain region, the functions of the thalamic pulvinar nucleus in the control of visually guided behavior are discussed. It is concluded that the experimentally discovered role of the pulvinar as an initiator and synchronizer of parieto-frontal interactions is due to its main input signals coming from the superior colliculus and the pretectum. Keywords: evolution · pulvinar nucleus · superior colliculus/optic tectum · visually guided behavior · zebrafish · parieto-frontal synchronization
1 Introduction
An experimenter who registers the characteristics of a dynamic process and who must, depending on the data obtained, adjust the registration parameters can do this directly in the course of the work, in real time. If the measured variables have a complicated structure that requires a significant change in the measurement process, it is also possible, after a certain number of measurements, to stop the registration, subject the obtained data to more complex processing that takes longer, and then adjust the experimental setup in accordance with the results. Let us look from this point of view at how the brain controls a visually guided behavioral sequence.
2 Visually Guided Prey Hunting in Larval Zebrafish
In recent years, predatory hunting by the larval zebrafish has been analyzed in detail [1]. As soon as it starts to swim, the larva begins hunting paramecia. A moving larva distinguishes a potential prey from a dangerous predator: a small spot with an angular size of 3–6°, at a distance of about 3 mm, located within 0 to 60° from the midline on either side, is prey; if the size is larger, it is a dangerous predator [1]. Fish lack a visual cortex; their main visual area that retinotopically displays the outside world is the optic tectum, the equivalent of the mammalian superior colliculus. The tectum receives input from the output cells of the retina (retinal ganglion cells). Tectal neurons project to reticulospinal neurons in the midbrain and hindbrain that are directly connected to the motoneurons (Fig. 1A, B). Next to the tectum is the pretectum, which receives inputs from the same retinal cells. The roles of the tectum and the pretectum are different: the tectum plays the role of a map that determines the position of, and direction to, the prey or the source of danger, and is a rough classifier of objects depending on their size and position in the visual field, whereas the pretectum is more responsible for movement and has larger visual receptive fields than the tectum [2].
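The innate classification rule just described can be written out as a tiny decision function; the thresholds follow [1], but the function itself is only an illustrative sketch, not a published model.

```python
def classify_visual_object(angular_size_deg: float, azimuth_deg: float) -> str:
    """Classify a spot in the larva's frontal visual field as prey, predator, or neither."""
    if abs(azimuth_deg) > 60:        # outside the frontal field used for hunting
        return "ignore"
    if 3.0 <= angular_size_deg <= 6.0:
        return "prey"
    if angular_size_deg > 6.0:
        return "predator"
    return "ignore"

print(classify_visual_object(4.5, 20))   # -> prey
print(classify_visual_object(15.0, 10))  # -> predator
```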
Fig. 1. A. Brain regions of the zebrafish larva that control prey hunting; B. Diagram of connections between these areas.
The nucleus isthmi is the third structure: when it is lesioned, hunting loses the character of goal-directed behavior [3]. It is the analog of the parabigeminal nucleus of mammals [4]. This cholinergic structure is located in the midbrain tegmentum, near other cholinergic nuclei (pedunculopontine and laterodorsal). The population of retinal cells that sends axons to the tectum and pretectum shows selectivity to prey-like stimuli [1].
Dynamic characteristics of prey hunting are determined by the interaction between the tectum, the pretectum and the cholinergic nucleus isthmi [1, 5, 6] (Fig. 2). Hunting is an innate behavior divided into stereotypical stages, schematically presented in Fig. 2: the preliminary stages (displaying an object on the retina, recognizing prey in it, and determining its angular coordinate), the larva's turn towards the prey (a change of orientation by a "J-shaped turn"), slow movement in the direction of the prey, and finally the capture of the paramecium [1].
Fig. 2. Stages of hunting behavior of the zebrafish larva. Left: stage 1, detection of the prey and recognition of the object of hunting. Next: stage 2, a "J-turn" towards the prey, produced by a characteristic bending of the tail in the shape of the letter J. Next: stage 3, movement of the fish toward the prey to within striking distance. Right: the strike, ending with suction of the infusorian.
3 Hierarchy of Primate (and Human) Brain Regions that Perceive a Visual Object and Control Visually Guided Movements
How do visual signals trigger the activity of the brain areas that control human movements? It is generally believed that in higher animals visual perception is carried out by sequential analysis of visual signals in the areas of the visual cortex, striate and extrastriate. The signal from the retina goes to the visual nucleus of the thalamus, the lateral geniculate body, and from it to the visual areas of the cortex in hierarchical order: first to V1, then to V2, then V3, and then to V4 and the inferotemporal areas. However, it turned out that the brain regions that lower vertebrates use for vision are also present and functioning in mammals [7]. What role do they play in this case?
Box 1. An evolutionary behavioral approach to creating a model of the structure of the brain.
The approach we described earlier [8] helps to understand the relationship between different visual areas. Namely, the roles and interaction algorithms of different subdivisions of the brain can be understood by considering that, since the brain is a behavior control device that becomes more complex in the course of evolution together with the complication of behavior, it is useful to build two sequences in parallel: models of animal behavior at successive stages of evolution, and models of the successive stages of the phylogeny of the brain, i.e., the evolution of the device that controls this behavior.
Statement 1. Each subsequent step of evolution that changes the structure of the brain does not cancel or change the functions of the structure present at the previous step (if it has been preserved and is functioning).
Statement 2. Each subsequent step of evolution that changes the structure of a certain part of the brain builds a new level on top of the device that existed before it. The purpose of the new level is to correct the work of the previous device so that its function is performed only if additional conditions are met. In some cases, evolution may add a completely new part.
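Statements 1 and 2 can be illustrated with a toy layered controller in which a newer level does not replace the older device but only gates when it is allowed to act; the class names and the gating condition are illustrative assumptions, not taken from [8].

```python
from dataclasses import dataclass

@dataclass
class OldDevice:
    def act(self, stimulus: str) -> str:
        # innate, reflex-like mapping that later evolution leaves intact (Statement 1)
        return {"prey": "approach", "looming": "escape"}.get(stimulus, "ignore")

@dataclass
class NewLevel:
    old: OldDevice
    def act(self, stimulus: str, context_allows: bool) -> str:
        # the new level only permits the old function under additional conditions (Statement 2)
        action = self.old.act(stimulus)
        return action if context_allows else "suppress"

brain = NewLevel(OldDevice())
print(brain.act("prey", context_allows=True))    # approach
print(brain.act("prey", context_allows=False))   # suppress
```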
Let us go back to the analogy drawn at the beginning between the modes of operation of an experimental setup and the ways in which the brain controls behavior. The second mode, in which the setup first collects data, then stops, the data are processed, the operating parameters are adjusted and the setup starts again, does not resemble the work of the brain, since purposeful behavior is continuous. Yet in our brain, when each new task is performed, learning occurs, during which behavior slows down. In computer modeling of the brain, learning is treated as a separate mode of operation of the model. It is for learning that new areas of the cerebral cortex appear in the course of evolution. The peculiarity of the brain, however, is that learning and behavior control occur simultaneously.
Earlier, in [9], it was suggested that the visual picture or visual scene in front of a person's eyes is represented by the distributed activity of neurons in several brain regions. First, there is a general sketch of the scene. For Shishkin's painting "Morning in a Pine Forest", for example, this is the pine forest playing the role of context, the silhouette of a fallen tree obliquely crossing the scene, and the figures of the cubs playing on the tree. It can be assumed that the general sketch is represented in the posterior parietal cortex. Second, in another area of the brain, the objects in the picture, that is, the tree and the cubs, are represented separately with high resolution; these should be object-representing areas such as V4, TE, and TEO. Third, since we are discussing the mechanism of the act of gazing taking place at the moment, the early visual areas V1, V2, and V3 should be active; their function is to display the local characteristics of objects (contour, texture, color, etc.), giving the visible picture a sense of momentary authenticity. The first and second points represent the neural activity of areas that store the memory of previously seen scenes and objects; the third point describes the activity of areas that display small details of what is happening in real time. This picture lacks a basis: a representation, albeit very rough and of low resolution, of the entire scene standing before our eyes at the moment, which will disappear and be replaced by a new one if we turn our eyes in another direction. In [9] it was argued that this role is played by the pulvinar thalamic
nucleus, whose main subcortical inputs are the tectum (the superior colliculus of mammals) and the pretectal regions.
Let us extend this statement from the static display of the visual scene described above to dynamic behavior control based on visual information, using the evolutionary behavioral approach [8]. The prey hunting of the naive zebrafish larva analyzed above, with no previous hunting experience, is real-time behavior based on innate mechanisms of movement toward an object recognized by innate detectors as prey. The behavioral repertoire of the zebrafish larva includes hunting behavior, flight from danger, and exploratory behavior. In each case, successive stages can be distinguished: (1) perception of a new sensory signal, (2) choice of the type of behavior, (3) behavioral stages: initial, maintenance, final (in Fig. 2, the left picture represents stages 1 and 2; the next three pictures represent stage 3).
The behavior of a person who finds himself in a new environment is divided into similar stages. Stage 1: a new sensory signal is displayed on the retina and the person makes an orienting movement toward the new object. Stage 2: the choice. (a) A huge impending object: in this case stage 3 is an immediate escape, which can be controlled by the same structures as in the zebrafish larva; (b) a dangerous object whose degree of danger needs additional assessment; (c) an attractive object: then stage 3 is to go to it or to grasp it; (d) a neutral new object: in this case stage 3 is a study of the object; (e) a neutral familiar object: then stage 3 is ignoring it. In all cases except immediate escape, the choice of action, unlike the choice of the zebrafish larva, occurs by comparing the new object, often in a multi-stage, iterative way, with previously encountered, memorized objects. The behavior is also guided by structures other than those of the zebrafish larva.
However, there is an important common part in controlling the behavior of both fish and humans. Behavior, at its core, is what is happening at the moment. And the structures that have come down to us in the course of evolution from common ancestors and that control the current behavior of the fish perform, according to Statement 1 of the evolutionary behavioral approach, the same function in humans. Let us analyze this statement in more detail. The common structure of the fish is the optic tectum; its analog in humans is called the superior colliculus. New human structures are the visual areas of the cortex and other areas involved in the control of behavior based on vision and memory. It is known that the brain of all vertebrates is built according to a single plan, the differences being only quantitative. Fish have both a thalamus and an analogue of the cortex, the pallium. However, in lower vertebrates, those structures that are well developed in higher animals and clearly divided into subdivisions may be poorly differentiated, have a much smaller volume, and for this reason do not provide higher functions.
The tectum is a registering area, so an intermediary is needed for it to interact with the new structures. This intermediary is the thalamic pulvinar nucleus (in humans and primates) or the lateral posterior thalamic nucleus in other mammals; in humans, the pulvinar nucleus makes up 40% of the volume of the thalamus. Let us analyze what functions these structures perform, taking into account the dynamics of behavioral control. Stage 1, the display of a new object on the retina, is approximately the same in humans and in fish. Stage 2 is selection (Table 1).

Table 1. Brain regions that control the stages of a behavioral sequence. Abbreviations: IFT – inferior temporal cortex, AMY – amygdala, TEO – posterior inferotemporal area, STP – superior temporal polysensory cortex area, OFC – orbitofrontal cortex, MPFC – medial prefrontal cortex.

                | New object appearance                  | The choice of action                                             | The maintenance of behavioral sequence
Zebrafish larva | Retina, pretectum, midbrain, hindbrain | Tectum + nucleus isthmi                                          | Tectum + pretectum + nucleus isthmi
Primates        | Retina                                 | Tectum → IFT → AMY; Tectum → Pulvinar → TEO, STP, AMY, OFC, MPFC | Tectum, visual areas, parietal areas, premotor, motor, prefrontal
Fish vision is categorical: depending on the position, size and speed of external objects, a response category is assigned to them. A human has object vision; his external world consists of objects. But in order to control behavior, the human brain also, ultimately, assigns a response category to objects, and it is the same as in fish: "+" objects, "−" objects (Table 1), and neutral new or familiar objects. Unlike the naive zebrafish larva, when classifying objects a human compares the scene in front of his eyes, divided into objects, with the content of his own memory gained during learning. Stage 3, the behavioral stage, also differs in humans from the behavior of the zebrafish larva, but the participation of the superior colliculus/tectum is common to both. The interaction of the tectum with the external environment can be likened to a crawling conveyor belt: images recognized by innate detectors fall onto the fish's conveyor belt, urging it to swim somewhere; episodes of human experience fall onto the human conveyor belt and in the same way prompt actions appropriate to the situation.
To determine which cortical areas the thalamic pulvinar nucleus should interact with in order to ensure the implementation of behavior, it is worth relying on the well-known idea of two streams of visual information processing, ventral and dorsal. It is believed that the ventral stream transforms information about the meaning, the "what", of a visually perceived object, while the dorsal stream represents various interrelated aspects of the spatial location of the object, the "where". The ventral stream goes from the secondary visual areas to the inferotemporal regions, where the representations of objects reside and which in turn provide inputs to the amygdala and orbitofrontal cortex. A parallel stream goes from the
superior colliculus through the pulvinar nucleus to the same inferotemporal regions. The amygdala, insula, and orbitofrontal cortex evaluate the value properties of objects, on the basis of which a behavioral choice is made [10]. That is, the ventral "what" stream is better interpreted as carrying out the choice of the most appropriate action with the object (the "choice of action" column of Table 1), in the course of which the object is also memorized. From the same point of view, the dorsal stream, going through the dorsal posterior parietal cortex to the dorsolateral prefrontal cortex and to the frontal eye field, is dedicated not to finding out the location of the object but to controlling sequential actions with it. For example, to grasp an object it is necessary not only to reach out to it (reaching), but also to prepare the configuration of the fingers in accordance with the visible shape of the object and, depending on the weight of the object as assessed by eye, to grab it with the right effort. The actions of the hand and fingers in each of these three categories should be controlled in parallel and coordinated in time. If the pulvinar nucleus works as a coordinator of these different aspects of visual behavior control, then it should receive their representations from the corresponding cortical areas and send feedback signals there. The presence of such connections has been shown experimentally (Fig. 3).

Fig. 3. Diagram of the main connections of the divisions of the primate pulvinar nucleus with cortical and subcortical areas. Abbreviations: PI – inferior pulvinar nucleus, PL – lateral pulvinar nucleus, PM – medial pulvinar nucleus.

Activation of these areas is divided into stages. Stage 1 is the appearance of the signal on the retina, performed without the participation of the cortex. Stage 2, "selection", is divided into parallel streams: (a) recognition of the image of a previously encountered object in the retinal signal and assignment of a response category to it. There is a quick pulvinar – inferotemporal area – amygdala transfer,
the function of which is to assign objects to the dangerous/non-dangerous category; a person feels this in the form of emotions. (b) In parallel, the signal goes to the temporal regions, activating the exchange of signals between the amygdala, orbitofrontal and medial prefrontal cortex to determine more subtle interactions with the object. Stage 3 is the execution of a sequence of actions with the object. In the case of eye movements, this is controlled by the frontal eye field; for hand movements, by the premotor and motor areas. In both cases, signals are transmitted both from the cortical region (posterior parietal) and from the pulvinar nucleus. In stage 3, the cortical regions should also send feedback signals both to each other and to the pulvinar nucleus. It was found that these signals are exchanged in the θ-rhythm [11].
4 What Are the Operations of the Pulvinar Nucleus When Performing Behavioral Tasks?
Data on people with damage to the pulvinar nucleus and experiments on primates indicate that inactivation of the pulvinar nucleus makes it impossible to maintain selective attention [12].
Selective attention, a definition. The selection, among the objects in the surroundings, of some newly appeared object for subsequent actions with it, and the maintenance of this selection in the course of successive actions with it, is called attention. Operationally, this is ensured by strengthening the weight of the neural representation of this object while simultaneously reducing the weights of the neural representations of surrounding objects. "Bottom-up" attention is such selection when an object appears unexpectedly; in our description this happens at stage 1. "Top-down" attention is such selection during the execution of a behavioral sequence (that is, at stage 3 of the above description).
Studies of the role of the pulvinar in visual attention tasks in primates have shown that the influence of the pulvinar nucleus on the transmission of signals from V4 to TEO is crucial for visual attention [11]. This influence operates through a change in the synchronization of these cortical regions in the θ-rhythm, and the pulvinar nucleus makes them active alternately [13]. Namely, the parietal area LIP is active during increased attention to the visual target, and the visual-motor area FEF is active during saccades, that is, when attention can be transferred to another object.
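The operational definition above can be restated as a one-line reweighting rule; the normalization scheme below is an assumption for illustration only, not a claim about the actual pulvino-cortical computation.

```python
import numpy as np

def attend(weights: np.ndarray, target: int, gain: float = 2.0) -> np.ndarray:
    """Boost the target representation and renormalize, suppressing competitors."""
    w = weights.astype(float).copy()
    w[target] *= gain
    return w / w.sum()

objects = np.array([0.25, 0.25, 0.25, 0.25])   # equally weighted object representations
print(attend(objects, target=2))               # the target gains weight, the others shrink
```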
5 Conclusion
A large body of modern studies of thalamo-cortical interactions concerning the influence of the pulvinar nucleus on the maintenance of visual attention has revealed a clearly hierarchically arranged system of interactions between cortical and subcortical areas. The patterns of synchronization of cortical regions that have been found need to be investigated using computational models that would allow us to formulate new questions useful for a more complete study of the organization of this thalamo-cortical system.
Funding. The review was done within the 2022 state task FNEF-2022-0003 "Research into Neuromorphic Big-Data Processing Systems and Technologies of Their Creation".
References
1. Zhu, S., Goodhill, G.J.: From perception to behavior: the neural circuits underlying prey hunting in larval zebrafish. Front. Neural Circuits 17, 1087993 (2023)
2. Wang, K., Hinz, J., Zhang, Y., Thiele, T.R., Arrenberg, A.B.: Parallel channels for motion feature extraction in the pretectum and tectum of larval zebrafish. Cell Rep. 30(2), 442–453 (2020)
3. Henriques, P.M., Rahman, N., Jackson, S.E., Bianco, I.H.: Nucleus isthmi is required to sustain target pursuit during visually guided prey-catching. Curr. Biol. 29(11), 1771–1786 (2019)
4. Deichler, A., et al.: A specialized reciprocal connectivity suggests a link between the mechanisms by which the superior colliculus and parabigeminal nucleus produce defensive behaviors in rodents. Sci. Rep. 10, 16220 (2020)
5. Schryver, H.M., Straka, M., Mysore, S.P.: Categorical signaling of the strongest stimulus by an inhibitory midbrain nucleus. J. Neurosci. 40(21), 4172–4184 (2020)
6. Fernandes, A.M., et al.: Neural circuitry for stimulus selection in the zebrafish visual system. Neuron 109(5), 805–822 (2021)
7. Felsen, G., Mainen, Z.F.: Neural substrates of sensory-guided locomotor decisions in the rat superior colliculus. Neuron 60(1), 137–148 (2008)
8. Smirnitskaya, I.A.: The thalamic nuclei classification in relation to their engagement in the correction of initial movements. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds.) NEUROINFORMATICS 2021. SCI, vol. 1008, pp. 142–148. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-91581-0_19
9. Smirnitskaya, I.A.: The distributed representation of space in superior colliculus, pulvinar nucleus and parietal cortex. In: Proceedings of the XVI International Conference on Neuroinformatics, 27–31 January 2014, Part 1, Moscow, Russia, pp. 147–154 (2014). (in Russian)
10. Grabenhorst, F., Schultz, W.: Functions of primate amygdala neurons in economic decisions and social decision simulation. Behav. Brain Res. 409, 113318 (2021)
11. Saalmann, Y.B., Pinsk, M.A., Wang, L., Li, X., Kastner, S.: The pulvinar regulates information transmission between cortical areas based on attention demands. Science 337(6095), 753–756 (2012)
12. Kastner, S., Fiebelkorn, I.C., Eradath, M.K.: Dynamic pulvino-cortical interactions in the primate attention network. Curr. Opin. Neurobiol. 65, 10–19 (2020)
13. Fiebelkorn, I.C., Kastner, S.: A rhythmic theory of attention. Trends Cogn. Sci. 23, 87–101 (2019)
The Influence of Anxiety and Exploratory Activity on Learning in Rats: Mismatch-Induced c-Fos Expression in Deep and Superficial Cortical Layers
Alexandra I. Bulava1,2(B), Zhanna A. Osipova1, Vasiliy V. Arapov3, Alexander G. Gorkin1, Igor O. Alexandrov1, Tatiana N. Grechenko1, and Yuri I. Alexandrov1,3,4
1 Shvyrkov Laboratory of Neural Bases of Mind, Institute of Psychology, Russian Academy of Sciences, 129366 Moscow, Russia
[email protected]
2 Moscow Institute of Psychoanalysis, Moscow, Russia
3 Moscow State University of Psychology and Education, Moscow, Russia
4 Department of Psychology, National Research University Higher School of Economics, Moscow, Russia
Abstract. It has been shown that various aspects of cortical activity associated with behavioral adaptation are provided by layer-specific synaptic dynamics, structural and connectional organization, as well as by layer-specific protein levels. In this study we used c-Fos immunolabeling to identify the experience-dependent mismatch specific changes in cortical activity. To determine the parameters of individuality and to assess the extent of relationship between different parameters, we analyzed the behavioral activity of rats in tests based on locomotion and anxiety levels. The results of this study demonstrated that experience-dependent mismatch induced cortical layer-specific changes in activity. Anxiety and exploratory activity were associated with selective changes in the number of Fos-activated neurons in the deep and superficial cortical layers, but were not associated with the total number of Fos-expressing cortical neurons in this area of the brain. We found a significant effect of anxiety and exploratory activity on learning rate. We argue that individual differences in learning can be predicted by the respective behavioral tests to measure exploratory and anxiety-related behavior. Although using neuroscience to develop artificial intelligence (AI) may guide neural network models toward human-like learning, at the moment artificial neural networks differ from the nervous system in many significant functional patterns. In order to create AI with human-like cognitive abilities, neuro- and cognitive sciences should participate in AI research as a part of the joint research program. Keywords: Individual Differences · Anxiety · Exploratory Activity · Mismatch · Behavioral Phenotyping · Learning · Cortical Layers · c-Fos
1 Introduction
Using neuroscience to develop artificial intelligence (AI) may guide neural network models toward human-like learning [1]. For example, a computational framework for reinforcement learning has been developed in which reward prediction errors are used as the key learning signals [2], and as a hierarchy of modules that learn to predict behavior through reinforcement coming from the dopamine system of the midbrain [3]. Within machine learning, researchers study ways of coordinating synaptic updates to improve performance in artificial neural networks: they use deep networks with many layers of neurons, define an error function that quantifies and evaluates the result, and then search for learning algorithms that compute synaptic changes that reduce the error [4]. A learning scheme for hierarchical behavior planning has been proposed using reverse engineering of the corticostriatal system [3].
Nevertheless, it should be noted that the nervous system differs from artificial neural networks in many significant functional respects [5]. For example, there is no mapping between layers in an artificial network and layers in the cortex. Besides, similar cell types differ in gene expression patterns between different cortical areas [6]. Moreover, cortical-subcortical interactions differ too. And finally, there are brain disease studies: how consistent are the existing AI models with the known neuropathological and biochemical changes related to disease progression? That is, AI models in agreement with brain organization should also generate neurocognitive disorders [7] and help researchers understand why corticosteroids produce stress-related disorders, such as depression, in some individuals, while others remain healthy under similar conditions.
The neuroscience data currently available challenge the notion that function can be localized in a specific and restricted area. Behavioral task variables are distributed across multiple cortical and subcortical areas, not localized in single areas. Importantly, each of the subcortical regions developed over evolution to maintain different behavioral priorities through its anatomical connections, but behaviors are associated with global brain activity (see a recent review on cortical-subcortical interactions in goal-directed behavior [8]). This fact, in turn, makes creating AI with brain-like learning a lot more complicated.
Immediate early genes (IEGs) comprise a group of genes that are activated immediately by changes in the extra- and intracellular environment that have not been experienced before, as in the case of the IEG c-fos. Members of the c-Fos family dimerize with the c-Jun protein to form the transcription factor AP-1, which activates the transcription of a variety of genes related to cell proliferation, differentiation, and learning. The c-fos gene and the c-Fos protein are used as a tool to study cell activation with phenotypic changes in nerve cells (see a recent review on the use of c-fos in neuroscience [9]).
In our experiments, we trained the animals on different instrumental food-acquisition tasks and analyzed brain activity by Fos mapping. We analyzed the density of c-Fos-activated cells in the retrosplenial dysgranular cortex (RSD) [10], including analysis along the rostro-caudal axis [11], in appetitive instrumental learning. The experimental groups differed in the degree of mismatch between the new experience and the previously acquired experience, that is, a mismatch between expected and experienced behavioral outcomes.
We found that the number of Fos-expressing neurons in RSD along the rostro-caudal axis was clearly non-homogeneous. In addition, we found significant differences in this cortical activity pattern between the ‘high mismatch’ group and the
‘low mismatch’ group: the ‘low mismatch’ animals showed a significantly lower number of Fos-positive neurons in the caudal part of RSD [11]. The obtained data indicate that Fos mapping of brain activity reflects the processes of reconsolidation of earlier acquired memory and also marks differences in individual experience [10, 11].
Recently, numerous studies have shown that different aspects of cortical function are provided by layer-specific synaptic dynamics, structural and connectional organization [12, 13] and by layer-specific protein levels [14]. We reasoned that identifying cortical layer-specific activity might be a more sensitive approach to the analysis of experience-dependent changes in brain activity.
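Returning to the reward-prediction-error framework cited at the beginning of this Introduction [2], the following minimal TD(0) sketch shows how the error delta = r + gamma*V(s') - V(s) serves as the key teaching signal; the chain environment and all parameters are illustrative assumptions.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)                      # state-value estimates

for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1
        terminal = s_next == n_states - 1
        r = 1.0 if terminal else 0.0        # reward only at the end of the chain
        delta = r + gamma * (0.0 if terminal else V[s_next]) - V[s]   # reward prediction error
        V[s] += alpha * delta
        s = s_next

print(np.round(V, 2))   # values decay with distance from the reward, as gamma predicts
```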
2 Materials and Methods
2.1 Subjects and Behavioral Task
Adult Long-Evans hooded rats around 6 to 8 months of age (160 to 230 g) were housed individually in standard laboratory cages (46 × 30 × 16 cm) at a room temperature of 23 °C on a 12:12 h light/dark cycle with water ad libitum. The rats were food-deprived, but kept at such a level that their weight loss did not exceed 20% of body weight. All animal procedures were in accordance with the National Institutes of Health ‘Guidelines for the Care and Use of Animals for Experimental Procedures’ and were approved by the Russian Academy of Sciences. We used the lowest possible number of animals, and suffering was kept to a minimum (ethics committee decision by the Institute of Psychology RAS, dated July 7, 2021).
At the beginning of the study, we conducted behavioral phenotyping of the experimental animals, involving sequential testing of baseline exploratory activity and anxiety-like behavior. In the present work, we used the Open Field Test (OFT) and the modified Novel Object Recognition Test (mNORT); the specific tests are described under the corresponding headings below. Each test lasted 5 min. The experimental design is illustrated in Fig. 1.
Fig. 1. Schematic diagram of experimental procedures. (a) A frame from the actual video recording during the operant food-acquisition behavior (left); animal movements were identified (red rectangle) and the coordinates were recorded on a PC. (b) Timeline. Rats were sacrificed on the ‘mismatch day’. The brains were removed for immunohistochemical staining.
Open Field Test (OFT). The open field test was used to detect spontaneous motor and exploratory activity in an open field. The test was carried out in a 48 × 48 × 50 cm square chamber. The light intensity was set at 40 to 45 lx and was equal in the different parts of the arena. Vertical activity (‘rearing’) was scored as ‘partial rearing’ when the paws touched the arena walls and as ‘full rearing’ when it occurred without touching the walls. Each rat was placed in the chamber and allowed to explore freely for 5 min. Behavioral parameters such as freezing, grooming, horizontal activity, and vertical activity were evaluated.
Modified Novel Object Recognition Test (mNORT). The NORT is based on the natural tendency of animals to explore novelty (e.g., [15]). The modified novel object recognition test was performed in a square chamber as described for the OFT. The modification was that the novel object was unexpectedly placed on the side opposite to the animal’s location. Behavioral variables during the mNORT included the freeze response, the time interval before movement initiation (for all four paws), object sniffing, and object manipulation (e.g., using paws or mouth). The time spent on each behavioral event (duration, s) was recorded using RealTimer software (RPC Open Science Ltd). Behavioral patterns such as a high freezing level in the OFT and the freeze response/time interval before movement initiation in the mNORT reflected baseline anxiety-like behavior [15–17].
Appetitive Instrumental Learning and Experience-Dependent ‘Mismatch’. Animals were trained in food-acquisition behavior in a 25 × 25 × 50 cm experimental chamber equipped with automatic feeders triggered when the corresponding lever (bar) is pressed. The light intensity was set at 30 to 35 lx in this task. Training was conducted daily in 30-min sessions. The food-acquisition behavioral cycle consisted of several acts: pressing the lever (bar), lowering the head, and taking food from the feeder. We trained rats until they had reached a stable level of performance. The instrumental task was considered acquired if an animal performed ten cycles in a row (the learning criterion, designated by us as ‘learning day #’; see the illustrative sketch at the end of this subsection); this stage was repeated the next day if the animal did not reach the criterion. After the animals were proficient at the task, they practiced it for five days. Therefore, the total number of days spent in the experimental box ranged from 10 to 14. Data recording and analysis were performed using custom software developed by S.V. Volkov (Fig. 1a, [18]). On the final training day, the ‘mismatch’ was modeled by making it impossible to perform this behavior: the lever was removed from the experimental chamber. Seventy-five minutes after this final experimental session, the animals were euthanized and decapitated; the brains were removed, flash-frozen in liquid nitrogen and stored at −72 °C.
Immunohistochemistry (IHC). Serial 16-µm coronal cryostat brain sections were collected on slides across the RSD (−2.64 to −4.80 mm from Bregma, Fig. 2b). c-Fos and NeuN proteins were detected by indirect immunohistochemical techniques (immunoperoxidase staining of formalin-fixed tissue). Primary monoclonal mouse antibodies anti-c-Fos (1:200; E-8, Santa Cruz Biotechnology, USA) and anti-neuronal nuclei NeuN (1:100; A60 MAB377, CA, USA) were used.
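As referenced in the ‘Appetitive Instrumental Learning’ paragraph above, the learning criterion (ten successful cycles in a row) can be detected from a sequence of per-cycle outcomes with a small helper like the one below; the function name and inputs are illustrative assumptions rather than the laboratory's actual software [18].

```python
def reached_criterion(cycle_outcomes, run_length=10):
    """Return True if the boolean sequence contains `run_length` successes in a row."""
    streak = 0
    for ok in cycle_outcomes:
        streak = streak + 1 if ok else 0
        if streak >= run_length:
            return True
    return False

print(reached_criterion([True] * 4 + [False] + [True] * 10))   # True
print(reached_criterion([True, False] * 20))                   # False
```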
Cell Counting. Images of RSD cortical layers I-III and V-VI (identified by NeuN staining, Fig. 2a) were visualized using an Axiostar Plus microscope and captured with an Axiocam camera using Zeiss Efficient Navigation software (Carl Zeiss, Germany). Labeled cells in immunohistochemical sections were counted per 1 mm² using the Image-Pro Plus analysis system (Media Cybernetics, USA). To measure the number of c-Fos-activated cells, at least ten sections per rat were analyzed.
Statistical Analysis. All statistical analyses were performed using Statistica 12.0 (StatSoft Inc., USA). Correlation analysis was performed using the non-parametric Spearman test; the significance level was set at p < 0.05.
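A minimal sketch of the correlation analysis just described, using SciPy's spearmanr instead of Statistica; the per-animal values below are hypothetical.

```python
from scipy.stats import spearmanr

full_rearing_s = [2.1, 5.4, 0.0, 7.8, 3.3, 6.1]   # hypothetical per-animal durations, s
cfos_per_mm2   = [110, 180, 95, 220, 140, 200]    # hypothetical c-Fos(+) cell densities

r, p = spearmanr(full_rearing_s, cfos_per_mm2)
print(f"Spearman r = {r:.3f}, p = {p:.4f}, significant = {p < 0.05}")
```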
3 Results
To determine individuality parameters and to assess the extent of the relationships between different parameters, we analyzed the behavioral activity of rats in the OF and mNOR tests based on measures of locomotion and anxiety levels. The correlations between behavioral parameters and c-Fos expression levels in layers I-III and V-VI of RSD following mismatch are shown in Table 1 and Fig. 2c, d. We found a direct association between exploratory activity (such as full rearing in the OFT and object sniffing in the mNORT) and an increase in the number of c-Fos-positive neurons in layers V-VI, but not I-III. At the same time, freezing in the OFT was associated with a decrease in the number of c-Fos-positive neurons in layers V-VI (see Table 1). Object manipulation (using paws or muzzle) in the mNORT was related neither to sniffing the object nor to total locomotor activity. However, sniffing the object and full rearing were associated with differences in the learning rate of food-acquisition behavior: with high exploratory activity, fewer days were required for training. On the contrary, the time interval before movement initiation in the mNORT correlated directly with the number of days of learning required by the animal. This may mean that a high baseline anxiety level affected the learning of this task.

Table 1. Relationship between behavioral variables and the number of c-Fos-positive neurons in the RSD layers.
Variables | n | r | p
OFT, Partial rearing, s: c-Fos(+) RSD L I-III | 16 | 0.419 | 0.105
OFT, Full rearing, s: c-Fos(+) RSD L I-III | 16 | 0.228 | 0.395
OFT, Partial rearing, s: c-Fos(+) RSD L V-VI | 16 | 0.477 | 0.061
OFT, Full rearing, s: c-Fos(+) RSD L V-VI | 16 | 0.871* | 0.00001
OFT, Freeze, s: c-Fos(+) RSD L V-VI | 16 | -0.549* | 0.027
OFT, Freeze, s: mNORT, Before moving, s | 35 | 0.469* | 0.004
OFT, Freeze, s: mNORT, Freeze Response, s | 35 | 0.647* | 0.00003
mNORT, Object sniffing, s: c-Fos(+) RSD L I-III | 16 | -0.075 | 0.782
mNORT, Object sniffing, s: c-Fos(+) RSD L V-VI | 16 | 0.592* | 0.015
mNORT, Freeze Response, s: mNORT, Before moving, s | 35 | 0.448* | 0.007
Learning day#: mNORT, Before moving, s | 23 | 0.441* | 0.035
Learning day#: Partial rearing, s | 24 | -0.057 | 0.788
Learning day#: Full rearing, s | 24 | -0.559* | 0.004
Learning day#: Object sniffing, s | 24 | -0.573* | 0.003
* significant Spearman correlation (p < 0.05)
Thus, behavioral patterns such as high vs. low anxiety-related behavior and high vs. low exploratory activity, obtained in behavioral tests, might predict individual differences in learning and in other challenging situations. High exploratory activity (taking species ethology into account: sniffing a novel, unexpected object, but not manipulating it) and low anxiety contribute to rapid learning in this task. However, in some cases rats with high anxiety values demonstrated rapid learning, but only if it was accompanied by hyperactivity (e.g., rats that manipulated the object with paws, muzzle, or mouth in the mNORT). Perhaps the hyperactivity of these rats replaced exploratory activity in this type of task.
Fig. 2. (a) Representative photomicrographs of IHC-prepared tissue showing NeuN-stained areas in the layers of the dysgranular retrosplenial cortex (RSD). Coronal sections, 16 µm thick. Scale bar = 200 µm. Examples of NeuN-positive cells are indicated by arrows. (b) Schematic diagram of the rat brain, sagittal and frontal planes [19]; the region of interest is indicated. (c) Graph depicting the correlations between behavioral variables in the OF/mNOR tests and the number of days required for appetitive instrumental learning. # r = -0.559, p = 0.004; ## r = -0.573, p = 0.003. (d) Graphs showing the correlations obtained between behavioral variables and Fos expression levels in RSD. The abscissa represents the number of cells per 1 mm². * r = 0.871, p = 0.00001; ** r = 0.592, p = 0.015.
4 Discussion
Anxiety level is determined in the OFT by quantifying animal movement: the ratio of time spent in the peripheral versus the central zone, as well as the distance traveled in the central versus peripheral regions [16]. Even though the OFT is considered a standardized test, there is variability in parameters such as the size, shape, and level of illumination of the open-field arena [20]. Therefore, behavioral testing does not always yield similar results in different laboratories and/or with different experimenters. Nevertheless, in the open-field test, overall consistent results have been achieved by different laboratories, as long as a standardized protocol of testing procedures was used. In other behavioral tests, significant inconsistencies in the data were obtained [21]. To ensure the reliability of behavioral phenotyping and the reproducibility of results between different laboratories, suitable controls should be involved in animal maintenance and testing procedures.
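For concreteness, the open-field indices just mentioned (time and distance in the central vs. peripheral zone) could be computed from a tracked trajectory roughly as follows; the zone margin, frame rate, and sample trajectory are hypothetical assumptions, not the procedure of [16].

```python
import numpy as np

ARENA = 48.0          # cm, square arena side (as in the Methods)
CENTER_MARGIN = 12.0  # cm from each wall defining the peripheral zone (assumed)

def open_field_indices(xy: np.ndarray, dt: float = 0.04):
    """xy: (T, 2) trajectory in cm; dt: frame interval in s."""
    in_center = np.all((xy > CENTER_MARGIN) & (xy < ARENA - CENTER_MARGIN), axis=1)
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)       # per-frame displacement
    center_time = in_center.sum() * dt
    periphery_time = (~in_center).sum() * dt
    center_dist = step[in_center[:-1]].sum()
    periphery_dist = step[~in_center[:-1]].sum()
    return center_time / max(periphery_time, 1e-9), center_dist, periphery_dist

xy = np.cumsum(np.random.default_rng(2).normal(0, 1.0, size=(1000, 2)), axis=0) % ARENA
print(open_field_indices(xy))
```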
The standardization we applied allowed us to obtain consistent results collected by more than two experimenters. Behavior patterns, such as high vs. low anxiety-related behavior and high vs. low exploratory activity, predicted individual differences in learning rate. We found a significant relationship between the individual differences identified by behavioral phenotyping and layer-specific c-Fos activation of RSD.
Ontogenetic development can be considered as a process of increasing differentiation along with an increase in the number of life events and an expansion of the behavioral repertoire. Research in our laboratory includes recording neuronal spikes from the animal brain during appetitive instrumental behavior. We classified neurons according to how discriminative their firing is in relation to behavioral acts (see [22] for more details). For example, in the RSD most specialized units were classified as belonging to the systems of ‘new’ behavioral acts, whereas relatively more neurons in the motor cortex were attributed to ‘old’ systems (e.g., context-independent neuronal firing during a particular movement). In one of these experiments, recording of single-unit neural activity was used to study the effect of acute ethanol on the motor and limbic cortices of freely moving rabbits. We found that the number of active neurons in the limbic cortex decreased more in layers I-III than in the deep layers; in addition, a shift in the percentage of active neurons from layers I-III to layers V-VI in the motor cortex during alcohol intoxication was revealed [23, 24].
Although the biogenetic law ‘ontogeny recapitulates phylogeny’ is in some respects not consistent with recent advances in developmental biology, the underlying fundamental relationships between phylogeny and ontogeny are indisputable. Indeed, it is known that layers II–VI of the mammalian neocortex are formed in such a way that early-born neurons reside in the deepest layers, whereas later-born neurons migrate past existing layers to form the superficial layers [25]. Also, late-born neurons are less resistant to damaging factors. For example, the pattern of neuropathology following acute intoxication with nerve agents is characterized by necrosis of neurons in the piriform cortex, often affecting layers II and III and involving the deep layers of the adjacent entorhinal cortex; in less affected neocortical areas, neuronal death was restricted to superficial layers I-III [26]. During intoxication, which like typical stress is considered ‘acute stress’, glucocorticoids are secreted; when critical levels are reached in brain regions (such as the hippocampus, neocortex, and amygdala), they can induce certain biochemical alterations and epigenetic changes, including through excessive activation of the hypothalamic-pituitary-adrenocortical system and pro-inflammatory processes [27].
Stress and anxiety are well known to have interrelated neurobiological bases (e.g., [28]). In rodents, anxiety levels have also been shown to be related to decreased cognitive flexibility [29]. Numerous studies have provided evidence of the link between cognitive deficits in anxiety and neocortical dysfunction. In the present study, we found that experience-dependent mismatch induced layer-specific changes in Fos activity in the retrosplenial cortex during instrumental learning. In particular, increased anxiety-like behavior is associated with a decrease in the number of neurons involved in learning in layers V-VI of RSD.
Furthermore, increased exploratory activity is associated with an increase in the number of such neurons in layers V-VI, but not in layers I-III.
Thus, the results presented here support the assumption that the deep layers of the neocortex are dominated by phylogenetically older neurons whose activity is associated with ‘older’, low-differentiated systems (related to spatial localization, exploratory activity, etc.), whereas the phylogenetically ‘younger’ morphological part of the neocortex (layers I-III) consists of a larger number of neurons associated with more highly differentiated experience. It has been shown that under stress and acute alcohol intoxication the subject is characterized by deactualization (i.e., deactivation) of highly differentiated systems [30, 31].
Mismatch is a typical characteristic of conditions such as stress, intoxication, or learning. All these cases are associated with a common mechanism, reversible dedifferentiation, which manifests itself in a shift to ontogenetically ‘older’ memory by blocking ‘younger’ memory and in a narrowing of attention [30, 31]. These aspects of adaptation are interpreted as a survival mechanism formed by evolution. Compared to ‘younger’ memory, ‘older’ memory is cognitively less demanding and more efficient. Perhaps the shift to ‘older’ memory is necessary to save the cognitive resources that are needed to cope with stress [32]. Thus, stress promotes a shift from complex differentiated behavior to the implementation of simpler actions [30–32].
Our conclusions agree with the hypothesis that interventions such as exercise or cognitive tasks induce ‘activity-dependent gene programs’ in various cell types, which are then transformed into functionally younger cell phenotypes by modification of epigenetic regulators and transcription of IEGs (see more details in [33]). Thus, the process of dedifferentiation described here also manifests itself at the level of gene expression that underlies adaptive behavioral modifications. Perhaps individual variations indicate the role of differences in pre-experimental experience in how the brain provides these modifications.
Using mathematical modelling [30, 31], we tested and demonstrated that stress-induced dedifferentiation can reliably accelerate learning in a new task, because the ‘existing experience’ (‘unsuitable’ for the new task) is deactualized and the individual focuses on the current task. Experience- and current-state-dependent algorithms (such as the stress-induced shift to the implementation of simpler actions) are used by the brain to find a better solution. That, in turn, makes the design of AI with brain-like learning quite unlike ultrafast machine learning, even with a deep architecture of multilayer neural networks. Research programs have been proposed to develop models of the psyche and mind in order to reproduce the evolution of the cognitive abilities of mammals from rodents to primates and to humans. In order to succeed in creating AI with human-like cognitive abilities, neuro- and cognitive sciences should participate in AI research as part of a joint research program [3, 5].
5 Conclusions

Behavior patterns, such as high vs. low anxiety-related behavior and high vs. low exploratory activity, can predict individual differences in learning rate. We found a significant effect of anxiety and exploratory activity on learning rate. Experience-dependent mismatch induced cortical layer-specific Fos activity changes during instrumental learning. Anxiety and exploratory activity were associated with selective changes in the number of Fos-activated neurons in the deep and superficial
cortical layers, but were not associated with the total number of Fos-expressing cortical neurons in this area of the brain. Our research has highlighted the importance of a more comprehensive analysis of individual differences and related brain activity. In order to create artificial systems with human-like cognitive abilities, neuro- and cognitive sciences should participate in AI research as a part of the joint research program. Acknowledgments. The research was supported by RSF (project No. 22-18-00435), Institute of Psychology RAS.
References 1. Ullman, S.: Using neuroscience to develop artificial intelligence. Science 363(6428), 692–693 (2019) 2. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018) 3. Shumsky, S.A.: Reverse engineering the brain based on machine learning. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research IV. Studies in Computational Intelligence, vol. 925, pp. 3–9. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-60577-3_1 4. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nat. Rev. Neurosci. 21(6), 335–346 (2020) 5. Fan, J., Fang, L., Wu, J., Guo, Y., Dai, Q.: From brain science to artificial intelligence. Engineering 6(3), 248–252 (2020) 6. Alexandrov, Y.I., Pletnikov, M.V.: Neuronal metabolism in learning and memory: the anticipatory activity perspective. Neurosci. Biobehav. Rev. 137(104664), 1–9 (2022) 7. Genon, S., Reid, A., Langner, R., Amunts, K., Eickhoff, S.B.: How to characterize the function of a brain region? Trends Cogn. Sci. 22(4), 350–364 (2018) 8. Cruz, K.G., Leow, Y.N., Le, N.M., Adam, E., Huda, R., Sur, M.: Cortical-subcortical interactions in goal-directed behavior. Physiol. Rev. 103(1), 347–389 (2023) 9. Lara Aparicio, S.Y., et al.: Current opinion on the use of c-Fos in neuroscience. NeuroSci 3(4), 687–702 (2022) 10. Svarnik, O.E., Bulava, A.I., Alexandrov, Y.I.: Expression of c-Fos in the rat retrosplenial cortex during instrumental re-learning depends on the number of stages of previous training. Front. Behav. Neurosci. 7, 78 (2013) 11. Bulava, A., Alexandrov, Y.: Reconsolidation and cognitive novelty. In: Velichkovsky, B.M., Balaban, P.M., Ushakov, V.L. (eds.) Intercognsci 2020. AISC, vol. 1358, pp. 504–509. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-71637-0_58 12. Wang, L., Kloc, M., Gu, Y., Ge, S., Maffei, A.: Layer-specific experience-dependent rewiring of thalamocortical circuits. J. Neurosci. 33(9), 4181–4191 (2013) 13. Englund, M., Krubitzer, L.: Phenotypic alterations in cortical organization and connectivity on different time scales. Brain Behav. Evol. 97(1–2), 108–120 (2022) 14. MacDonald, M.L., et al.: Laser capture microdissection–targeted mass spectrometry: a method for multiplexed protein quantification within individual layers of the cerebral cortex. Neuropsychopharmacology 44(4), 743–748 (2019) 15. Antunes, M., Biala, G.: The novel object recognition memory: neurobiology, test procedure, and its modifications. Cogn. Process. 13(2), 93–110 (2012). https://doi.org/10.1007/s10339011-0430-z
16. Belovicova, K., Bogi, E., Csatlosova, K., Dubovicky, M.: Animal tests for anxiety-like and depression-like behavior in rats. Interdiscip. Toxicol. 10(1), 40–43 (2017) 17. Landgraf, R., Wigger, A.: High vs. low anxiety-related behavior rats: an animal model of extremes in trait anxiety. Behav. Genet. 32(5), 301–314 (2002) 18. Bulava, A.I., Volkov, S.V., Alexandrov, Y.I.: A novel avoidance test setup: device and exemplary tasks. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research III. Studies in Computational Intelligence, vol. 856, pp. 159–164. Springer, Cham. (2020). https://doi.org/ 10.1007/978-3-030-30425-6_18 19. Paxinos, G., Watson, C., Carrive, P., Kirkcaldie, M.T.K., Ashwell, K.: Chemoarchitectonic Atlas of the Rat Brain, p. 375. Elsevier. USA (2009) 20. Gould, T.D., Dao, D.T., Kovacsics, C.E.: The open field test. Mood and anxiety-related phenotypes in mice. Neuromethods 42, 1–20 (2009) 21. Lewejohann, L., et al.: Environmental bias? Effects of housing conditions, laboratory environment and experimenter on behavioral tests. Genes Brain Behav. 5(1), 64–72 (2006) 22. Alexandrov, Y., Grinchenko, Y., Jarvilehto, T., Laukka, S., Maz, V.N.: Acute effect of ethanol on the pattern of behavioral specialization of neurons in the limbic cortex of the freely moving rabbit. Acta Physiol. Scand. 140, 257–268 (1990) 23. Alexandrov, Y., Grinchenko, Y., Laukka, S., Jarvilehto, T., Maz, V.N.: Acute effects of alcohol on unit activity in the motor cortex of freely moving rabbits: comparison with the limbic cortex. Acta Physiol. Scand. 142(3), 429–435 (1991) 24. Nadarajah, B., Parnavelas, J.G.: Modes of neuronal migration in the developing cerebral cortex. Nat. Rev. Neurosci. 3(6), 423–432 (2002) 25. Siso, S., et al.: Editor’s highlight: spatiotemporal progression and remission of lesions in the rat brain following acute intoxication with diisopropylfluorophosphate. Toxicol. Sci. 157(2), 330–341 (2017) 26. Gulyaeva, N.V.: Stress-associated molecular and cellular hippocampal mechanisms common for epilepsy and comorbid depressive disorders. Biochemistry 86(6), 641–656 (2021). https:// doi.org/10.1134/S0006297921060031 27. Daviu, N., Bruchas, M.R., Moghaddam, B., Sandi, C., Beyeler, A.: Neurobiological links between stress and anxiety. Neurobiol. Stress 11, 100191 (2019) 28. Park, J., Moghaddam, B.: Impact of anxiety on prefrontal cortex encoding of cognitive flexibility. Neurosci. Cognit. Flexibil. Dev. Dis. Treat. 345, 193–202 (2017) 29. Alexandrov, Y.I., Sozinov, A.A., Svarnik, O.E., Gorkin, A.G., Kuzina, E.A., Gavrilov, V.V.: Neuronal bases of systemic organization of behavior. In: Albert Cheung-Hoi, Y., Li, L. (eds.) Systems Neuroscience. AN, vol. 21, pp. 1–33. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-94593-4_1 30. Alexandrov, Y., et al.: Regression I. Experimental approaches to regression. J. Anal. Psychol. 65(2), 345–365 (2020) 31. Alexandrov, Y., et al.: Regression II. Development through regression. J. Anal. Psychol. 65(3), 476–496 (2020) 32. Schwabe, L., Wolf, O.T.: Stress and multiple memory systems: from ‘thinking’ to ‘doing.’ Trends Cogn. Sci. 17(2), 60–68 (2013) 33. Lissek, T.: Activity-dependent induction of younger biological phenotypes. Adv. Biol. 6, 2200119 (2022)
Applications of Neural Networks
Image Processing with Reservoir Neural Network Mikhail S. Tarkov1,2(B) and Victoria V. Ivanova2 1 Rzhanov Institute of Semiconductor Physics SB RAS, Novosibirsk, Russia
[email protected] 2 Novosibirsk State University, Novosibirsk, Russia
Abstract. The reservoir neural network (RNN) is a powerful tool for solving complex machine learning problems. The reservoir is the recurrent part of the network; it is large, with sparse internal connections that are most often set randomly and remain fixed. The idea of the RNN is to train only part of the network using a simple classification/regression technique and to leave most of the network (the reservoir) fixed. At the same time, all the advantages of the RNN are preserved, and the training time is significantly reduced. This work optimized and studied methods that improve the ability of the reservoir to solve image classification problems. These methods are based on transforming the reservoir output data before they are fed to the RNN output layer. Optimal parameter values providing a minimum image classification error were obtained for the Infomax and SpaRCe methods. Using image classification on the MNIST handwritten digit database as an example, it is shown that:
1. Reservoir networks are trained much faster than convolutional networks, although they are inferior to the latter in terms of image classification accuracy.
2. An ESN (echo state network) with a principal component (PCA) projector gives more accurate results than the ESN, Infomax and SpaRCe networks, but is slower.
Keywords: neural networks · convolutional networks · reservoir · image recognition
1 Introduction

The image classification problem is one of the main and most pressing computer vision problems. A person easily identifies an object in an image, but erroneous classification of images by a computer is possible. New algorithms are being created in order to reduce the training time and improve the classification accuracy. Also, some other computer vision tasks (for example, object detection and segmentation) can be reduced to an image classification task. These tasks have many practical applications: face recognition and the diagnosis of patients based on data obtained using microscopy, radiography, angiography, ultrasound and tomography. For an ordinary user, automatic sorting of images on a phone or filters for processing photos may be of interest. As practice shows, algorithms based on machine learning do a good job of classifying images. Various neural network types can be used to solve the image classification
problem. Convolutional neural networks (CNNs) are often used to solve this problem [1, 2]. Convolution and subsampling (pooling) operations reduce the dimension of feature maps and increase their number. With CNN you can move from the specific features of an image to more abstract details. In addition to CNN, one can use multilayer sigmoidal neural networks with direct signal propagation (multilayer perceptron, MLP [3, 4]) or reservoir neural networks ([5–11]). Multilayer sigmoidal neural networks can model almost any function due to the fact that all adjacent layers are fully connected to each other, and all connections are trainable. Reservoir neural networks (RNNs) are a powerful tool for solving complex machine learning problems. The reservoir is a recurrent part of the network, which features a large size and rare internal connections, which are most often set randomly and remain fixed. The RNN idea is to train only part of the network using a simple classification/regression technique and leave most of the network (reservoir) fixed. At the same time, all RNN advantages are preserved, and the training time is significantly reduced. RNNs are often used in tasks where there is a time dependence, since the reservoir is recurrent in nature, due to which it is possible to establish the temporal features of the input sequence. However, these networks can also be used for image classification. Since the connections in the reservoir remain fixed, learning is faster than for CNNs or MLPs. In this paper, we consider several types of neural networks from the class of echo state networks (ESN) using reservoir computing. The image data is transmitted to the network in columns. The work objective is to develop and study methods that improve the reservoir ability to solve image classification problems. These methods are supposed to be based on the transformation of the reservoir output data before they are fed to the RNN output layer. In accordance with the goal, the following tasks were solved: 1. The accuracy and performance of three ESN algorithms and one CNN algorithm were compared. 2. Experiments were carried out with different values of hyperparameters. Main interest is the comparison of two RNN types: Infomax [5] and SpaRCe [6], which independently showed similar results when tested on time series. The work contribution is to compare the Infomax and SpaRCe methods in image classification.
2 Neural Networks for Image Processing An image classification problem is a problem of assigning one label from a fixed set to an input image. There is a set of images containing one object and belonging to some classes in accordance with the object depicted on them. We need to determine which category each image belongs to. A multilayer sigmoidal neural network (MLP) [3] consists of at least three layers: one input layer, one or more hidden layers, and one output layer. The hidden layer nodes are fully connected to the input layer, and accordingly, the output layer is fully connected to the hidden layer. Depending on the modeled function complexity, it becomes necessary to add hidden layers. As new layers are added to the network, the error gradients begin to decrease greatly, leading to the problem of gradient fading. One of the first CNN examples is LeNet-5 [1]. They contain three types of layers: convolutional layers, subsampling (pooling) layers, and fully connected layers (Fig. 1).
Fig. 1. LeNet-5 architecture [1] on the example of the MNIST digit recognition problem
As seen in Fig. 1, convolutional layers and subsampling layers are interleaved. Each of them reduces the image matrix. A convolutional layer is the main CNN building block. The weight matrix, called the convolution kernel, is “moved” along the processed layer, and when the weight matrix is multiplied by a matrix of the same size, the value for the next layer is obtained from the layer fragment. There can be several channels at the output, which increases the number of feature maps. Subsampling compacts a group of pixels down to a single pixel by some non-linear function such as maximum, average, etc. A pooling layer is usually placed between convolution layers. These layers generate image features. At the end, there are a large number of channels that store several parameters. Next the input image is classified based on the detected features. For this, fully connected layers are used. CNN is considered as one of the best image classification algorithms. Compared to a fully connected neural network, where each pixel has its own weight, fewer weights are configured here, since one matrix is used for the entire image. The algorithm is relatively resistant to rotation and shift. However, the convolutional network consists of several hidden layers, which can increase the training time. Another problem is the large number of variable network parameters. All parameters significantly affect the result, but are chosen by the researchers at their will.
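As an illustration of this layer structure, a minimal LeNet-5-style network can be sketched in PyTorch roughly as follows; the layer sizes assume 28 x 28 MNIST inputs and are illustrative assumptions, not the exact configuration of [1]:

import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    """Convolution + pooling blocks followed by fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),             # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of 8 grayscale 28x28 images -> class scores of shape (8, 10)
logits = LeNetLike()(torch.randn(8, 1, 28, 28))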
3 Reservoir Neural Networks A feature of reservoir neural networks is their relative simplicity in structure and training, since the connections of one of the layers (reservoir) do not change during training. At the same time, the RNNs advantages are preserved. The general scheme of the RNN is shown in Fig. 2. The simplest reservoir network should have at least two inner layers: a reservoir and a readout layer. Feedbacks between the readout layer and the reservoir are also allowed. The reservoir is the recurrent part of the network, the main features of which are rare internal connections and large size. The internal connections of the reservoir are most often set randomly and remain fixed. You can use the recurrent nature of the reservoir to recognize the temporal features of the input sequence. When certain restrictions on internal connections are met, the reservoir is not subject to the damped gradient problem. By virtue of its recurrence, the reservoir depends on its own output, which is called the reservoir state. One of the networks that uses the reservoir computing approach is the echo state network (ESN). Varieties of this network are studied in [5–10]. The classical ESN is
Fig. 2. Architecture of a reservoir neural network with feedback
described in the works of H. Jaeger [7, 8] and is one of the simplest variants of reservoir networks. The new reservoir state is calculated from the input, the previous state of the reservoir, and the readout layer output (if there are feedbacks):

Y*(n + 1) = f(Win X(n + 1) + Wr Y(n) + Wofb Z(n)),
Y(n + 1) = (1 − δC) Y(n) + δC Y*(n + 1),

where X(n + 1) is the vector of input data, Win is the matrix of input weights, which depends on the task, Wr is a fixed sparse random matrix describing the connections within the reservoir, Wofb is a matrix describing the connections between the readout layer and the reservoir, Z(n) is the output of the readout layer, δC is the leakage coefficient, which determines the degree of forgetting in the network, and f is the activation function. In this work, the sigmoidal activation function f(u) = 1 / (1 + exp(−u)) is used.

The reading (readout) layer is the module in which the main training takes place. This is usually a neural network or a linear transformation that is tuned according to supervised learning principles. In this work, a linear readout layer is used for all networks, tuned by the ridge regression method (also known as Tikhonov regularization) to minimize the cost function. Ridge regression is used to deal with the correlation of independent variables and to reduce the dimension.

The network has the echo state property if the current reservoir state is uniquely determined by the infinite history of input data, which means that there is no implicit horizon inside the network; accordingly, the damped gradient problem is solved. Jaeger [7] proved that the network has the echo state property if the largest singular value of the matrix Wr is less than 1; for the network not to have the echo state property, it is sufficient that the spectral radius of Wr be greater than 1. According to empirical observations, for the network to work better the matrix Wr should be very sparse (about 1–5% of non-zero elements). However, the classical ESN has fewer trainable weights than neural networks with several fully connected layers, so the classification problem is solved by the ESN with lower accuracy than with such networks.
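A minimal NumPy sketch of this state update and of the ridge-regression readout follows; output feedback Wofb is omitted, and the reservoir size, sparsity, spectral-radius scaling and regularization constant are illustrative assumptions rather than the tuned values used later in the paper:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, leak = 28, 500, 0.3               # input size, reservoir size, leakage dC

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_r = rng.uniform(-0.5, 0.5, (n_res, n_res))
W_r[rng.random((n_res, n_res)) > 0.02] = 0.0   # keep roughly 2% non-zero elements
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))   # rescale the spectral radius

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def run_reservoir(X):
    """X: (T, n_in) input sequence (e.g. image columns); returns the final state."""
    y = np.zeros(n_res)
    for x in X:
        y_star = sigmoid(W_in @ x + W_r @ y)
        y = (1.0 - leak) * y + leak * y_star   # leaky update Y(n+1)
    return y

def train_readout(S, Z, ridge=1e-6):
    """Ridge regression: S (N, n_res) reservoir states, Z (N, 10) one-hot targets."""
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ Z)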
4 Hyperparameter Optimization

Hyperparameters are model tuning parameters that can be optimized; the network is trained for a given set of hyperparameters. For reservoir neural networks, for example, the reservoir size is a hyperparameter. Other examples of hyperparameters are the matrix spectral radius, the matrix sparseness, the random weight distributions, and the leakage factor. In this work, a genetic algorithm [9] was used to determine the optimal hyperparameters (a code sketch of this loop is given after the list):
1. The range in which the parameters may vary is selected.
2. An initial set (generation) of parameter sets (individuals) is generated with random values within the range. Each generation has 200 individuals.
3. For each set, a network is generated and tested on the same data set for all individuals.
4. A sample of the 20 best sets is made and the worst set is selected. The best and worst sets are determined by the classification accuracy of a neural network built with the corresponding set of hyperparameters.
5. A new generation is built from the selected sets: all the sets that were in the sample pass into the new generation, then the worst set is crossed with all the sets from the sample and their descendants enter the new generation. The remaining positions are occupied by descendants of random sets from the sample. When crossing, a 50% chance of inheriting each trait from either parent was used, and each parameter of the offspring had a 10% chance of mutating to a random value within the given range.
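A schematic version of this selection-and-crossover loop is shown below; the parameter ranges and the fitness function are placeholders, and in the experiments the fitness would be the classification accuracy of an ESN built from the given hyperparameter set:

import random

RANGES = {"res_size": (100, 1000), "spec_radius": (0.1, 1.0), "leak": (0.05, 1.0)}

def random_individual():
    return {k: random.uniform(*r) for k, r in RANGES.items()}

def crossover(a, b):
    # 50% chance to inherit each gene from either parent, 10% mutation per gene
    child = {k: (a if random.random() < 0.5 else b)[k] for k in RANGES}
    for k, r in RANGES.items():
        if random.random() < 0.1:
            child[k] = random.uniform(*r)
    return child

def evolve(fitness, generations=20, pop_size=200, elite=20):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        sample, worst = ranked[:elite], ranked[-1]
        nxt = list(sample)                               # the sample survives
        nxt += [crossover(worst, s) for s in sample]     # worst crossed with sample
        while len(nxt) < pop_size:                       # descendants of random pairs
            nxt.append(crossover(random.choice(sample), random.choice(sample)))
        population = nxt
    return max(population, key=fitness)

# fitness(ind) is assumed to build an ESN with the hyperparameters in `ind`
# and return its validation accuracy; any callable of that form works here.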
5 Neural Networks Using PCA Projector

An additional module appears between the reservoir and the readout layer, which, at the training stage, accumulates the reservoir states statistics and projects them onto a space of smaller dimensions (Fig. 3).
Fig. 3. ESN architecture using a dimensionality-reducing projector
Vectors from the reduced space are fed to the reading layer input and the network output is calculated. The idea of the PCA (principal component analysis) method is to select such a linear manifold that would have a lower dimension than the original data set and approximate them with the least deviation [7]. With the correct choice of the ratio of the reduced space dimension and the reservoir dimension, the smallest components
are discarded. Since these components are most likely due to noise, the algorithm becomes more robust to noise.

The Infomax method [5] is based on solving a maximization problem whose objective function is the mutual information [10] between two preselected random variables:

I(X; Y) = H(X) + H(Y) − H(X, Y) = H(Y) − H(Y|X),

where H(X) is the entropy of the corresponding random variable X and H(Y|X) is the conditional entropy. In this work, the mutual information is measured between the sequence of inputs and the reservoir state. Additional adjustable parameters, which are used for the optimization, are introduced into the reservoir state. If there is no noise in the network, the mutual information is directly proportional to the entropy of the reservoir state, and the entropy of the input data is a constant bias that does not affect the optimization, since the input data remain unchanged during training. The conditional entropy describes the noisiness of the network and does not depend on the free parameters. Thus, it is necessary to maximize the entropy of the reservoir state.

In [6], another approach to improving the classification ability of the reservoir is formulated. SpaRCe optimizes the sparsity level of the network by means of learnable thresholds. Both the trainable thresholds and the readout layer weights (but not the recurrent reservoir connections) are optimized by minimizing the cost function. The sparsity is achieved through the trainable response thresholds rather than through penalty terms. Unlike the previous models, which determine the network output by reading the vector Y (the reservoir activity used for learning), an additional threshold variable is introduced for each dimension, and the sparsity level of the states is considered as an additional hyperparameter. The thresholds are tuned by gradient descent rules.
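The PCA projector described above can be sketched as follows (the number of retained components is an illustrative assumption; the Infomax and SpaRCe modifications are not shown):

import numpy as np

def fit_pca(states, n_components=50):
    """states: (N, n_res) matrix of accumulated reservoir states."""
    mean = states.mean(axis=0)
    # principal directions are the right singular vectors of the centred data
    _, _, vt = np.linalg.svd(states - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(states, mean, components):
    # keep only the largest components; the smallest ones are assumed to be noise
    return (states - mean) @ components.T

# Training: accumulate reservoir states, fit the projector, then train the
# linear (ridge) readout on the reduced vectors instead of the raw states.
# mean, comps = fit_pca(train_states)
# readout_inputs = project(train_states, mean, comps)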
6 Experiments

To solve the MNIST data set classification problem (Fig. 4), the following neural networks were implemented in software:
• Classical ESN;
• Classical ESN with PCA projector;
• Infomax model;
• SpaRCe model;
• Convolutional neural network.
The data were divided into several parts: data for the reservoir to become stable; validation data for tuning the reservoir hyperparameters; data for training the reading layer; and data for testing and verifying the final error of the network. For the reservoir neural networks, the following were determined using the genetic algorithm: the reservoir size, the spectral radius, the leakage coefficient, the free parameter of the Infomax network, and the sparsity level of the SpaRCe network. The NRMSE cost function was used as the classification accuracy criterion:

NRMSE = ‖ypredict − yexact‖2 / ‖yexact − ȳexact‖2,
Fig. 4. An example of MNIST image classification by Infomax and SpaRCe networks. (a) Real MNIST labels on this set; (b) MNIST labels obtained by the Infomax classifier; (c) MNIST labels obtained by the SpaRCe classifier.
where yexact is the desired result and ypredict is the actual result. Thresholds in the SpaRCe method were calculated using gradient descent. The classification accuracy was calculated as the ratio of the number of correctly identified images to the number of all images. The program that implements the networks is written in Python using the open libraries NumPy, SciPy, PyTorch and EchoTorch. Tools from the Python standard library were used to measure the running time of each network. The work of several image classification methods is compared. The results of comparing the various classifiers are presented in Table 1:

Table 1. Results of various ESN models

Network          Error, %   Training time, sec
Classical ESN    3.66       344.19
ESN with PCA     1.73       593.28
Infomax          2.79       358.75
SpaRCe           2.15       376.84
CNN              1.22       754.79
The experiments were carried out without dependence on time, since convolutional networks do not support this possibility. The convolutional neural network gives the most accurate results among the presented models, but at the same time it is much slower to learn than the rest. ESN with a PCA projector is slightly inferior in accuracy
to a convolutional network, but in terms of training time it is much faster. Classical ESN learns the fastest, but it also has the largest classification error. Infomax and SpaRCe differ slightly in results: the Infomax network is slightly faster but less accurate than SpaRCe. At the same time, in terms of recognition accuracy, the SpaRCe network is almost as good as ESN with a PCA projector.
7 Conclusion

The following networks were implemented in software: a convolutional neural network and four variants of reservoir neural networks (classical ESN, ESN with PCA projector, Infomax and SpaRCe). On the problem of image recognition from the MNIST set, these networks were compared in terms of recognition accuracy and speed. It was shown that:
• The convolutional neural network has the highest recognition accuracy, but is much slower than the reservoir networks.
• The ESN with PCA projector gives more accurate results than the ESN, Infomax and SpaRCe networks, but is slower.
• The SpaRCe network is more accurate than the ESN and Infomax networks, but is slower.
• In terms of recognition accuracy, the SpaRCe network is almost as good as the ESN with PCA projector.
• In terms of recognition accuracy, the ESN with PCA projector is almost as good as the convolutional network.
References
1. Golovko, V., Egor, M., Brich, A., Sachenko, A.: A shallow convolutional neural network for accurate handwritten digits classification. In: Krasnoproshin, V.V., Ablameyko, S.V. (eds.) PRIP 2016. CCIS, vol. 673, pp. 77–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-54220-1_8
2. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
3. Hadj-Youcef, A.: Deep learning for image classification w/ implementation in PyTorch. https://towardsdatascience.com/convolutional-neural-network-for-image-classification-with-implementation-on-python-using-pytorch-7b88342c9ca9. Accessed 5 May 2023
4. Rumelhart, D.E., McClelland, J.L. (eds.): Parallel Distributed Processing: Explorations in the Microstructures of Cognition. MIT Press, Cambridge (1986)
5. Tarkov, M.S., Chernov, I.A.: Time series prediction by reservoir neural networks. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2020. SCI, vol. 925, pp. 303–308. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-60577-3_36
6. Manneschi, L., Lin, A.C., Vasilaki, E.: SpaRCe: improved learning of reservoir computing systems through sparse representations. IEEE Trans. Neural Netw. Learn. Syst. 1–15 (2021)
7. Lukoševičius, M., Jaeger, H.: Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149 (2009)
8. Jaeger, H.: The "echo-state" approach to analysing and training recurrent neural networks. Technical report 148, GMD – German National Research Institute for Computer Science (2001)
9. Sivanandam, S.N., Deepa, S.N.: Introduction to Genetic Algorithms. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-73190-0. ISBN 978-3-540-73189-4 10. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995) 11. Schaetti, N., Salomon, M., Couturier, R.: Echo state networks-based reservoir computing for MNIST handwritten digits recognition. In: 2016 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC) and 15th International Symposium on Distributed Computing and Applications for Business Engineering (DCABES), Paris, France, pp. 484–491 (2016)
Investigation of a Spike Segment Neuron in the Offline Multi-Object Tracking Task with Embeddings Constructed by a Convolutional Network Ivan Fomin1(B) , Anton Korsakov1,2 , Viktoria Ivanova1,2 , and Aleksandr Bakhshiev2 1 Russian State Scientific Center for Robotics and Technical Cybernetics,
21 Tikhoretsky Pr, St. Petersburg 194064, Russia {i.fomin,a.korsakov,v.ivanova}@rtc.ru 2 Peter the Great St. Petersburg Polytechnic University, 29, Polytechnicheskaya, St. Petersburg 195251, Russia
Abstract. The multi-object tracking and matching task is one of the key problems in video analysis, especially in security video surveillance systems. There are various approaches to this problem, including a fairly large number of classical algorithms. Modern approaches focus on the detection of objects by neural networks and the comparison of objects from the previous frame with objects from the new frame. Offline tracking assumes that the coordinates of the objects are known in advance for all frames. It is also assumed in this work that part of each trajectory is labelled beforehand. The application of a convolutional neural network for translating images into an embedding space with a small number of features, and of a compartment spiking neural network for classifying examples in the embedding space, is considered. The ability of the spiking neural network to learn from a single example or a small number of examples allows objects to be identified with an accuracy of up to 48% when 50% of the trajectory length is used for training. Only one example from the training part of the trajectory is used to train the compartment spiking neuron model network (CSNM-net).
Keywords: neural networks · compartment spiking neuron · multi-object tracking · offline tracking · object recognition · video surveillance
1 Introduction

The problem of detecting and tracking people in a video image from a stationary or mobile camera has been known for a long time. This problem involves various complicating factors, such as harsh weather or lighting conditions, crossing trajectories, obstacles and so on. The trajectories of people's movement can change, disappear and appear. The fundamental problem is to find a way to generate a feature vector and to create a reliable, frame-invariant way to compare the known part of a trajectory with its possible continuation.
Object matching, or re-identification, is a rather specific task in video surveillance; it is required when it is necessary not only to detect movement or an intruder in a given area of the perimeter, but also to track it for a long time. This can be used to identify suspiciously long-term loitering in some area [1] or to compile an approximate route of a person or a car in an area covered by a number of video surveillance cameras. Multiple object (in this case, people) tracking presupposes information about their positions in the frame, obtained manually or with the help of one or another detection method. Tracking is divided into online and offline tracking [2]. In the first case, it is assumed that each processed frame is new, and the algorithm can use only data on the position and location of objects extracted from previous frames. With offline tracking, information from previous and subsequent frames can be used to train and improve the quality of the predictions. In this paper, we consider offline tracking with some elements of an online approach: the markup for some part of each trajectory is known (from 10% to 90%), and the task of the network is to predict whether an object from the test part belongs to the continuation of a particular trajectory from the training part and to one of four "metaclasses" (individual people who move in the camera field of view during recording).

There are many single and multiple object tracking approaches based on different algorithms [3]. A promising way to solve the MOT problem is to use CNNs to process two or more images with a standard or modified network to extract an embedding vector [2]. Matching is usually performed over such vectors. The network training process is organized so that the behavior of the embedding vectors and their distribution in the embedding space are convenient for solving the problem that the author of the algorithm has set. Standard matching approaches are based on the Euclidean or cosine distance between feature vectors in the embedding space. Here we propose a bioinspired approach to feature vector matching. In [4], an original biosimilar compartment spiking neuron model (CSNM) was considered, and it showed good performance in the classification problem. The model is able to successfully classify even complex, linearly non-separable data, provided that the embedding vectors are relatively small and compact.

This paper presents an accelerated method of training a convolutional neural network for translating images into the embedding space. The method of information encoding for a spiking network is shown, and the results of people identification by a spiking network in the embedding space are presented. Our main contributions are:
• A method for converting an image into an embedding space of small dimension, based on a siamese convolutional neural network, is proposed and experimentally evaluated;
• An approach to training a siamese convolutional neural network on a small dataset is proposed, which significantly increases the convergence rate and the quality of the resulting clusters of sample points in the embedding space;
• The ability of a combination of the two networks to divide images into groups based on appearance characteristics that were not directly taken into account anywhere in the network training process is shown.
2 Siamese Networks in Object Tracking Task

In the tracking and matching task, the output of the "backbone" (feature extraction) part of a siamese convolutional neural network (CNN) is typically used directly to assess vector similarity instead of the usual Euclidean (L2) distance. The main idea of siamese networks [5] is to train a neural network using information from two or more images so as to distinguish them in the best way. Usually, the feature extraction part is common for all images; it is followed by various methods of comparison, matching and separation. In [6], a siamese network trained with a contrastive loss function is proposed; it receives 2 images with their IoU as input and gives a similarity score as output. In [7], the network receives image fragments and evaluates the similarity score between them. In [8], a triplet loss function is considered, similar to the one that will be used in this paper. In [9], the authors take the next step and, by adding another example, form a quadruplet loss function, with 3 examples of one object at the input for a sequence of 3 frames and 1 example of another object. In operation mode, the method uses a trained descriptor network to construct embeddings and calculates the similarity between them. In [10], comparison is performed on a 128-dimensional embedding vector at the input of the triple convolutional network described in [11]. In [12], a convolutional network is proposed for predicting a new position; it is able to extract representations from image fragments to refine the comparison when a new frame is received. In [13], a matching algorithm using GoogLeNet trained with a triplet loss is shown. The R-FCN [14] network is used to generate hypotheses about the position of an object based on the trajectory and features from previous frames; the hypotheses are then combined with real detections, and the trained GoogLeNet is used for refinement. A comprehensive review and comparison of the presented papers and many others devoted to the MOT problem can be found in [2].

Based on the considered works and other surveys, it can be concluded that the use of convolutional networks to extract feature vectors of the desired size is a generally accepted approach that is relevant for this work. Among the considered approaches to training a convolutional network for embedding vector formation, the triplet loss was chosen as the best way to train the network to form generalized, well-separated clusters for each class in the embedding space. It is common in modern matching approaches, both on its own and as part of more complex approaches, for example in the face recognition task [15], the person re-identification task [16] and others.
3 Compartment Spiking Neuron Model (CSNM)

In the framework of this work, the Compartment Spiking Neuron Model (CSNM) was used [4]. It is assumed that spikes arrive at the input and are converted in the synapses into an output value that reflects the effect of the synaptic current on a segment of the neuron membrane (Fig. 1). We suppose that neurons exchange information through events (spikes), which can be represented by the formula

x_i = 1 if t ∈ [t_i; t_i + Δt], and x_i = 0 if t ∉ [t_i; t_i + Δt],   (1)
Fig. 1. Compartmental spiking neuron model structure. Seij l – the input synapse l, linked to the dendrite segment Dij ; Dij – j-th segment of the i-th dendrite; Bk – k-th segment of the neuron body
where x_i is the output of the neuron in the i-th time interval and t is the time. The presented structural organization of the neuron model makes it possible to solve the system of differential equations for each neuron independently. Neurons exchange information exclusively about the moments of spike occurrence, and the input vector of the equation system in the neuron model is formed inside the model as a function of the spike arrival times. The size of the neuron (the number of soma segments), the set of dendrite lengths, and the number of excitatory and inhibitory synapses for each segment of the membrane determine the structure of the neuron that forms the required response to input influences. This neuron is characterized by a special method of structural learning: for a given number of soma segments, a synchronization procedure is first performed, in which the lengths of the dendrites are sequentially increased to ensure the simultaneous arrival of the signals at the soma. Then a normalization procedure is performed, in which the number of synapses is increased to compensate for the drop in signal level as it passes through the elements of the dendrite. The combination of these operations provides tuning to the pulse pattern.
4 Dataset and Training Parameters

As a source of initial data, a video recording from a stationary surveillance camera of poor quality with weak color reproduction was used. The total duration of the recording is about 8 min. In the video, 4 people enter the frame and go beyond its boundaries, passing through various areas of the frame and performing various actions. To simplify the task, all actions associated with changing the body position (bending, squatting, lying down) or speed (running) were excluded from consideration, leaving only walking in various areas and directions. In total, 34 trajectories belonging to 4 different people were extracted; the full number of examples is 2435. A separate "trajectory" here refers to a set of images of a person from the moment he appears in the frame to the moment he leaves it. If the same person enters the frame again, a new sequence of frames and a new trajectory are formed.
The collected data were distributed into training and test datasets according to the following principle: a certain percentage of the images of each trajectory was assigned to the training set and all the rest to the test set. All collected data were split 7 times into training and test sets: 10%/90%, 25%/75%, 40%/60%, 50%/50%, 60%/40%, 75%/25% and 90%/10%. In real tasks we want to use as small a training dataset (part of the trajectory) as possible, so different proportions should be investigated to determine the smallest sufficient ratio.

The well-known convolutional neural network ResNet-18 [17] with residual connections was used to transform images of people into the embedding space. The network contains 18 blocks with branching into several paths with different feature sizes and bypass connections. Version 18 is relatively small and applicable on different devices, including embedded versions with hardware acceleration of neural computations. The network structure with residual connections allows the data to bypass some layers, so the network can extract features at different scales. In some other architectures, in order to use different scales, feature pyramid approaches and special networks such as FPN [18] are used to extract features at different image scales before passing them to the convolutional layers. To use it in this work as an encoder of points into the embedding space, the last FC-softmax layer of 1000 neurons in the ResNet-18 network, pre-trained on the ImageNet dataset, was replaced by an FC layer of two linear neurons (the weighted sum of inputs with a linear bias). Their outputs correspond to coordinates in the embedding space.

The well-known triplet loss approach is used as the CNN training loss function. The essence of this approach is as follows. The input data are three images, two of which belong to the same class (in this work, one trajectory), and the third belongs to any other class. The first image is considered as an "anchor" (target class), the second as a "positive" (also the target class), and the third as a "negative" (any other class). All three images are processed by the embedding network, after which the three extracted vectors in the representation space are fed to the loss function. The purpose of the function is to minimize the distance in the embedding space between the points belonging to the anchor and positive images, and to maximize the distance between the points belonging to the anchor and the negative image. Mathematically, this is expressed by Eq. (2), where d(., .) is the L2 norm between vectors in the embedding space, a, p, n are the vectors of the anchor, positive and negative images in the embedding space, mp, mn are the multipliers of the positive and negative effects, and m is a term determining the spread of the points:

L(mp, mn, m, a, p, n) = max{mp d(a, p) − mn d(a, n) + m, 0}   (2)

When training the network in the considered works with triplet loss, it is allowed to select negative examples from any part of the training dataset. With a standard triplet loss on an unchanged set of data, the examples transferred to the embedding space converge to two big clusters, as shown in Fig. 2. In this case, more than half of the examples in the batch have d(a, p) ≪ d(a, n), and most of the triplets have a loss lower than or equal to zero. At a certain point, convergence of the loss function comes down to grouping all classes around one or two certain points in the embedding space.
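A direct PyTorch transcription of the loss in Eq. (2) could look roughly as follows; the values of mp, mn and the margin m are placeholders, not the values used in training:

import torch
import torch.nn.functional as F

def weighted_triplet_loss(anchor, positive, negative, m_p=1.0, m_n=1.0, margin=0.2):
    """anchor, positive, negative: (batch, emb_dim) embedding vectors."""
    d_ap = F.pairwise_distance(anchor, positive)   # L2 distance d(a, p)
    d_an = F.pairwise_distance(anchor, negative)   # L2 distance d(a, n)
    return torch.clamp(m_p * d_ap - m_n * d_an + margin, min=0.0).mean()

# Usage: embeddings come from the modified ResNet-18 with a 2-neuron FC head, e.g.
# loss = weighted_triplet_loss(net(a_img), net(p_img), net(n_img))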
Fig. 2. Example of embedded points spread with standard (left) and modified (right) training approaches
To compensate for this, a mechanism for a weighted choice of negative examples based on the distance between the class centroids is proposed. This is similar to so-called hard negative mining, in which mostly misclassified examples are selected for training. At the end of each learning epoch, a centroid in the embedding space is calculated for each class, and only a certain percentage of the nearest classes is used to select negative examples for each class. This approach makes it possible to significantly increase the rate of convergence of the loss function and to obtain a confident separation of points into clusters by the 50th–100th epoch of training (see Fig. 2, right), whereas without this modification convergence is either absent or reached only by the 400th–500th epoch.
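One possible implementation of this weighted choice of negative classes is sketched below; the fraction of nearest classes that is kept is an illustrative assumption:

import numpy as np

def nearest_negative_classes(embeddings, labels, keep_fraction=0.2):
    """Return, for each class, the closest other classes by centroid distance."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # exclude the class itself
    k = max(1, int(keep_fraction * (len(classes) - 1)))
    order = np.argsort(dists, axis=1)[:, :k]        # indices of the k nearest classes
    return {c: classes[order[i]] for i, c in enumerate(classes)}

# At the end of each epoch, negatives for class c are then drawn only from
# nearest_negative_classes(...)[c], similar to hard negative mining.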
5 Experiments

For each of the considered training/test dataset ratios, the ResNet-18 neural network was trained to translate images into the embedding (feature) space. To train the CSNM network, all points in the embedding space were normalized into the interval [0; 0.2], and a calibration input for the CSNM neuron was added, identically equal to 0.2. The following scheme of the spiking neural network was chosen for training: for each class (trajectory) of the training set, one neuron is allocated and designed to respond to images of the corresponding class. In total there are 34 neurons with 3 inputs (dendrites) each, where 2 dendrites receive spikes with delays proportional to the coordinates in the embedding space, and the 3rd (calibration) dendrite receives a spike with a delay of 0.2. An important parameter of the CSNM neuron operation is the sensitivity (threshold) of neuron activation. If the threshold is too low, neurons of different classes "compete" with each other in response speed for different points, and the fastest neuron, which is not always the one closest to the cluster centroid, is marked as the resulting class. If the threshold is too high, the neuron does not activate for points that belong to the right class but lie far enough from the centroid. A threshold of 0.015 was selected experimentally to obtain optimal quality metrics; all results are shown for this threshold. In total, 7 experiments were conducted, corresponding to the 7 ratios of the number of training and test examples in the dataset. For each experiment, the classification quality of individual classes (trajectories) and the quality of recognition of metaclasses (individual people) were evaluated for the training and test sets.
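The input-encoding step described above can be sketched as follows; this covers only the conversion of embedding points into spike delays for the three dendrites, not the CSNM dynamics themselves, which follow [4]:

import numpy as np

def to_spike_delays(points, low=0.0, high=0.2, calibration=0.2):
    """points: (N, 2) embedding coordinates; returns (N, 3) spike delays,
    where the last column is the calibration spike, identically equal to 0.2."""
    p_min, p_max = points.min(axis=0), points.max(axis=0)
    scaled = low + (points - p_min) / (p_max - p_min + 1e-12) * (high - low)
    calib = np.full((len(points), 1), calibration)
    return np.hstack([scaled, calib])

# Each row is then presented to the 34 CSNM neurons (one per trajectory);
# the neuron that responds determines the predicted class.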
Because in this work the MOT problem is considered as an embedding vector (point) classification problem, we prefer to use standard classification quality metrics based on the numbers of true and false classifications: precision (Pr), recall (Rec) and F1-score, with the formulas given in (3). The numbers of true and false classifications (TP, TN, FP, FN) are required to evaluate the results, so below we describe how we determined them for both experiments.

Pr = TP / (TP + FP),   Rec = TP / (TP + FN),   F1 = 2TP / (2TP + FP + FN)   (3)

For each neuron reacting to a certain trajectory, the quality indicators were counted as follows: TP – the neuron reacts and the point belongs to the correct trajectory; FP – the neuron reacts and the point belongs to another trajectory; FN – the neuron does not react and the point belongs to the correct trajectory; TN – the neuron does not react and the point belongs to another trajectory. When evaluating the recognition of metaclasses, the quality indicators were counted similarly, but taking into account the fact that several neurons can respond to one metaclass. Here TP means that the point belongs to the metaclass and any of the neurons of the metaclass reacts to it; FP – the point does not belong to the metaclass, but at least one of the neurons of the metaclass reacts to it; FN – the point belongs to the metaclass, but none of the neurons of the metaclass reacts to it; TN – the point does not belong to the metaclass and none of the neurons reacts to it.

After an experimental study of the choice of CSNM parameters, the neuron activation threshold was set to 0.015. The spiking neuron cluster and metacluster classification results are summarized in Table 1. With an increase of the training set size, the overall accuracy of recognizing individual classes increases, because the number of points for estimating the centroid grows and the dataset becomes more representative. The quality of metaclass recognition is at a high level for all ratios, as expected. A target of the research was to choose the minimal appropriate size of the training set. According to the quality obtained, a 40% training by 60% testing split is the minimal appropriate one for tasks where both the average trajectory length and the overall dataset size are small (15–100 examples per trajectory and about 2500 overall in our experiments).
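The metaclass metric computation can be sketched as follows (responses, true_meta and neuron_meta are hypothetical names for the neuron firing matrix and the label arrays; they are not taken from the original code):

import numpy as np

def prf1(tp, fp, fn):
    pr = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return pr, rec, f1

def metaclass_scores(responses, true_meta, neuron_meta):
    """responses: (N, 34) boolean firing matrix, true_meta: (N,) metaclass ids,
    neuron_meta: (34,) metaclass id of each trajectory neuron."""
    scores = {}
    for m in np.unique(neuron_meta):
        fired = responses[:, neuron_meta == m].any(axis=1)   # any neuron of m fires
        is_m = true_meta == m
        tp = np.sum(fired & is_m)
        fp = np.sum(fired & ~is_m)
        fn = np.sum(~fired & is_m)
        scores[m] = prf1(tp, fp, fn)
    return scores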
Table 1. Measurement results of individual trajectory and metaclass recognition

Train and test   Training set                             Testing set
set division     Trajectories       Metaclasses           Trajectories       Metaclasses
                 Pr    Rec   F1     Pr    Rec   F1        Pr    Rec   F1     Pr    Rec   F1
10%/90%          0.79  0.86  0.82   0.95  0.96  0.95      0.20  0.24  0.22   0.83  0.85  0.84
25%/75%          0.70  0.78  0.74   0.94  0.95  0.95      0.35  0.37  0.36   0.89  0.85  0.87
40%/60%          0.81  0.90  0.85   0.90  0.95  0.92      0.43  0.49  0.46   0.89  0.94  0.91
50%/50%          0.88  0.91  0.89   0.98  0.97  0.97      0.48  0.51  0.49   0.90  0.91  0.90
60%/40%          0.88  0.90  0.89   0.96  0.96  0.96      0.53  0.56  0.54   0.91  0.91  0.91
75%/25%          0.65  0.80  0.71   0.92  0.98  0.95      0.48  0.58  0.53   0.91  0.94  0.93
90%/10%          0.66  0.83  0.75   0.85  0.96  0.94      0.51  0.60  0.55   0.71  0.71  0.71
6 Conclusions

The paper considers the application of a combination of a convolutional neural network, which translates images into a small-sized embedding space, and a CSNM network in the image recognition problem for offline multi-object tracking. It shows a sufficiently high quality of matching to individual trajectories (from 20% to 53%, depending on the ratio of the training and test sets) and a very high quality of determining that an example belongs to a metaclass as a set of trajectories of one person (from 71% to 91%, depending on the train/test ratio). The results are very good for a method (neuron) trained using only one example. Finally, a training set of about 30–40 examples per object is enough to train and classify objects in a dataset of up to 5000 samples using the presented combination of a siamese convolutional neural network and a CSNM spiking neuron network. In the future, we plan to generalize the proposed training and integration methodology to more complex and extensive data sets (MOT, KITTI and BDD), expecting an increase in generalizing ability with the growth of the set. It is also planned to apply the approach to classify objects identified by classical methods within the framework of the environment estimation task for a mobile robot using an RGB-D camera.

Acknowledgements. The work was carried out as part of the state task of the Russian Ministry of Education and Science for 2023 "Research of methods for creating self-learning video surveillance systems and video analytics based on the integration of technologies for spatiotemporal filtering of video stream and neural networks" (FNRG 2022 0015 1021060307687-9-1.2.1 №075-01595-23-00).
References 1. Nayak, R., Behera, M.M., et al.: Deep learning based loitering detection system using multi-camera video surveillance network. In: 2019 IEEE International Symposium on Smart Electronic Systems (iSES)(Formerly iNiS), pp. 215–220. IEEE (2019) 2. Ciaparrone, G., Sánchez, F.L., et al.: Deep learning in video multi-object tracking: a survey. Neurocomputing 381, 61–88 (2020) 3. Luo, W., Xing, J., et al.: Multiple object tracking: a literature review. Artif. Intell. 293, 103448 (2021) 4. Bakhshiev, A., Demcheva, A., Stankevich, L.: CSNM: the compartmental spiking neuron model for developing neuromorphic information processing systems. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds.) NEUROINFORMATICS 2021. SCI, vol. 1008, pp. 327–333. Springer, Cham (2022). https://doi.org/10.1007/9783-030-91581-0_43 5. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) Computer Vision. Lecture Notes in Computer Science, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/ 10.1007/978-3-319-48881-3_56 6. Kim, M., Alletto, S., Rigazio, L.: Similarity mapping with enhanced Siamese net-work for multi-object tracking. Preprint arXiv:1609.09156 [cs] (2017)
7. Wang, B., Wang, L., et al.: Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association. In: Proceedings of the IEEE CVPR Workshops, pp. 1–8 (2016) 8. Zhang, S., et al.: Tracking persons-of-interest via adaptive discriminative features. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 415–433. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_26 9. Son, J., Baek, M., Cho, M., Han, B.: Multi-object tracking with quadruplet convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5620–5629 (2017) 10. Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., Yang, M.-H.: Online multi-object tracking with dual matching attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 379–396. Springer, Cham (2018). https://doi. org/10.1007/978-3-030-01228-1_23 11. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. Preprint arXiv:1703.07737 [cs] (2017) 12. Zhou, H., Ouyang, W., et al.: Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking. IEEE Trans. Circuits Syst. Video Technol. 29(4), 1011–1022 (2018) 13. Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME) 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2018) 14. Dai, J., Li, Y., He, K., Sun, J., Fcn, R.: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409 (2016) 15. Boutros, F., Damer, N., Kirchbuchner, F., Kuijper, A.: Self-restrained triplet loss for accurate masked face recognition. Pattern Recogn. 124, 108473 (2022) 16. Ye, H., Liu, H., Meng, F., Li, X.: Bi-directional exponential angular triplet loss for RGBinfrared person re-identification. IEEE Trans. Image Process. 30, 1583–1595 (2020) 17. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 18. Lin, T.Y., et al.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Realization of Super-Resolution Using Bicubic Interpolation and an Efficient Subpixel Model for Preprocessing Low Spatial Resolution Microscopic Images of Sputum I. G. Shelomentseva(B) Krasnoyarsk State Medical University Named After Professor V.F. Voino-Yasenetsky, 1, Partizan Zheleznyak Ave., Krasnoyarsk 660022, Russia [email protected]
Abstract. Medical imaging explores methods and models for analyzing medical image data; however, the low resolution of images obtained using equipment with small lenses and a short focal length may limit the implementation of medical data recognition. A variety of super-resolution models and methods implement the preprocessing of low spatial resolution images in medical imaging. The paper investigates the problem of preprocessing microscopic images of sputum containing small-sized objects of interest using two super-resolution approaches: bicubic interpolation and the efficient sub-pixel convolutional neural network (ESPCN) model. The performance of the selected models and methods is evaluated using the PSNR criterion. The obtained results show that both approaches can be used for the super-resolution problem on microscopic images of sputum containing small objects of interest.
Keywords: super-resolution · ESPCN · bicubic interpolation · microscopic image
1 Introduction

In pattern recognition theory, super-resolution is understood as the task of increasing the resolution of the image under study. This task is relevant for the visualization of microscopic images, satellite imaging, and video stream processing, i.e. for situations where the quality of the source material may be quite low or where it is necessary to find a balance between the resolution of the result and the processing time, for example in MRI imaging [9]. The types of super-resolution include single-image super-resolution (SISR), multi-image super-resolution (MISR), video stream super-resolution, stereo image super-resolution, and satellite imagery super-resolution [6]. Super-resolution is a non-trivial task, since there may be several solutions for the same image, and to reduce the search space it is necessary to provide reliable a priori information. As the zoom factor increases, the probability of reproducing incorrect
information increases, and so does the processing time. Another disadvantage of super-resolution is the use of the PSNR and SSIM criteria to assess the quality of image resolution restoration, since these criteria correlate weakly with the visual assessment of the image by a person [9]. If we denote the low-resolution image as LR, the high-resolution image as HR, and the predicted image as SR, then the super-resolution problem can be described by formula (1) [11]:

SR = D⁻¹(LR, θ),  (1)
where D⁻¹ is the inverse of the degradation function and θ denotes its parameters. The degradation function itself is unknown; it can be affected by image noise, blurring, scaling, etc. Therefore, most researchers treat super-resolution as a search for an inverse function, which implies a gradual degradation of the desired HR (formula 2) [9]:

LR = (HR ⊕ k)↓s + σ,  (2)
where ⊕ is the convolution operation with blur kernel k, ↓s is the downsampling operation, and σ is the standard deviation of the additive white Gaussian noise. There are three main approaches to super-resolution: interpolation-based super-resolution (missing pixels are interpolated from existing pixels), reconstruction-based super-resolution (a priori knowledge of the image is used), and machine-learning-based super-resolution (training datasets containing pairs of low- and high-resolution images must be prepared). Single-image super-resolution using machine learning is itself represented today by several approaches and methods. Early models such as SRCNN and VDSR are based on bicubic interpolation and pre-upsampling technologies [4]. Post-upsampling models (ESPCN, FSRCNN) extract features in low-resolution space to speed up learning and reduce computation, and the upsampling is performed in the late layers using sub-pixel convolution [2]. Residual networks (EDSR, MDSR, CARN, BTSRN) modify the ResNet architecture for the super-resolution problem. The early residual-network approach removed part of the layers (batch normalization and ReLU activation) from the residual blocks to improve performance; the modern approach uses multi-stage residual networks based on the encoder-decoder principle [5]. Recursive networks (DRCN, DRRN) obtain an image with an increased resolution coefficient by using recursively connected blocks or convolutional layers [9]. Attention-based networks (SelNet, RCAN) offer special blocks that filter the results based on their impact on the quality of super-resolution, using a set of layers and filters that implement a bottleneck [12]. Generative models have also been used for super-resolution (GAN, EnhanceNet, ESRGAN). The aim of the current study was to investigate the applicability of super-resolution technologies to the problem of classifying low-resolution images containing small-sized objects of interest. When solving the classification problem, a compromise must be found between the quality obtained and the preprocessing time. To achieve this compromise, I chose the bicubic interpolation method and the ESPCN (Efficient Sub-Pixel Convolutional Neural Network) model as the main
technologies for solving the problem of super-resolution of low spatial resolution microscopic sputum images. This model can operate in real time and still provide efficient 1080p video processing on a single GPU [7].
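To make the degradation model of formula (2) concrete, the following Python sketch simulates a low-resolution observation from a high-resolution image; the Gaussian blur kernel, scale factor, and noise level are illustrative assumptions rather than values used in the paper.

# Sketch of the degradation model LR = (HR ⊕ k)↓s + σ from formula (2).
# Kernel size, scale factor and noise level are illustrative assumptions;
# an 8-bit grayscale or RGB image is assumed as input.
import cv2
import numpy as np

def degrade(hr_image: np.ndarray, scale: int = 2, blur_sigma: float = 1.0,
            noise_sigma: float = 2.0) -> np.ndarray:
    """Simulate a low-resolution observation from a high-resolution image."""
    blurred = cv2.GaussianBlur(hr_image, (5, 5), blur_sigma)       # HR ⊕ k
    h, w = blurred.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale),
                    interpolation=cv2.INTER_CUBIC)                 # downsampling ↓s
    noise = np.random.normal(0.0, noise_sigma, lr.shape)           # additive Gaussian noise
    return np.clip(lr.astype(np.float64) + noise, 0, 255).astype(np.uint8)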
2 Materials and Methods

The experimental set of images consists of microscopic images of sputum stained by the Ziehl-Neelsen method, obtained at a microscope magnification of 600x and a minimum digital camera resolution of 0.3 MP. The average area of Mycobacterium tuberculosis in the experimental images is 140 pixels, and the ratio of the dimension of the region of interest to the image resolution is 1.7×10⁻⁴. Bicubic interpolation approximates the surface with a third-order polynomial based on the values of the derivatives and intensities of four pixels (formulas 3–5, Fig. 1) [3]. Since bicubic interpolation causes overshoot and redistribution of intensity, the spatial resolution of the image increases.

p(x, y) = Σ_{i=0}^{3} Σ_{j=0}^{3} a_ij · x^i · y^j  (3)

P_i = P_i1 · W_r1(dx) + P_i2 · W_r2(dx) + P_i3 · W_r3(dx) + P_i4 · W_r4(dx)  (4)

P = P_1 · W_c1(dy) + P_2 · W_c2(dy) + P_3 · W_c3(dy) + P_4 · W_c4(dy)  (5)

where W_ri and W_cj are the coefficients of the i-th row and j-th column, calculated through the interpolation kernel of a cubic convolution.
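The row and column weights W_r(dx) and W_c(dy) in formulas (4)–(5) come from a cubic convolution kernel. The following sketch shows the kernel and the four-neighbour weights for one axis; the kernel parameter a = −0.5 is the common default and is assumed here, and in practice the same interpolation is available through cv2.resize with the INTER_CUBIC flag.

# Cubic convolution kernel used to form the weights W_r(dx), W_c(dy) in formulas (4)-(5).
# The kernel parameter a = -0.5 is the common default and is assumed here.
import numpy as np

def cubic_kernel(x: float, a: float = -0.5) -> float:
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def row_weights(dx: float) -> np.ndarray:
    """Weights of the four neighbouring pixels along one axis, as in formula (4)."""
    return np.array([cubic_kernel(1 + dx), cubic_kernel(dx),
                     cubic_kernel(1 - dx), cubic_kernel(2 - dx)])

# In practice the same interpolation is available as, e.g.,
#   cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)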
Fig. 1. Neighborhood of pixel P in two-dimensional space
The ESPCN super-resolution model uses the concept of sub-pixels, which are convolved on the last layer of the model. The smallest unit in imaging is usually considered to be the pixel, which contains the red, green, and blue components (in the RGB model) whose mixing forms the color [1]. The distance between pixels is measured in microns (which corresponds to the dimensions of a digital camera chip), but at the micro level there are also smaller elements between pixels, called sub-pixels. Some machine vision tasks consider not only pixels but also sub-pixels, for example, the ESPCN model for the super-resolution task. Feature extraction in the ESPCN model occurs in the first layers, and the sub-pixel convolution layer is used to aggregate the extracted features and reconstruct the HR
image [7]. The sub-pixel layer is similar to a deconvolution layer and uses an additional step of channel splitting to increase the spatial resolution (the H × W × C·r² tensor is converted to an rH × rW × C tensor) – Fig. 2. To reduce the required computing power, when using the ESPCN model the input images are converted to the YCbCr color space, and then the Y channel is fed to the CNN input. Efficiency is also achieved because human visual perception is more sensitive to brightness than to color [8].
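A minimal sketch of the ESPCN idea described above — feature extraction in low-resolution space followed by a sub-pixel rearrangement of C·r² channels — is given below in Keras. The layer widths and kernel sizes are illustrative and do not reproduce any of the architectures in Fig. 3; the single Y channel of a YCbCr image is assumed as input.

# Minimal ESPCN-style network: features are extracted in low-resolution space and
# the last layer rearranges C*r^2 channels into an upscaled image (depth_to_space).
# Layer widths and kernel sizes are illustrative; input is the Y channel of YCbCr.
import tensorflow as tf

def build_espcn(upscale_factor: int = 2, channels: int = 1) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(None, None, channels))
    x = tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    # C * r^2 feature maps that will be rearranged into the HR image
    x = tf.keras.layers.Conv2D(channels * upscale_factor**2, 3, padding="same")(x)
    # Sub-pixel convolution: (H, W, C*r^2) -> (rH, rW, C)
    outputs = tf.nn.depth_to_space(x, upscale_factor)
    return tf.keras.Model(inputs, outputs, name="espcn_sketch")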
Fig. 2. Diagram of the ESPCN model
Figure 3a shows the ESPCN architecture proposed by Wenzhe Shi et al. [7] (architecture A). Figure 3b shows a modification of the classical ESPCN architecture by Xingyu Long [10] for super-resolution of images from the BSDS500 dataset (architecture B). Figure 3c shows the ESPCN architecture proposed by William Symolon et al. [8] (architecture C).
Fig. 3. Architecture of the ESPCN model, where B – batch size, C – channel size, N – image size, and r – upscale factor.
3 Computational Experiment

The computational experiment is represented by a series of studies with three types of ESPCN architectures (Fig. 3) and bicubic interpolation. The convolutional networks were trained with different hyperparameters, optimizers, scaling factors, and regularization settings (Table 1). The number of training epochs was 200, the mean squared error (MSE) was used as the loss function, and the PSNR criterion was used to measure the quality of the super-resolution. A sketch of this training configuration is given after Table 1.

Table 1. Model parameter variations

Learning Rate | Optimizer | Dropout Regularization | Activation Function | Upscale Factor
0.01          | Adam      | None                   | Sigmoid             | 2
0.001         | RMSprop   | 10%                    | Tanh                | 3
0.0001        | Adagrad   | 30%                    | ReLU                | 4
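A sketch of the training configuration described above (MSE loss, PSNR as the quality measure, 200 epochs) is shown below; the batch size, the validation split, and the (lr_patches, hr_patches) arrays are assumptions for illustration, and the optimizer and learning rate correspond to one row of Table 1.

# Training setup sketch: MSE loss, PSNR as the quality measure, 200 epochs.
# `model` is any Keras super-resolution model (e.g. the ESPCN sketch above);
# the batch size, validation split and data arrays are illustrative assumptions.
import tensorflow as tf

def psnr_metric(y_true, y_pred):
    # max_val=1.0 assumes image patches scaled to [0, 1]
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

def compile_and_train(model, lr_patches, hr_patches, learning_rate=1e-3):
    """Compile with MSE loss and the PSNR quality measure, then train for 200 epochs."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse", metrics=[psnr_metric])
    return model.fit(lr_patches, hr_patches, epochs=200, batch_size=16,
                     validation_split=0.1)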
Table 2 and Table 3 present the results of the computational experiment. Bicubic interpolation with an upscale factor of 2 showed the best results on microscopic images of sputum stained by the Ziehl-Neelsen method, obtained at a microscope magnification of 600x and a minimum digital camera resolution of 0.3 MP. The modification of the classical architecture proposed by Xingyu Long showed the best results among the ESPCN models with the following parameters: optimizer = Adam, dropout = None, activation = sigmoid, learning rate = 0.001, and upscale factor = 2.

Table 2. Results of the computational experiment with super-resolution using ESPCN

                     | Upscale factor 2, PSNR (M ± σ) | Upscale factor 3, PSNR (M ± σ) | Upscale factor 4, PSNR (M ± σ)
Architecture A ESPCN | 38.89 ± 0.06                   | 36.84 ± 0.57                   | 36.12 ± 0.30
Architecture B ESPCN | 39.76 ± 1.19                   | 35.19 ± 1.57                   | 35.34 ± 1.29
Architecture C ESPCN | 39.34 ± 1.28                   | 36.61 ± 0.86                   | 35.54 ± 0.12
Table 3. Results of the computational experiment with super-resolution using bicubic interpolation

        | Upscale factor 2, PSNR | Upscale factor 3, PSNR | Upscale factor 4, PSNR
Bicubic | 41.10                  | 36.30                  | 36.93
The set of experimental images contains small-sized objects of interest, so it was important not only to calculate the value of the PSNR criterion but also to make sure that
reconstructive preprocessing using super-resolution methods and models would preserve the classification quality. The author used transfer learning based on the VGG19 model with 30 epochs to test the quality of the super-resolution reconstruction. Three sets of images were used as experimental data: the original set, the original set preprocessed using bicubic interpolation with an upscale factor of 2, and the original set preprocessed using the Architecture B ESPCN model with an upscale factor of 2. The parameters of this experiment were: binary classification, loss = categorical crossentropy, optimizer = Adam, and metric = accuracy. Table 4 presents the results of checking the quality of classification using super-resolution. A sketch of this transfer-learning setup is given after the table.

Table 4. The results of checking the quality of classification using super-resolution

                                                                                        | Accuracy | Loss | F1-score
Original set                                                                            | 94.71    | 0.14 | 0.95
Original set with super-resolution using bicubic interpolation (upscale factor = 2)     | 95.63    | 0.11 | 0.96
Original set with super-resolution using the Architecture B ESPCN model (upscale factor = 2) | 96.11 | 0.10 | 0.96
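A hedged sketch of the transfer-learning check described above (a VGG19 backbone trained for 30 epochs with categorical cross-entropy, the Adam optimizer, and the accuracy metric) might look as follows; the input size, the frozen backbone, and the dense head are illustrative assumptions.

# Transfer-learning check sketch: frozen VGG19 backbone with a small classification
# head, trained for 30 epochs with categorical cross-entropy and the Adam optimizer.
# Input size and the dense head are illustrative assumptions.
import tensorflow as tf

base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # keep the pretrained features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # two classes, one-hot labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)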
4 Conclusion

The article compared the performance of super-resolution methods for the problem of classification of low spatial resolution microscopic sputum images containing small-sized objects of interest. Bicubic interpolation with an upscale factor of 2 showed the best results both in preprocessing speed (since this method does not require pre-training) and according to the PSNR criterion for the super-resolution of microscopic images of sputum stained by the Ziehl-Neelsen method, obtained with a 10×60 microscope magnification and a minimum digital camera resolution of 0.3 MP. The results of the super-resolution were also verified on the binary classification problem. This is important because the reconstruction of the original dataset can affect small image details, which can have a significant impact on the result of medical diagnostics. The binary classification accuracy with the Architecture B ESPCN model was the highest, i.e., the super-resolution of the objects of interest turned out to be more favorable for the classification result. The PSNR results showed an improvement in the resolution of the original set of Ziehl-Neelsen-stained sputum microscopy images containing small-sized objects of interest and, as a consequence, improved binary classification results, which is a good outcome from the point of view of preserving small-sized objects of interest when using super-resolution technology. However, the improvements in both the spatial resolution quality and the final classification results were not large, so this study is only a step towards the application of super-resolution for the classification of microscopic images of low spatial resolution.
References

1. Aitken, A., Ledig, C., Theis, L., Caballero, J., Wang, Z., Shi, W.: Checkerboard artifact-free sub-pixel convolution: a note on sub-pixel convolution, resize convolution and convolution resize. https://arxiv.org/abs/1707.02937. Accessed 25 May 2023
2. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
3. Khaledyan, D., Amirany, A., Jafari, K., Moaiyeri, M., Zargari, A., Mashhadi, N.: Low-cost implementation of bilinear and bicubic image interpolation for real-time image super-resolution. https://arxiv.org/abs/2009.09622. Accessed 25 May 2023
4. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, pp. 1646–1654 (2016)
5. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: CVPRW, pp. 1132–1140 (2017)
6. Liu, H., et al.: Video super-resolution based on deep learning: a comprehensive survey. Artif. Intell. Rev. 55, 5981–6035 (2022). https://doi.org/10.1007/s10462-022-10147-y
7. Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, pp. 1874–1883 (2016)
8. Symolon, W., Dagli, C.: Single-image super resolution using convolutional neural network. Procedia Comput. Sci. 185, 213–222 (2021)
9. Wang, Z., Chen, J., Hoi, S.C.H.: Deep learning for image super-resolution: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3365–3387 (2021)
10. Xingyu, L.: Image super-resolution using an efficient sub-pixel CNN. https://keras.io/examples/vision/super_resolution_sub_pixel/. Accessed 25 May 2023
11. Zhang, K., Wangmeng, Z., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2018)
12. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Yun, F.: Image super-resolution using very deep residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 294–310. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_18
An Intelligent Day Ahead Solar Plant’s Power Forecasting System Ekaterina A. Engel(B) and Nikita E. Engel Katanov State University of Khakassia, Shetinkina 61, 655017 Abakan, RF, Russia [email protected]
Abstract. The power production of a photovoltaic system has complex nonlinear dynamics with uncertainties, since solar radiation and temperature fluctuate. It is therefore difficult to approximate this complex dynamics with conventional algorithms, whereas machine learning algorithms provide the required forecasting performance. We solved the intelligent day ahead solar plant power forecasting task based on a modified neural net with a developed fuzzy attention mechanism. In contrast with existing fuzzy attention mechanisms, which use a classical membership function, we treat the attention mechanism's context vector as fuzzy measures. The developed fuzzy attention mechanism selects, from all signals provided by the classical attention mechanism, only the important signal, based on fuzzy measures and the fuzzy integral. The comparative simulation results of the intelligent day ahead solar plant power forecasting system based on the created modified neural net with a fuzzy attention mechanism reveal its advantages and competitive performance compared to a classical RNN. The modified neural net with a fuzzy attention mechanism shows competitive performance compared to a modified fuzzy neuronet and an RNN for day ahead forecasting of hourly solar plant power. Keywords: modified neural net · fuzzy attention mechanism · solar plants' power forecasting · uncertainties
1 Introduction

The power production of a photovoltaic (PV) system has complex nonlinear dynamics with uncertainties, since solar radiation and temperature fluctuate [1]. It is therefore difficult to approximate this complex dynamics with conventional algorithms, whereas machine learning (ML) algorithms provide the required forecasting performance [2]. Power forecasting for a PV system is very important for the safety and effectiveness of the electric grid. If a deviation from the hourly plan schedule of the PV system's power arises, the energy market charges penalties. In recent research, ML methods are used to increase the performance of models forecasting solar plant power under uncertainties [3–25], as compared to conventional algorithms. Effective data preprocessing methods provide better performance of ML models [4]. The dataset's length has a positive correlation with the forecast's performance
[4]. There are many ML algorithms and their variants for PV array power forecasting [3–25]. However, most researchers trained these networks of large size without regard to the complexity of the solar plant power forecasting task. That can cause overfitting problems, with devastating effects on the generalization performance [26]. In recent research, Transformers have been extremely successful in many deep learning tasks. The core element of Transformers is self-attention, which scales as O(N²) with the sequence length and, as a result, is severely limited in application to long sequences [27]. According to paper [27], the replacement of a feed-forward layer by memory layers improves Transformer capacity at a negligible computational cost. The study [27] extended the Transformer to store non-local representations, creating a memory bottleneck for global information. In contrast with the fuzzy attention mechanisms presented in papers [28, 29], which use a classical membership function, we treat the attention mechanism's context vector as fuzzy measures. The developed fuzzy attention mechanism selects, from all signals provided by the classical attention mechanism [30–32], only the important signal, based on fuzzy measures and the fuzzy integral. In our paper [3] we presented an intelligent time series forecasting framework based on the authors' software [33] that automatically creates an optimum architecture of the modified fuzzy neural net (MFNN) with regard to a forecasting task's complexity. In order to increase the MFNN's forecasting performance, we develop and integrate into the MFNN a fuzzy attention mechanism, which is verified by solving the solar plant power forecasting task. We solved the intelligent day ahead solar plant's power forecasting task based on a modified fuzzy neural net with a fuzzy attention mechanism (MFANN). The simulation results of the intelligent day ahead solar plant's power forecasting system based on the created MFANN reveal its advantages and competitive performance, as compared to the MFNN and a classical RNN.
2 A Day Ahead Hourly Solar Plant Power Forecasting Based on the MFANN

The MFANN includes a two-layer RNN, fuzzy rules, and two two-layered RNNs. There are mainly three types of input signals for forecasting solar plant power with ML methods:

• only historical output power,
• forecasted meteorological parameters,
• the historical power data and forecasted meteorological parameters.

In our research we use the last type of input signal. We create the MFANN for day ahead hourly solar plant power forecasting based on the data

X_h^t = {C_h^(t−2−m), l_h^(t−m), a_h^(t−m), P_h^(t−2−m), R_h^(t−m), W_h^(t−m), d_h^(t−m)},  (1)

where C_h^(t−2−m) is the historical data of the clear-sky index, l_h^(t−m) is the cloudiness (%), a_h^(t−m) is the ambient temperature, P_h^(t−2−m) is the generated power of the solar plant,
R_h^(t−m) is the pressure, W_h^(t−m) and d_h^(t−m) are the wind speed and the wind direction, respectively; h is the hour, h ∈ 1..19; t is the day's index, t ∈ {370, .., 1095}; m is the index used to form the sample X_h^t, m = 0..14, 361..369. The data (1) were collected at the Abakan solar plant. We defined the fitness function as follows:

RMSE = sqrt( (1/(N − t + 1)) · Σ_t^N (P^t − P_x^t)² ),  (2)

where P_x^t is the value generated by the MFANN with architecture x; N = 1003 and t = 370 for the training data set (1); N = 1095 and t = 1004 for the test data set (1). We defined RMSE% for the test data set (1) as follows:

RMSE% = sqrt( (1/92) · Σ_{t=1004}^{1095} ((P^t − P_x^t)/P^t)² ).  (3)

We briefly describe the creation of the MFANN for day ahead hourly solar plant power forecasting as follows.

Step 1. All samples of the data (1) were classified into two groups according to the hour's state of cloudiness: A1 is a sunny hour (C^t = 1), A2 is a cloudy hour (C^t = −1). This classification generates a vector with elements C^t.

Step 2. We created the two-layer RNN with a developed fuzzy attention mechanism, Y(X_h^t). The vector X_h^t was the network's input. The vector C^t was the network's target. We formed the membership functions μ_j(X_h^t), j ∈ {1, 2}, based on the output n of the two-layer RNN Y(X_h^t) as follows: if n > 0 then μ_1(X_h^t) = n and μ_2(X_h^t) = 1 − μ_1(X_h^t); otherwise μ_2(X_h^t) = abs(n) and μ_1(X_h^t) = 1 − μ_2(X_h^t).

Step 3. The MFANN includes the RNN Y(X_h^t) and the RNNs f_j = F_j(P_h^(t−2−m)).

Step 4. We applied the developed fuzzy attention mechanism by calculating a fuzzy expected solution (Fes) I_h, based on a vector w of hidden layer signals of the RNN Y(X_h^t) and a context vector [ν_1, …, ν_q], in 3 sub-steps:

Step 4.1. Solve the equation [∏_{i=1}^{q} (1 + λw_i) − 1] / λ = 1, −1 < λ < ∞.

Step 4.2. Calculate s = ∫ h ∘ W_λ = sup_{α∈[0,1]} min(α, W_λ(F_α(v_j))), where F_α(v_j) = {F_i | F_i(v_j) ≥ α}, v_j ∈ V, and W_λ(F_α(v_j)) = [∏_{F_i ∈ F_α(v_j)} (1 + λw_i) − 1] / λ.

Step 4.3. Calculate I_h = max_{v_j ∈ V} (s · w_j).

We give the RNNs F_j(P_h^(t−2−m)) an extra signal I_h on a hidden layer. This signal gives useful feedback for providing the day ahead hourly power forecasting value during the dynamically changing time series (1).

Step 5. The modified multidimensional quantum-behaved PSO algorithm (MD QPSO), which we developed in [3], generates the trained MFANN, i.e. the best solution x, based on the data (1). We used function (2) as the fitness function for the modified MD QPSO.
Step 6. The if-then rules are defined as:

R_j: IF X_h^t is A_j THEN u = f_j, j = 1..2.  (4)
We briefly describe the simulation of the trained MFANN as follows.

Step 1. Aggregation of the antecedents of the rules (4) maps X_h^t into its membership functions and activates the k-th rule, which indicates the k-th hour's state of cloudiness and the corresponding k-th RNN F_k(P_h^(t−2−m)), k ∈ 1..2.

Step 2. The RNN F_k(P_h^(t−2−m)) generates the day ahead hourly power forecasting value u.

The developed fuzzy attention mechanism selects, from all signals provided by the classical attention mechanism, only the important signal, based on fuzzy measures and the fuzzy integral. This forecasting approach provides a more intelligent algorithm for generating the day ahead hourly forecast power u based on the MFANN.
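The fuzzy-measure computation in Step 4 can be illustrated with a standard Sugeno λ-measure and Sugeno fuzzy integral. The sketch below is a generic illustration of this kind of computation, not the authors' exact formulation; the densities w and the hidden-layer signals h are placeholder values.

# Generic sketch of a Sugeno lambda-measure and Sugeno fuzzy integral, illustrating
# the kind of computation used in Step 4 (not the authors' exact formulation).
# The densities w and the hidden-layer signals h are placeholder values.
import numpy as np
from scipy.optimize import brentq

def solve_lambda(w, tol=1e-9):
    """Sugeno normalisation: find lambda with prod(1 + lambda*w_i) = 1 + lambda."""
    w = np.asarray(w, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * w) - 1.0 - lam
    s = w.sum()
    if abs(s - 1.0) < tol:
        return 0.0                            # densities already sum to one
    if s < 1.0:
        return brentq(f, 1e-6, 1e6)           # nonzero root is positive
    return brentq(f, -1.0 + 1e-9, -1e-6)      # nonzero root lies in (-1, 0)

def lambda_measure(subset_w, lam):
    """Fuzzy measure of a subset given its densities."""
    subset_w = np.asarray(subset_w, dtype=float)
    if lam == 0.0:
        return subset_w.sum()
    return (np.prod(1.0 + lam * subset_w) - 1.0) / lam

def sugeno_integral(h, w):
    """max over k of min(h_(k), g(top-k elements)) with h sorted in descending order."""
    lam = solve_lambda(w)
    order = np.argsort(h)[::-1]               # signals sorted in descending order
    best = 0.0
    for k in range(1, len(h) + 1):
        subset = [w[i] for i in order[:k]]
        best = max(best, min(h[order[k - 1]], lambda_measure(subset, lam)))
    return best

h = np.array([0.7, 0.2, 0.5])   # placeholder hidden-layer signals in [0, 1]
w = np.array([0.3, 0.4, 0.2])   # placeholder fuzzy densities
print(sugeno_integral(h, w))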
3 Simulation and Results

To illustrate the benefits of the MFANN in day ahead forecasting of hourly power from the Abakan solar plant, we revisited the numerical examples from Sect. 2 based on the authors' software [3, 33]. We created the MFNN and the MFANN based on the modified MD QPSO. As compared to the MFNN, the MFANN includes the developed fuzzy attention mechanism. The modified MD QPSO, based on the training data set (1) from 09/16 through 06/19, created the optimum architecture of the MFANN and MFNN, which includes the two-layered RNN Y(X_h^t) with five hidden neurons (the number of delays is two) and the two two-layered RNNs f with seven and ten hidden neurons, correspondingly (the number of delays is two). Figure 1 represents the plot of the actual generated power from 08/17/19 through 08/30/19 in comparison to the power forecasted by the MFANN and by the RNN (two-layered, with 10 hidden neurons and 2 delays), which was trained by the Levenberg-Marquardt algorithm on the training data (1). Figure 1 demonstrates that the fitness function (2) of the MFANN in sunny hours is quite small as compared with the RNN, which we generated as a two-layered network based on data (1) with ten hidden neurons and two delays. Figure 1 demonstrates that the performance of the MFANN changes between sunny and cloudy hours. Nevertheless, the MFANN effectively tracks the complex dynamics of the real measured data in cloudy hours. The performance of the MFANN trained by the proposed algorithm is superior to the RNN trained by the Levenberg-Marquardt algorithm, especially under cloudy conditions.
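The error measures of formulas (2) and (3) used in this comparison can be written directly as follows; p_true and p_pred are placeholder arrays of the actual and forecast hourly power.

# Direct implementation of the quality measures in formulas (2) and (3).
# p_true and p_pred are placeholder arrays of actual and forecast hourly power
# over the test period (92 days, t = 1004..1095).
import numpy as np

def rmse(p_true, p_pred):
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    return np.sqrt(np.mean((p_true - p_pred) ** 2))                # formula (2)

def rmse_percent(p_true, p_pred):
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    return np.sqrt(np.mean(((p_true - p_pred) / p_true) ** 2))     # formula (3)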
Fig. 1. The actual generated power from the Abakan solar plant in comparison to forecasted power based on the MFANN and the RNN
Table 1 indicates that the MFANN has competitive performance on the test data set (1) as compared to the MFNN and a classical RNN for day ahead forecasting of hourly solar plant power.

Table 1. A day ahead forecasting of the hourly solar plants' power: comparison of results

Predicting Method                                     | Dataset's length | RMSE (Wh/m2) | RMSE%
RNN-LSTM [13]                                         | 4 years          | 26.85        | –
XGBoost-DNN [14]                                      | 10 years         | 51.35        | –
DPNN [15]                                             | 2 weeks          | 52.8         | –
K-means-AE-CNNLSTM [16]                               | –                | 45.11        | –
LSTM-RNN [17]                                         | 1 year           | 82.15        | –
LSTM [18]                                             | –                | 139.3        | –
MLPNN [19]                                            | 1 year           | 160.3        | –
TDNN + clustering [19]                                | 1 year           | 122          | –
MLFFNN based on BP [19]                               | 1 year           | 223          | –
CNN-Simple [20]                                       | 6 years          | 51           | –
Multi-headed CNN [20]                                 | 6 years          | 81           | –
CNN-LSTM [20]                                         | 6 years          | 51           | –
D-PNN [21]                                            | –                | 60           | –
CNN [22]                                              | 15 months        | –            | 8.69%
LSTM [22]                                             | 15 months        | –            | 7.56%
LSTM NN [23]                                          | 3 months         | 7.1          | –
RNN [23]                                              | 3 months         | 9.2          | –
Generalized regression neural network (GRNN) [23]     | 3 months         | 13.1         | –
Transfer learning constrained LSTM (TL + C-LSTM) [24] | 1 year           | 8.89         | –
MFANN                                                 | 3 years          | 28.53        | 8.25%
MFNN                                                  | 3 years          | 35.24        | 12.31%
RNN                                                   | 3 years          | 53.39        | 15.44%
The comparative simulation results of the created MFANN for day ahead solar plant power forecasting reveal its advantages and competitive performance as compared to the MFNN and a classical RNN.
4 Conclusions

We solved the intelligent day ahead solar plant power forecasting task based on the MFANN, which integrates the developed fuzzy attention mechanism. In contrast with existing fuzzy attention mechanisms, which use a classical membership function, we treat the attention mechanism's context vector as fuzzy measures. The developed fuzzy attention mechanism selects, from all signals provided by the classical attention mechanism, only the important signal, based on fuzzy measures and the fuzzy integral. The comparative simulation results of the intelligent day ahead solar plant power forecasting system based on the created MFANN reveal its advantages and competitive performance, as compared to the MFNN and a classical RNN.

Acknowledgement. The reported study was fulfilled within the activity "Development of intelligent systems for forecasting and maximizing power generation based on the original modified fuzzy neural network, their implementation as software and the implementation at a renewable power plant" within the program of the World-class Scientific Educational Center "Yenisei Siberia".
Funding. The study was funded by a grant from the Ministry of Education and Science of the Republic of Khakassia (Agreement No. 91 dated 12/13/22).
References

1. Engel, E., Kovalev, I., Testoyedov, N., Engel, N.E.: Intelligent reconfigurable photovoltaic system. Energies 14, 7969 (2021). https://doi.org/10.3390/en14237969
2. Engel, E.A., Engel, N.E.: Automatic adaptation of a solar plant intelligent control system. J. Phys. Conf. Ser. 2, 022029 (2021)
3. Engel, E.A., Engel, N.E.: The intelligent times series forecasting framework. In: Neuroinformatics-2022. Proceedings of the XXIV International Scientific and Technical Conference, Moscow, pp. 263–269 (2022) (in Russian)
4. Engel, E., Engel, N.: A review on machine learning applications for solar plants. Sensors 22, 9060 (2022)
5. Chen, K., Chen, K., Wang, Q., He, Z., Hu, J., He, J.: Short-term load forecasting with deep residual networks. IEEE Trans. Smart Grid (2018)
6. Kuo, P.-H., Huang, C.-J.: A high precision artificial neural networks model for short-term energy load forecasting. Energies 11, 1 (2018). http://www.mdpi.com/1996-1073/11/1/213
7. Bianchi, F.M., Maiorino, E., Kampffmeyer, M.C., Rizzi, A., Jenssen, R.: An overview and comparative analysis of recurrent neural networks for short term load forecasting. CoRR, abs/1705.04378 (2017). http://arxiv.org/abs/1705.04378
8. Bianchi, F.M., Santis, E.D., Rizzi, A., Sadeghian, A.: Short-term electric load forecasting using echo state networks and PCA decomposition. IEEE Access 3, 1931–1943 (2015)
9. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.: Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: comparison with machine learning approaches. Energies 11(7), 1636 (2018)
10. Wang, Y., Liu, M., Bao, Z., Zhang, S.: Short-term load forecasting with multi-source data using gated recurrent unit neural networks. Energies 11, 1138 (2018)
11. Tavares, C.A.P., Leite, K.T.F., Suemitsu, W.I., Bellar, M.D.: Performance evaluation of PV solar system with different MPPT methods. In: IECON '09, 35th Annual Conference of IEEE Industrial Electronics, pp. 719–724 (2009)
12. Kumar, A., Chaudhary, P., Rizwan, M.: Development of fuzzy logic based MPPT controller for PV system at varying meteorological parameters. In: 2015 Annual IEEE India Conference (INDICON), pp. 1–6, New Delhi (2015)
13. Akhter, M.N., et al.: An hour-ahead PV power forecasting method based on an RNN-LSTM model for three different PV plants. Energies 15, 2243 (2022)
14. Kumari, P., Toshniwal, D.: Extreme gradient boosting and deep neural network-based ensemble learning approach to forecast hourly solar irradiance. J. Clean. Prod. 279, 123285 (2021)
15. Zjavka, L.: PV power intra-day predictions using PDE models of polynomial networks based on operational calculus. IET Renew. Power Gener. 14, 1405–1412 (2020)
16. Zhen, Z., et al.: Deep learning based surface irradiance mapping model for solar PV power forecasting using sky image. IEEE Trans. Ind. Appl. 56, 3385–3396 (2020)
17. Abdel-Nasser, M., Mahmoud, K.: Accurate photovoltaic power forecasting models using deep LSTM-RNN. Neural Comput. Appl. 31, 2727–2740 (2019)
18. Zhang, J., Verschae, R., Nobuhara, S., Lalonde, J.-F.: Deep photovoltaic nowcasting. Sol. Energy 176, 267–276 (2018)
19. Akhter, M.N., Mekhilef, S., Mokhlis, H., Shah, N.M.: Review on forecasting of photovoltaic power generation based on machine learning and metaheuristic techniques. IET Renew. Power Gener. 13, 1009–1023 (2019)
20. Suresh, V., Janik, P., Rezmer, J., Leonowicz, Z.: Forecasting solar PV output using convolutional neural networks with a sliding window algorithm. Energies 13, 723 (2020)
21. Zjavka, L., Snášel, V.: PV energy prediction in 24 h horizon using modular models based on polynomial conversion of the L-transform PDE derivatives in node-by-node-evolved binary-tree networks. Eng. Proc. 18, 34 (2022)
22. Pombo, D.V., Bindner, H.W., Spataru, S.V., Sorensen, P.E., Bacher, P.: Increasing the accuracy of hourly multi-output solar power forecast with physics-informed machine learning. Sensors 22, 749 (2022)
23. Hossain, M.S., Mahmood, H.: Short-term photovoltaic power forecasting using an LSTM neural network and synthetic weather forecast. IEEE Access 8, 172524–172533 (2020)
24. Luo, X., Zhang, D., Zhu, X.: Combining transfer learning and constrained long short-term memory for power generation forecasting of newly-constructed photovoltaic plants. Renew. Energy 185, 1062–1077 (2022)
25. Engel, E.: A photovoltaic applications on the basis of modified fuzzy neural net. In: Solar Irradiance: Types and Applications, pp. 7–87. Nova Science Publishers (2020)
26. Baymurzina, D., Golikov, E., Burtsev, M.: A review of neural architecture search. Neurocomputing 474, 82–93 (2022)
27. Sorokin, A., Pugachev, L.P., Burtsev, M.S.: Training long-term memory through prediction of high-uncertainty events. Proceedings of MIPT (Trudy MFTI) 13(4(52)), 39–55 (2021) (in Russian)
28. Wang, C., Lv, X., Shao, M., Qian, Y., Zhang, Y.: A novel fuzzy hierarchical fusion attention convolution neural network for medical image super-resolution reconstruction. Inf. Sci. 622, 424–436 (2023)
29. Yang, R., Yu, J., Yin, J., et al.: An FA-SegNet image segmentation model based on fuzzy attention and its application in cardiac MRI segmentation. Int. J. Comput. Intell. Syst. 15, 24 (2022). https://doi.org/10.1007/s44196-022-00080-x
30. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
31. Sagirova, A., Burtsev, M.: Extending transformer decoder with working memory for sequence to sequence tasks. Stud. Comput. Intell. 253–260 (2021)
32. Al Adel, A., Burtsev, M.S.: Memory transformer with hierarchical attention for long document processing. In: 2021 International Conference Engineering and Telecommunication (En&T) (2021)
33. The module of the modified fuzzy neural net. M.: Federal Service for Intellectual Property (Rospatent), Certificate of State registration of software No. 2021681065 (2021)
Determining the Significance of Input Features in Predicting Magnetic Storms Using Machine Learning Methods Roman Vladimirov(B) , Vladimir Shirokiy , Oleg Barinov, and Irina Myagkova D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State University, Moscow, Russia [email protected], {shiroky,irina}@srd.sinp.msu.ru
Abstract. In this paper, we study an algorithm for obtaining the most efficient model for predicting the amplitude of the geomagnetic Dst index, based on lowering the input data dimension by gradually discarding input features. This task is relevant, since the selection of significant input data is necessary for the effective use of machine learning methods. The study was carried out on the basis of the following machine learning methods: artificial neural network of the multilayer perceptron type, gradient boosting, linear regression. Comparison of the effectiveness of the listed methods is carried out. Keywords: geomagnetic Dst index · geomagnetic storms · neural networks · gradient boosting · ridge regression · random forest
1 Introduction

Magnetic storms are one of the most important factors in space weather. The influence of space weather on the functioning of electronic devices on Earth and in space will increase with the development of the global digital industry and the miniaturization of electronics [1]. Geomagnetic storms are known to cause disruption of radio communications and to complicate the operation of power lines, electrical networks, and pipelines [2]. Therefore, forecasting geomagnetic storms is an important practical task. The Dst index (Disturbance Storm Time) is an important geomagnetic index. It is a measure of the magnetic field change due to ring currents that occur in the Earth's magnetosphere during geomagnetic storms. This index was introduced by M. Sugiura in 1964 [3]. Data on the Dst index are available since the beginning of 1957. The Dst index is calculated as the average hourly value of the perturbation of the horizontal component of the Earth's magnetic field, measured from the quiet level, which is determined from the data of four low-latitude observatories located at different geographical longitudes: Kakioka, Honolulu, San Juan, and Hermanus [3]. The Dst index is calculated as a function of time, once per hour. Thus, the prediction of the Dst amplitude makes it possible to estimate both the onset time and the power of a geomagnetic disturbance.
The fact that we have data on the Dst index for a long time makes possible using for its prediction the statistical relationship between processes on the Sun, in the solar wind (SW), interplanetary space, and the Earth’s magnetosphere, on one side, and geomagnetic activity, on the other side, based on the Burton formula [4], as well as application of machine learning (ML) methods (e.g., [5–9]). In our previous study [10], we compared the quality indicators of Dst prediction with a horizon of one to six hours, performed using three ML methods – random forest, gradient boosting, and multilayer perceptron artificial neural networks. The best results among the three methods listed above were obtained using gradient boosting. However, one of the problems of ML methods is the large dimension of the input data, which was about one hundred and thirty. Such a high dimension is due to the fact that for all input physical quantities, which will be described below, delay embedding was used – all their previous values with a delay from 1 to 24 h were taken into account. Meanwhile, in multidimensional spaces, the measure concentration effect takes place, one of the consequences of which is the fact that a small neighborhood of the median level of any function continuous on a multidimensional sphere contains almost the entire sphere. Therefore, from the point of view of an observer who measures the values of this function, it turns out to be practically constant [11]. In other words, any non-linear multivariate predictor should give a forecast close to the inertial one, which in the context of time series forecasting is often called “tomorrow as today”. On the other hand, there is hope that with a decrease in the dimension of the space of input features and the associated weakening of the effect of measure concentration, the quality of the forecast performed using ML will increase in comparison with the inertial (trivial) one. Given this fact, it becomes obvious that when solving the forecasting problem, reducing the dimension of the input data is very important. In addition to the actual reduction in dimensionality, the analysis of the set of features selected as the most significant ones allows researchers to obtain information both about the relative significance of certain physical processes on the Sun and in interplanetary space, and about the optimal embedding depth of the input data. The objective of the presented study is comparative analysis of the results of applying the algorithm for ranking input features by gradually discarding them using the following ML methods: linear regression, gradient boosting, and multilayer perceptron artificial neural network (ANN). The selected set of the most significant input features was used by the authors to obtain the most efficient model that predicts the Dst index with a horizon of one to six hours.
2 Input Data The processes in the Earth’s magnetosphere, the solar wind (SW) and the heliosphere are interconnected, so the space weather forecast in the “Sun - heliosphere - solar wind – magnetosphere” chain is made based on measurement data carried out both in space experiments and by ground geophysical stations. For an operational forecast, experimental data must be obtained in real time – information is required on the values of the SW parameters, the interplanetary magnetic field (IMF), and the geomagnetic index Dst itself. The input data for the Dst index prediction are the SW and IMF plasma parameters
measured at the Lagrange point L1 between the Sun and the Earth in the experiment aboard the ACE (Advanced Composition Explorer) spacecraft (http://www.srl.caltech. edu/ACE), as well as the values of Dst from the website of the World Data Center for Geomagnetism in Kyoto (https://wdc.kugi.kyoto-u.ac.jp/dstae/index.html). In this study, we used the following data – time series (TS) of average hourly values of the following physical quantities: SW and IMF parameters measured at the Lagrange point L1 between the Sun and Earth – SW velocity and density V (km/s) and Np (cm−3 ), respectively; components of the IMF vector at the same Lagrange point L1 in the GSM system – By , Bz (y- and z-components of the IMF), and the IMF module Bmagn (nT); the geomagnetic index Dst itself (nT). In addition, to take into account possible daily and annual variations of the Dst-index, the values characterizing the binding of each data pattern to certain phases of the daily and annual cycles, i.e., the values of sines and cosines with daily and annual periods, were used as input features [9]. To take into account the prehistory, the delay embedding of all TSs for up to 23 h was used, that is, the input of the algorithm, in addition to the current values of all input quantities, was fed with their previous values one, two, three, …, twenty-three hours before the current one. This seems enough, working with data with an hourly temporal resolution. The total input data dimension obtained was 148 = 6*24 + 4. In this study, we used a dataset from the moment when data began to arrive from the ACE spacecraft (November 1997) to December 2021 inclusive. This makes about 212,000 patterns. Data gaps of 12 h or less in each input feature were filled in by linear interpolation between the two points adjacent to the gap from both sides; patterns still containing gaps after this procedure were excluded from the model training dataset. The resulting data set was divided into the training sample and the test set of data. The training sample was used to train the algorithms – to adapt the adjustable parameters of the models; the test set was used to evaluate the training result on independent data. For the ANN, the training sample was further divided into training and validation datasets. The training set was used to adjust the weights when training the ANN, while the validation set was used for periodic checks during the training process to prevent overtraining of the ANN. In this study, the training and validation sets used the data from November 1997 to December 2016 inclusive divided randomly in a ratio of 80 percent to 20 percent respectively, and the test set consisted of data from 2017 to 2021.
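A sketch of the input preparation described above — filling gaps of at most 12 h by linear interpolation and delay-embedding each physical quantity with lags of 0 to 23 h — is given below; the DataFrame df and its column names are placeholders, and the harmonic daily/annual features are omitted for brevity.

# Sketch of the input preparation: short gaps (<= 12 h) are filled by linear
# interpolation, and each physical quantity is delay-embedded with lags 0..23 h.
# The DataFrame `df` and its column names are placeholders.
import pandas as pd

def prepare_inputs(df: pd.DataFrame, lags: int = 24) -> pd.DataFrame:
    # fill gaps of at most 12 consecutive hours
    df = df.interpolate(method="linear", limit=12)
    features = {}
    for col in ["V", "Np", "By", "Bz", "Bmagn", "Dst"]:
        for lag in range(lags):
            features[f"{col}_lag{lag}"] = df[col].shift(lag)
    out = pd.DataFrame(features, index=df.index)
    return out.dropna()  # drop patterns that still contain gaps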
3 Selection of Significant Input Features: Description of the Algorithm The significance of input features was assessed using the algorithm shown in Fig. 1. This algorithm requires a basic forecasting model that takes as inputs any number of features from 1 to N (plus 4 harmonic functions), and the training data set. The algorithm consists of the following steps: First, the base model using all N features is trained, and its performance is evaluated. This model serves as a reference, along with the trivial inertial model (predicted value = last known value). Second, the input features are excluded one by one. N models are trained with N – 1 input features each. The quality of each model is evaluated.
Next, the feature that leads to the highest forecast accuracy model when removed is considered the least important one, and it is permanently dropped from the input dataset. Then, if more than one feature remains in the input data, N = N – 1 is assigned, and the process goes back to the second step.
Fig. 1. The general scheme of the algorithm for assessing the significance of input features.
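A hedged sketch of the sequential feature-removal loop shown in Fig. 1, with linear regression as the base model, is given below; scoring by RMSE on a held-out validation split is an assumption, and X_train/X_val are assumed to be pandas DataFrames.

# Sketch of the sequential feature-removal procedure shown in Fig. 1, using linear
# regression as the base model. Scoring by validation RMSE is an assumption;
# X_train and X_val are assumed to be pandas DataFrames with named columns.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rank_features(X_train, y_train, X_val, y_val, feature_names):
    remaining = list(feature_names)
    dropped_order = []
    while len(remaining) > 1:
        scores = {}
        for feat in remaining:
            subset = [f for f in remaining if f != feat]
            model = LinearRegression().fit(X_train[subset], y_train)
            pred = model.predict(X_val[subset])
            scores[feat] = np.sqrt(mean_squared_error(y_val, pred))
        # the feature whose removal gives the lowest error is the least important one
        least_important = min(scores, key=scores.get)
        remaining.remove(least_important)
        dropped_order.append(least_important)
    return dropped_order + remaining  # ascending order of significance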
The algorithm described ranks all the input features in the order of being discarded, i.e. in the ascending order of significance. The set of input features that corresponds to the best indicator of quality of the model built on it is considered the optimal one. In this paper, the algorithm was used with two types of the basic forecasting model: linear regression (LR) and gradient boosting (GB). The computational cost of LR is considerably lower, but its linearity can result in lower prediction quality. The GB algorithm is a combination of simple models. The standard approach here is to use shallow decision trees. Each tree (in accordance with the boosting approach) should improve the prediction of the preceding one, and the coefficients of the models are determined by the gradient descent algorithm. The decision tree model is a piecewise linear approximator, and it is capable to describe non-linear approximated functions with the desired accuracy, if both the size and the representativity of the training set and the algorithm parameters, in particular the depth of the constructed decision graph, allow this. The increase of the depth of the trees provides a smoother approximation. When ensembling, a set of such decision trees is built, and each subsequent tree allows
approximating an increasingly complex and non-linear function. That is why the GB algorithm as a whole is a non-linear (piecewise linear) approximation algorithm, despite the linearity of a single tree node. The multilayer perceptron (MLP) was not included as a basic algorithm in this study because constructing each model with it would be much more computationally expensive. Full implementation of the selection algorithm for MLP would require approximately two months of continuous computing on the available computing power, which is significantly more time-consuming compared to the computational cost of the GB algorithm. Thus, the described algorithm was used to obtain two best sets of input features: one selected with the help of LR (the one that resulted in obtaining the best LR model) and the other selected with the help of GB (the one that resulted in obtaining the best GB model). The third best set of input features was obtained by considering the sequences of sets of input features obtained using LR and GB. We explored sets ranging from 3 to 31 input features for both LR and GB separately. On each of these sets, 5 identical MLPs were trained, differing by weight initialization; the output values of these 5 MLPs were averaged. The set of input features for which such a peer committee of 5 networks gave the best quality score was considered the best one for MLP. It was this set of features that was considered to be most physically significant in the analysis of the selected features (see below), because the MLP, due to its nonlinearity and the known property of universal approximation, usually tends to select fewer features compared to LR or GB. Thus, the performance of the following models was compared for each forecasting horizon ranging from one to six hours:

• Trivial inertial model;
• LR on the best feature set for LR selected using the algorithm (Fig. 1);
• GB on the best feature set for GB selected using the algorithm (Fig. 1);
• MLP on the best feature set for MLP selected from the sets obtained in the course of the algorithm (Fig. 1) for LR and GB using the procedure described above.
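The peer committee of five identically configured MLPs mentioned above can be sketched as follows; the hidden-layer size and the number of iterations are illustrative assumptions.

# Sketch of the committee of five identical MLPs that differ only in weight
# initialisation; their outputs are averaged. Hidden-layer size and iteration
# count are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def committee_predict(X_train, y_train, X_test, n_members: int = 5):
    preds = []
    for seed in range(n_members):
        mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
        mlp.fit(X_train, y_train)
        preds.append(mlp.predict(X_test))
    return np.mean(preds, axis=0)  # averaged forecast of the peer committee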
4 Results

To estimate the quality of the obtained predictive models, we used the root mean squared error (RMSE) and the coefficient of determination R², calculated on the test data set covering 2017 to 2021. To draw conclusions about the physical relationships between the Dst index and each of the input features, it is most interesting to analyze the minimal sets of input features obtained by the step-by-step feature removal algorithm described above, based on the LR and GB basic forecasting models with subsequent use of the MLP. In Figs. 2 and 3, the input features that were selected as significant on the basis of LR (Fig. 2) and on the basis of GB (Fig. 3) are ranked in descending order of importance. In both figures, the columns indicate the input physical quantities and the prediction horizon (from 1 to 6 h); the rows indicate the delay (embedding depth) of each feature relative to the moment of prediction (from 0 to 23). The features are ranked for each prediction horizon separately.
Fig. 2. The most significant input features (smaller number denotes higher significance) ranked with MLP by the results of application of the algorithm of sequential feature removal based on LR. The columns display the input physical quantities and specify the prediction horizon from 1 to 6 h. The rows correspond to the delay in hours (embedding depth) relative to the moment of forecasting.
Fig. 3. The same as Fig. 2 for selection based on gradient boosting.
It should be noted that the present study is a direct continuation of our preceding study [12]. The two main extensions of the present study are the novel way to use MLP over the feature sets selected by LR and GB, and specific ranking of the significant features in the order of their significance. This allowed us to make our conclusions more physically based.
Fig. 4. Statistical indicators (top – root mean squared error, RMSE, in nT; bottom – coefficient of determination, R²) of various models on the test set of data (2017–2021).
Comparison of Fig. 2 and Fig. 3 demonstrates that the sets of essential features selected are somewhat different; to complete the study, it would be necessary to perform the selection of significant features with the studied algorithm on the basis of the MLP itself. Among the most significant features are various delays of the predicted Dst index itself, as well as Bz (GSM) and SW density with small delays (from 0 to 3 h) relative to the moment of prediction. Figure 4 shows the results of comparing the statistical indicators of different models on the test set of data (2017–2021) – top, root mean squared error (RMSE) in nT; bottom, coefficient of determination (R2 ). Figure 4 shows that the best forecasting quality of all the considered models is provided by GB. This matches the results we obtained earlier in [10] without any selection
of significant features, where GB also demonstrated better results than MLP. However, to make the comparison of methods fully adequate, it is necessary for the selection of significant features each time to be carried out based on the same ML method that is used for prediction (in the current study, for MLPs this condition is not met). It can also be noted that both for LR and for GB, leaving only significant features in most cases does not lead to an improvement in the quality of the model. Nevertheless, in this way it is possible to obtain a model with the same level of statistical indicators with several times fewer input features. A similar conclusion has also been made from the dependence (not presented here) of the values of statistical indicators on the number of features in the current selection. The effect of feature selection for MLPs is better: even when the method used as a basic algorithm for feature selection is other than the MLP itself (GB or LR), it is possible to obtain models with quality higher than that of MLP models built on the full set of features (setting aside the reduction in the computational cost of such a model). The observed effect can be explained in the following way. Different ML methods take the input data into account in different ways, which directly affects the selection, and this is clearly visible from the results shown in Fig. 4 for the R² coefficient. MLP always uses all the input features; that is why, when redundant features are present in the set, MLP is most susceptible both to the effect of measure concentration and to overfitting. Some positive effect of lowering the input dimension can be observed for all forecast horizons: when using only input features selected with the help of LR, the coefficient of determination of the MLP models becomes greater than that of the MLP trained on the full feature set. Another situation is observed for LR and GB based models. LR contains many times fewer adjustable parameters than the MLP, and it turns out to be able to "turn off" insignificant features by setting small values of regression coefficients for them in the training process. The GB algorithm, in the process of building each decision tree included in the model, selects the most essential features explicitly. As a result, for LR and GB, input feature selection external to the algorithm has little or no effect on the results. The positive effect of the reduction of the input dimension that is observed for MLP suggests that when MLP is used as the base model, some set of significant features found during selection could provide even higher prediction quality. Despite the high computational cost of such numerical experiments, we plan to continue research in this direction to test this hypothesis.
5 Conclusions In this study, we consider an algorithm to determine the most effective machine learning model for predicting the geomagnetic index Dst. This is achieved by progressively removing input features using various machine learning techniques, including linear regression, gradient boosting, and multilayer perceptron type artificial neural network. Through our analysis, we draw the following key findings. • It is possible to decrease the number of inputs required for accurate prediction without compromising the quality of the predictions made by the models, by carefully selecting relevant input features.
• The adaptive methods select the most significant input features which, being judged by their physical meaning, match existing theories about the impact of various physical quantities on the disturbance of the Earth’s magnetosphere. The physical value of the y-component of the interplanetary magnetic field in the GSM system By was found to be the least significant one among the considered features (in fact, it was included in the initial list of input features to test the algorithm’s ability to detect that). The expected most significant values include the predicted Dst index across a wide range of delays, as well as the Bz component of the interplanetary magnetic field in the GSM system and proton density in the solar wind with minor delays in respect to the prediction time (0 to 3 h). • The gradient boosting algorithm demonstrated the highest prediction accuracy on the test data set when using either the full set of input features or a selection of significant input features determined by the same algorithm. However, as the forecast horizon increases, the prediction quality decreases significantly for all machine learning algorithms. • The findings from this study regarding the identification of important input features are applicable specifically to the prediction of the geomagnetic index Dst. However, the approach to feature selection used in this study is generalizable and can be applied to deal with other forecasting problems that involve multivariate time series. • To make this study logically complete, it is necessary to conduct a computationally heavy experiment. This experiment should focus on selecting important input features using the sequential feature removal algorithm considered in this study, but with a multilayer perceptron implementation. The outcomes of this experiment, which would involve predicting the Dst index using MLP on the set of features chosen with the help of MLP, should then be compared to the results obtained in this study using MLP on the set of features chosen by gradient boosting. Funding. This study has been performed at the expense of the Russian Science Foundation, grant no. 23-21-00237, https://rscf.ru/en/project/23-21-00237/.
References

1. McGranaghan, R.M., Camporeale, E., Georgoulis, M., Anastasiadis, A.: Space weather research in the digital age and across the full data lifecycle: introduction to the topical issue. J. Space Weather Space Clim. 11, 50 (2021). https://doi.org/10.1051/swsc/2021037
2. Qiu, Q., Fleeman, J.A., Ball, D.R.: Geomagnetic disturbance: a comprehensive approach by American electric power to address the impacts. IEEE Electr. Mag. 3(4), 22–33 (2015). https://doi.org/10.1109/MELE.2015.2480615
3. Sugiura, M.: Hourly values of equatorial Dst for the IGY. Ann. Int. Geophys. 35, 9–45 (1964)
4. Burton, R.K., McPherron, R.L., Russell, C.T.: An empirical relationship between interplanetary conditions and Dst. J. Geophys. Res. 80, 4204–4214 (1975)
5. Lindsay, G.M., Russell, C.T., Luhmann, J.G.: Predictability of Dst index based upon solar wind conditions monitored inside 1 AU. J. Geophys. Res. 104(A5), 10335–10244 (1999)
6. Barkhatov, N.A., Bellustin, N.S., Levitin, A.E., Sakharov, S.Y.: Comparison of efficiency of artificial neural networks for forecasting the geomagnetic activity index Dst. Radiophys. Quantum Electron. 43(5), 347–355 (2000). https://doi.org/10.1007/BF02677150
7. Revallo, M., Valach, F., Hejda, P., Bochníček, J.: A neural network Dst index model driven by input time histories of the solar wind – magnetosphere interaction. J. Atmos. Sol. Terr. Phys. 110–111, 9–14 (2014). https://doi.org/10.1016/j.jastp.2014.01.011
8. Lazzús, J.A., Vega, P., Rojas, P., Salfate, I.: Forecasting the Dst index using a swarm-optimized neural network. Space Weather 15, 1068–1089 (2017). https://doi.org/10.1002/2017SW001608
9. Efitorov, A.O., Myagkova, I.N., Shirokii, V.R., Dolenko, S.A.: The prediction of Dst index based on machine learning methods. Cosm. Res. 56(6), 434–441 (2018). https://doi.org/10.1134/S0010952518060035
10. Myagkova, I.N., Shirokii, V.R., Vladimirov, R.D., Barinov, O.G., Dolenko, S.A.: Prediction of the Dst geomagnetic index using adaptive methods. Russ. Meteorol. Hydrol. 46(3), 157–162 (2021). https://doi.org/10.3103/S1068373921030031
11. Zorich, V.A.: Multidimensional geometry, functions of very many variables, and probability. Theory Probab. Appl. 59(3), 481–493 (2015). https://doi.org/10.4213/tvp4578
12. Vladimirov, R.D., Shirokiy, V.R., Myagkova, I.N., Barinov, O.G., Dolenko, S.A.: Comparison of the efficiency of machine learning methods in studying the importance of input features in the problem of forecasting the Dst geomagnetic index. Geomagn. Aeron. 63, 161–171 (2023). https://doi.org/10.1134/S0016793222600795
Forest Damage Segmentation Using Machine Learning Methods on Satellite Images

Natalya S. Podoprigorova1(B), Grigory A. Savchenko1, Ksenia R. Rabcevich1, Anton I. Kanev1, Andrey V. Tarasov2, and Andrey N. Shikohov2

1 Bauman Moscow State Technical University (National Research University), Moscow, Russia
[email protected]
2 Perm State National Research University, Perm, Russia
Abstract. Semantic segmentation of remote sensing imagery is an important task today. The article describes an approach to forest damage recognition using deep learning methods and compares models based on the U-Net, MultiRes U-Net, Attention U-Net, ResNet50 U-Net and MobilNetv2 U-Net architectures. The data consist of pairs of multitemporal Sentinel-2 images covering different regions of the European territory of Russia and the Urals and contain objects of different types, including clearcuts (1937), shelterwood cuts (1559), selective cuts (393), forest roads (316), burnt areas (1138) and windblows (1448). The Dice coefficient, sensitivity, specificity and F-measure are used for evaluation. The experimental results confirm that the use of U-shaped architectures is a promising direction for further improvement of satellite image analysis methods in the field of ecology and forestry. The MobilNetv2 U-Net model shows the best overall accuracy of forest damage segmentation (Dice = 77,22%), while the Attention U-Net model determines the windblow class more accurately (Dice = 43,39%). The proposed approach demonstrates promising results in identifying forest damage areas, which can be useful for assessing the ecological situation in forested areas and for work on preserving forest resources.
Keywords: forest damage · land cover · Sentinel-2 data · convolutional neural networks · satellite imagery · semantic segmentation · U-Net · Attention U-Net · MultiRes U-Net · ResNet50 U-Net · MobilNetv2 U-Net
1 Introduction
Satellite imagery is often used to monitor the condition of forests. Such images make it possible to analyze large-scale changes in forest cover, monitor logging processes, assess the sustainability of forest ecosystems, and obtain a detailed understanding of forest cover, including in hard-to-reach and remote areas. All this helps to use forest resources more efficiently. The areas to be monitored are vast, so automation is required to perform these tasks. Segmentation of forest plantations based on satellite images is one of the important tasks in the fields of cartography, geoinformatics and ecology. In this paper, we study the semantic segmentation [1] of forest areas on multispectral satellite images. Previous studies have shown that modern deep learning methods [2]
cope with the task of image segmentation better than traditional machine learning methods [2], including statistical [3] and geometric [4]. Most often, the U-Net architecture [5–7] is used in the task of satellite image segmentation, and it shows good results. In this work, for the problem of forest damage segmentation on satellite images, we study U-Net with different encoders (ResNet50, MobilNetv2), as well as modifications of this architecture: MultiRes U-Net, Attention U-Net.
2 Dataset
The training set was assembled and marked up by the authors of [16]. A total of 6463 objects of different types are used for training, including clearcuts (1937), shelterwood cuts (1559), selective cuts (393), forest roads (316), burnt areas (1138) and windblows (1448). The data consist of pairs of multi-temporal Sentinel-2 images, the difference between which makes it possible to detect forest damage that occurred during the intervening period [8, 9]. The original satellite images are cut into fragments of 256 × 256 pixels and reduced to a resolution of 10 m with square pixels. Figure 1 shows an example of the data. The MSI sensor installed on the Sentinel-2 satellite takes pictures in 12 spectral channels, including visible spectrum channels as well as different infrared bands (NIR and SWIR), which are the most informative for assessing the state of vegetation. The resolution of the RGB channels and the near-infrared band is 10 m per pixel, and that of the remaining channels is 20 or 60 m per pixel.
3 Model
For the task of semantic segmentation of forest areas using satellite images, the following CNN architectures are compared: U-Net, Attention U-Net, MultiRes U-Net, ResNet50 U-Net, and MobilNetv2 U-Net.
3.1 U-Net
U-Net was originally developed for the interpretation of biomedical images [10]. The U-Net architecture was chosen for its ability to segment selective and passing cuttings as integral objects, as well as for its generally successful application to the recognition of forest cover disturbances. In addition, the U-Net architecture preserves the size of the input image without additional transformations, so the segmentation results can easily be combined with the initial data. The network architecture is symmetrical: it has an encoder that extracts spatial features from the image and a decoder that builds a segmentation map from the encoded features. The most original aspect of the U-Net architecture is the introduction of skip connections. At all four levels, the output of the convolutional layer before the encoder pooling operation is passed to the decoder. These feature maps are then combined with the output of the upsampling operation, and the combined feature map is propagated to subsequent layers. Skip connections allow the network to pass from the encoder to the decoder the spatial information that is lost during pooling. They also provide better propagation of the gradient throughout the network, which allows for faster training, and combine both low-level and high-level features.
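To make the role of the skip connections concrete, the following minimal PyTorch sketch shows a two-level U-Net-style encoder-decoder in which the saved encoder feature map is concatenated with the upsampled decoder features. The channel counts and depth are illustrative only and do not reproduce the exact configuration trained in this work.

```python
# Minimal sketch of a U-Net-style skip connection (illustrative channel counts).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=16, n_classes=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)           # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # features saved for the skip connection
        e2 = self.enc2(self.pool(e1))            # deeper, lower-resolution features
        d1 = self.up(e2)                         # upsample back to the e1 resolution
        d1 = self.dec1(torch.cat([d1, e1], 1))   # skip connection: concatenate encoder features
        return self.head(d1)                     # per-pixel logits, same spatial size as the input

mask_logits = TinyUNet()(torch.randn(1, 16, 256, 256))   # -> (1, 1, 256, 256)
```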
Fig. 1. An example of labeled data.
3.2 MultiRes U-Net
In many problems it is important to analyze objects at different scales. In our case, this could allow us to apply the existing model to images from another satellite system (not Sentinel-2) with a different image scale. The article [11] proposes an improvement of the U-Net architecture, MultiRes U-Net. To match features obtained from images at different scales, MultiRes U-Net includes 5 × 5 and 7 × 7 convolution operations in parallel with the 3 × 3 convolution operation available in U-Net. However, in a deep network, if two convolutional layers are present in series, the number of filters in the first one has a quadratic effect on memory [10]. So instead of keeping all three successive convolutional layers with the same number of filters, the number of filters is incrementally increased from the first layer to the third, to prevent the memory requirements of the earlier layers from propagating too far into the deeper part of the network. Residual linkage has also been added, owing to its effectiveness in segmentation, introducing 1 × 1 convolutional layers that may allow some additional spatial information to be captured. The authors of the network called this the "MultiRes block".
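A minimal sketch of such a MultiRes block follows, assuming the chained-3×3 formulation of [11] in which stacked 3 × 3 convolutions emulate the 5 × 5 and 7 × 7 receptive fields; the filter counts are illustrative and not those used in our experiments.

```python
# Sketch of a MultiRes-style block: three chained 3x3 convolutions, concatenated
# outputs, and a 1x1 residual connection. Filter counts are illustrative.
import torch
import torch.nn as nn

class MultiResBlock(nn.Module):
    def __init__(self, in_ch, f1=8, f2=17, f3=26):
        super().__init__()
        self.c3 = nn.Sequential(nn.Conv2d(in_ch, f1, 3, padding=1), nn.ReLU(inplace=True))
        self.c5 = nn.Sequential(nn.Conv2d(f1, f2, 3, padding=1), nn.ReLU(inplace=True))
        self.c7 = nn.Sequential(nn.Conv2d(f2, f3, 3, padding=1), nn.ReLU(inplace=True))
        out_ch = f1 + f2 + f3
        self.residual = nn.Conv2d(in_ch, out_ch, 1)   # 1x1 shortcut
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        a = self.c3(x)             # ~3x3 receptive field
        b = self.c5(a)             # ~5x5 receptive field (two stacked 3x3)
        c = self.c7(b)             # ~7x7 receptive field (three stacked 3x3)
        multi = torch.cat([a, b, c], dim=1)
        return self.act(multi + self.residual(x))

y = MultiResBlock(in_ch=16)(torch.randn(1, 16, 64, 64))   # -> (1, 51, 64, 64)
```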
The authors of the architecture also note that the features coming from the encoder are considered lower-level, since they are calculated at earlier layers of the network. In contrast, the decoder features are assumed to be of a much higher level, since they are computed at very deep layers of the network and undergo additional processing. A semantic gap may therefore arise between the two feature sets being combined. To reduce this mismatch, the encoder features are passed through an additional sequence of convolutional layers, which the architecture's authors call the "Res Path". In particular, 3 × 3 filters are used in the convolutional layers, and 1 × 1 filters accompany the residual connections. According to the article [11], MultiRes U-Net wins in performance and quality in the case of complex images that suffer from noise, distortion and a lack of clear boundaries, and its results are more reliable and stable.
3.3 Attention U-Net
The article [12] describes the use of the attention mechanism in the U-Net architecture. Typically, the convolutional layers of a CNN apply the same filter to every pixel of an image. However, it may not make sense to treat every part of the image the same regardless of its content; this is exactly what is needed to define the area of interest. To improve accuracy, modern segmentation frameworks [13–15] rely on additional prior object localization models to simplify the individual localization steps and the subsequent segmentation. However, the same goal can be achieved by integrating attention gates into a standard CNN model. This does not require training several models or a large number of additional model parameters. Attention mechanisms allow filters to be applied differently depending on the content and therefore make it easier to learn with fewer filters. In this way, weights can be effectively reduced, or features in certain areas of the image can be ignored.
3.4 ResNet50 U-Net
U-Net is an extension of a fully convolutional encoder-decoder network obtained by adding skip connections. The backbone is the architectural element that defines how the encoder network is built, which in turn determines how the decoder network should be built. ResNet50 and MobilNetv2 are used as backbones in this work. ResNet-50 is a convolutional neural network with a depth of 50 layers. ResNet50 consists of four convolutional blocks corresponding to the four consecutive stages of the encoder. Each block is implemented through a residual module, which learns the difference between the block's input and output.
3.5 MobilNetv2 U-Net
To obtain semantic information from the source image, the encoder uses the MobileNetV2 architecture for feature extraction. MobileNetV2 is an architecture optimized for mobile devices; it improves the performance of mobile models across multiple tasks and benchmarks and for different model sizes. The decoder uses a U-Net based
architecture and produces full-resolution segmentation maps. Owing to the skip connection at each stage, the decoder can be extended with additional upsampling modules or modifications to improve its performance.
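The attention gate underlying the Attention U-Net variant (Sect. 3.3) can be sketched as follows; the channel sizes are illustrative, and the gating and skip tensors are assumed to have been brought to the same spatial size beforehand.

```python
# Sketch of an additive attention gate: the decoder gating signal g weights the
# encoder skip features x before they are concatenated in the decoder.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, 1)    # project skip features
        self.wg = nn.Conv2d(g_ch, inter_ch, 1)    # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, 1)      # scalar attention coefficient per pixel

    def forward(self, x, g):
        alpha = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * alpha                          # suppress irrelevant regions of the skip

gate = AttentionGate(x_ch=32, g_ch=64, inter_ch=16)
skip = torch.randn(1, 32, 64, 64)
gating = torch.randn(1, 64, 64, 64)
weighted_skip = gate(skip, gating)                # -> (1, 32, 64, 64)
```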
4 Experiments
The data were divided into training and validation samples in a ratio of 17:3, respectively, and all damage types were reduced to a single class. As input data for training, 16 channels were used, consisting of the original image channels and their differences (old and new denote the channels of images obtained before and after the change, respectively), namely:
• Original image channels: B4old, B8old, B4new, B8new, B11old, B11new, B12old, B12new;
• Differences calculated from the multitemporal images: B4old – B4new, B4new – B4old, B8old – B8new, B8new – B8old, B12old – B12new, B12new – B12old, B11old – B11new, B11new – B11old.
The encoder and decoder parts of all models have four layers. The models were trained for 200 epochs with a learning rate of 1e-3 (after 15 epochs without improvement of the loss function, the learning rate is decreased by a factor of 10). The batch size was set to 8. Dice loss was used as the loss function due to its robustness to class imbalance. The adaptive moment estimation optimizer (Adam) was chosen as the numerical optimization algorithm. This efficient optimizer uses the first and second moments of the gradients to maintain a learning rate that improves performance when solving problems with sparse gradients. Training was completed after 200 epochs, and the model with the best Dice coefficient on the validation set was saved. The following metrics (1)–(4) are used to evaluate the results; TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

Dice = 2TP / (2TP + FP + FN)            (1)
Sensitivity = TP / (TP + FN)            (2)
Specificity = TN / (TN + FP)            (3)
F-measure = 5TP / (5TP + 4FN + FP)      (4)
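A sketch of the Dice loss and of the metrics (1)–(4), assuming binarized predictions; the tensor names are illustrative. Note that Eq. (4) corresponds to the F-measure with beta = 2.

```python
# Sketch of the Dice loss and the metrics in Eqs. (1)-(4).
import torch

def dice_loss(probs, target, eps=1e-6):
    # Soft Dice loss: 1 - Dice, robust to class imbalance
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def metrics(pred, target):
    tp = ((pred == 1) & (target == 1)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    tn = ((pred == 0) & (target == 0)).sum().item()
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f2 = 5 * tp / (5 * tp + 4 * fn + fp)   # F-measure with beta = 2, as in Eq. (4)
    return dice, sensitivity, specificity, f2

probs = torch.rand(8, 1, 256, 256)                       # network outputs after sigmoid
target = (torch.rand(8, 1, 256, 256) > 0.5).float()
loss = dice_loss(probs, target)
scores = metrics((probs > 0.5).int(), target.int())
```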
5 Results
Table 1 shows the results of semantic segmentation obtained using the U-Net, Attention U-Net, MultiRes U-Net, ResNet50 U-Net, and MobilNetv2 U-Net models (all damage types are considered as a single class), that is, the accuracies of the obtained models on the test set. Table 2 shows the Dice coefficient per class. Figure 2 compares the loss function and the Dice coefficient across epochs on the validation set, and Fig. 3 shows the same comparison on the training set. Figure 4 shows the results of forest damage segmentation on the test dataset.
Table 1. Model accuracies on the test dataset

Model             | Dice coefficient | Sensitivity | Specificity | F-Measure
U-Net             | 75,27%           | 76,65%      | 99,61%      | 75,28%
Attention U-Net   | 77,19%           | 81,13%      | 99,58%      | 77,19%
MultiRes U-Net    | 73,69%           | 68,23%      | 99,76%      | 73,69%
MobilNetv2 U-Net  | 77,22%           | 77,06%      | 99,68%      | 77,22%
ResNet50 U-Net    | 40,17%           | 41,53%      | 99,07%      | 40,17%
Table 2. Model accuracies per class on the test dataset (Dice coefficient)

Model             | Clearcutting | Shelterwood cutting | Forest road | Windblows | Burnt areas | Selective cutting
U-Net             | 66,61%       | 65,98%              | 2,14%       | 43,37%    | 94,41%      | 56,27%
Attention U-Net   | 66,73%       | 64,11%              | 3,39%       | 43,39%    | 94,78%      | 59,77%
MultiRes U-Net    | 67,69%       | 50,11%              | 1,53%       | 33,41%    | 95,63%      | 45,64%
MobilNetv2 U-Net  | 68,94%       | 65,08%              | 5,01%       | 34,27%    | 96,05%      | 62,29%
ResNet50 U-Net    | 34,06%       | 27,14%              | 4,05%       | 3,85%     | 87,55%      | 20,81%
Fig. 2. Loss function and Dice coefficient versus epoch on the validation dataset.
Fig. 3. Loss function and Dice coefficient versus epoch on the training dataset.
Fig. 4. Examples of forest damage detection on Sentinel-2 satellite images from the test dataset.
6 Conclusion
A study of forest damage segmentation was conducted on a set of Sentinel-2 satellite images prepared by the authors of paper [16]. Models based on the U-Net, Attention U-Net, MultiRes U-Net, ResNet50 U-Net and MobilNetv2 U-Net architectures were trained. They were compared using the Dice coefficient, and the MobilNetv2 U-Net model shows the best result (Dice = 77,22%). At the same time, the Attention U-Net model determines the windblow class more accurately (Dice = 43,39%). The forest road class causes the greatest difficulties for all models, and we will focus on improving the detection of forest roads in future studies. Our future research will also concentrate on the use of ensemble learning strategies. The results obtained confirm that the use of CNNs is a promising direction for further improvement of methods for analyzing satellite images in the field of ecology and forest management.
Acknowledgements. The study was funded by the Russian Science Foundation and the Perm Krai (project No. 22-27-20018).
References
1. Lateef, F., Ruichek, Y.: Survey on semantic segmentation using deep learning techniques. Neurocomputing 338, 321–348 (2019)
2. Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(7), 3523–3542 (2021)
3. Yi, L., Zhijun, G.: A review of segmentation method for MR image. In: 2010 International Conference on Image Analysis and Signal Processing, pp. 351–357. IEEE (2010)
4. Minaee, S., Fotouhi, M., Khalaj, B.H.: A geometric approach to fully automatic chromosome segmentation. In: 2014 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pp. 1–6. IEEE (2014)
5. Boston, T., Van Dijk, A., Larraondo, P.R., Thackway, R.: Comparing CNNs and random forests for Landsat image segmentation trained on a large proxy land cover dataset. Remote Sens. 14(14), 3396 (2022)
6. Ulmas, P., Liiv, I.: Segmentation of satellite imagery using U-Net models for land cover classification. arXiv preprint arXiv:2003.02899 (2020)
7. Avenash, R., Viswanath, P.: Semantic segmentation of satellite images using a modified CNN with hard-swish activation function. In: VISIGRAPP (4: VISAPP), pp. 413–420 (2019)
8. Tarasov, A.V.: Rapid mapping of private forest cover based on satellite data with a particularly high risk of temporal spread. The dissertation of the candidate of technical sciences. 25.00.33. Perm, 2021. 135 p.
9. Khovratovich, T.S., Bartalev, S.A., Kashnitsky, A.B.: A method for detecting forest changes based on a sub-pixel estimate of the projective cover of the tree canopy from multi-temporal satellite images. Modern Probl. Remote Sens. Earth Space 16(4), 102–110 (2019)
10. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
11. Ibtehaz, N., Rahman, M.S.: MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 121, 74–87 (2020)
12. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:180 (2018)
13. Khened, M., Kollerathu, V.A., Krishnamurthi, G.: Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Med. Image Anal. 51, 21–45 (2019)
14. Roth, H.R., et al.: Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Med. Image Anal. 45, 94–107 (2018)
15. Roth, H.R., et al.: Hierarchical 3D fully convolutional networks for multi-organ segmentation. arXiv preprint arXiv:1704.06382 (2017)
16. Tarasov, A.V., Shikhov, A.N., Shabalina, T.V.: Recognition of forest cover disturbances from Sentinel-2 satellite images using convolutional neural networks. Modern Probl. Remote Sens. Earth Space 18(3), 51 (2021)
Binding Affinity Prediction in Protein-Protein Complexes Using Convolutional Neural Network

Elizaveta A. Bogdanova1(B), Valery N. Novoseletsky1,2, and Konstantin V. Shaitan1

1 Lomonosov Moscow State University, Moscow, Russia
[email protected]
2 Shenzhen MSU-BIT University, Shenzhen, China
Abstract. Binding affinity is an important characteristic of protein-protein interactions, and its determination is significant for the development of a wide range of drugs and biotechnological preparations. This paper presents an algorithm based on convolutional neural networks that predicts the value of the dissociation constant for protein-protein complexes from their spatial structures, as well as a method for converting structural data into a format suitable for use in neural networks.
Keywords: Convolutional Neural Network · Bioinformatics · Affinity Prediction · Protein-Protein complexes
1 Introduction
Reliable information about the strength of protein interactions and their involvement in physiological and pathophysiological processes is of decisive importance for the development of therapeutic and diagnostic tools based on the action of protein-protein complexes. Experimental methods for determining the binding strength are the most accurate, but they have a number of limitations, as well as high labor intensity and cost. Computational methods can significantly reduce the set of potential interactions to a subset of the most probable ones, which then serves as a starting point for further laboratory experiments [1, 2]. Currently, a fairly large number of methods have been developed that predict the binding affinity in protein-protein and protein-ligand complexes. However, so far it has not been possible to identify a method that predicts with high accuracy for complexes of various nature. This may be due to several limitations: the ambiguity of experimental data, the lack of consideration of the conformational changes that occur during binding or of cofactors that may be required for binding, as well as the insufficient volume and variety of data [3]. More complex and non-linear algorithms for predicting the strength of binding in protein complexes are now being developed to take into account as many of the limitations and factors affecting binding affinity as possible. Among the most promising prediction methods currently are machine learning methods based on neural networks, which are discussed further below.
Most of the developments on this topic are aimed at studying protein-ligand complexes, and a high quality of prediction has been achieved for this task. In 2017, the Pafnucy protein-ligand binding prediction algorithm was implemented, based on deep convolutional neural networks and using the spatial structures of the complexes as training data [4]. On the test set, a Pearson correlation of 0,78 and RMSE = 1,42 were achieved between the predicted and experimentally determined values. In a 2018 work these metrics were improved to a Pearson correlation of 0,82 and RMSE = 1,27 [5]. In 2019, the DeepAtom algorithm, also based on deep convolutional neural networks, solved the same problem with a correlation value of 0,83 [6]. In addition, algorithms trained on other datasets have been released that provide a high quality of prediction [7]. As for the prediction of binding in protein-protein complexes, the situation here is much more complicated, since both molecules in the complex are high-molecular-weight and form many complex interactions. The implemented algorithms currently fall into two groups: those that perform binary classification by the presence of binding [8], and those that solve the regression problem by learning from amino acid sequences or spatial structures [9, 10]. In the first case, a sufficiently high quality of prediction was achieved (accuracy = 0,93), but the result is not informative enough; in the second case, the prediction accuracy is quite low (Pearson correlation = 0,44) for prediction from amino acid sequences [9]. Prediction using structures reaches a higher correlation value (0,5–0,6); however, only complexes consisting of two molecules are used [10]. Thus, there remains a sufficiently large field for the study of protein-protein complexes and for the creation of algorithms that predict the strength of binding between proteins with higher accuracy.
2 Problem Statement
The aim of this work is to develop an algorithm capable of predicting the strength of binding between proteins in complexes based on their spatial structures. Solving this problem involves a number of limitations and difficulties, which we address in this work. First, the structures of the complexes are extremely heterogeneous in the size and shape of the binding interface, which creates difficulties not encountered when working with protein-ligand complexes. Accordingly, it is necessary to choose the optimal cell size for localizing the binding area. Second, amino acids and their atoms have a wide range of characteristics that can be used as features of the complex; however, for the optimal operation of the predictive algorithm, it is necessary to reduce the feature space as much as possible, especially given the limited amount of data available for training. The third and most acute problem is the limited number of protein-protein complexes with a known spatial structure and dissociation constant, which most strongly affects the efficiency of training neural networks. To obtain a high-quality prediction, it is necessary to use data augmentation methods.
3 Data Collection and Preparation
The training dataset was assembled using protein databases containing information on the structure and characteristics of protein-protein complexes (Protein Data Bank (PDB) [11], PDBBind [12]). In addition to the structural files of the complexes, these databases contain characteristics of the strength of binding between proteins. The purpose of the algorithm is to predict the value of the dissociation constant (Kd; for convenience, we use pKd = –log(Kd)), which is inversely related to the strength of binding. The distribution of target values in the dataset we use (Fig. 1) shows that there are rather few complexes with pKd < 3 or pKd > 12, which is likely to lead to poor prediction quality in these ranges.
Fig. 1. Distribution of target values (pKd), in the training data set
Before the data are submitted to the algorithm, the files are converted into a machine-readable format. Preprocessing includes several stages.
3.1 Localization of the Binding Area
Based on the atomic coordinates, a 3D cell is constructed that bounds the binding region of the molecules in the complex (21 × 61 × 61 with a resolution of 1 Angstrom, oriented along the largest extent of the protein binding region); the atoms of molecules outside this cell are not taken into account (Fig. 2A). The creation of such a cell occurs in several stages. First, the atomic coordinates of the binding molecules were extracted from the structure files using the PDBParser library. For complexes containing more than two molecules, the molecules between which binding with a known constant occurs were selected manually (1913 structures). Next, the centers of mass of the molecules and the vector connecting them were calculated (Fig. 2B). To determine the center of the binding area and of the future cell, a separating plane was calculated using logistic regression. The algorithm was given the coordinates of the atoms and the label of the molecule (0 for the atoms of one molecule, 1 for the other), and training took
place. The weights and bias were then extracted from the trained model, giving the equation of the separating plane in the form W1x + W2y + W3z + β = 0, where W1, W2, W3 are the weights and β is the bias. The point of intersection of the separating plane and the vector connecting the centers of mass was taken as the center of the cell (Fig. 2C). To fix the position of the cell relative to the separating plane, three straight lines characterizing the position of the cell in the coordinate system were determined. To do this, we chose planes parallel to the separating one at a distance of 3 Å; among the atoms lying on these planes, we chose the two most distant ones, and a straight line was constructed through them, characterizing the longest extent of the interaction region. The projection of this straight line onto the separating plane was later used as the OX axis of the new basis, the normal to the plane as the OZ axis, and the vector orthogonal to both as the OY axis. As a result, the cell center was shifted to the origin of coordinates, and the basis was changed to the obtained vectors, so that the separating plane coincided with the XOY plane. The boundaries of the cell of size 61 × 61 × 21 were then determined: length, width, and height, respectively (Fig. 2D). The coordinates of the atoms inside the cell were translated into a three-dimensional array of identical size.
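A sketch of the separating-plane step, under the assumption that scikit-learn's logistic regression is used to fit the plane; the function and array names are hypothetical.

```python
# Fit a logistic regression on atom coordinates labelled by molecule, read off the
# plane W1*x + W2*y + W3*z + beta = 0, and take its intersection with the segment
# joining the two centers of mass as the cell center.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cell_center(coords_a, coords_b):
    X = np.vstack([coords_a, coords_b])                    # (N, 3) atom coordinates
    y = np.r_[np.zeros(len(coords_a)), np.ones(len(coords_b))]
    clf = LogisticRegression().fit(X, y)
    w, beta = clf.coef_[0], clf.intercept_[0]              # plane normal and offset

    com_a, com_b = coords_a.mean(axis=0), coords_b.mean(axis=0)
    d = com_b - com_a                                      # vector joining the centers of mass
    t = -(w @ com_a + beta) / (w @ d)                      # solve w.(com_a + t*d) + beta = 0
    return com_a + t * d, w / np.linalg.norm(w)            # cell center and plane normal (OZ axis)

center, normal = cell_center(np.random.randn(500, 3), np.random.randn(400, 3) + 5.0)
```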
Fig. 2. Cell construction stages: A) a complex with a manually selected binding area; B) atoms of binding molecules; C) a complex with a calculated separating plane; D) the resulting cell, within which atoms are selected for the array
3.2 Adding Channels
To encode the properties of the complex, channels reflecting possible interactions between the atoms of the two protein molecules, such as hydrogen bonds, hydrophobic interactions, stacking interactions, etc., were added to the resulting array. In total, 10 channels were added: channels 1–4 cover four combinations of atom types (one channel contains the hydrogen bond acceptors of one protein and the donors of the other, the second channel the reverse combination, and two more channels include acceptors and weak donors); channels 5–6 include positively and negatively charged atoms in two combinations; channel 7 – hydrophobic atoms of the first and second proteins; channel 8 – carbonyl oxygens of both proteins; channel 9 – carbonyl carbons of both proteins; channel 10 – aromatic atoms. In all channels, atoms of one protein were written as -1 and atoms of the other protein as 1. As a result, each complex was represented as a four-dimensional tensor 18 × 21 × 61 × 61. The training data set contained 2270 structures and the test dataset 128. For validation, 90 complexes were selected with different constants and without resolution restrictions. The distribution of target values in the validation set has the same shape as in the training set, so it fully reflects all possible complexes with different constants. For the test, protein complexes with a high structural resolution (less than 2.8 Å) that are absent from the training data were selected, since complexes of the same protein, even with mutations, can be very similar, which would lead to data leakage.
4 Construction and Optimization of the Algorithm
Given the format of the training data and the results of similar works described earlier, it was decided to build a predictive algorithm based on convolutional neural networks. The construction and training of the algorithm were carried out using the torch framework. Several models were tested, with different numbers of convolutional and fully connected layers. The final neural network contains:
• 3D convolutional layers: the optimal number turned out to be three, with a decreasing convolution kernel size and an increasing number of channels (32, 64 and 128). Since the training data focus on atoms capable of forming contacts, the decisive factor in choosing the kernel size was the maximum distance between binding atoms. Proceeding from this, the optimal kernel size of the first layer is 7, and of the subsequent layers 5 and 3, respectively. Each convolutional layer is followed by a MaxPooling layer, which halves the size of the tensor, and uses the ReLU activation function.
• Fully connected layers: three layers with a decreasing number of channels (1000, 200, 1). The ReLU activation function is used for the first two layers, and the last layer directly predicts the pKd value.
Thus, the algorithm takes a 4D tensor of size 10 × 21 × 61 × 61 as input, and at the output we get one number that reflects the value of pKd.
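A sketch of such a network in PyTorch; the padding scheme and the resulting flattened feature size are our assumptions for a runnable example, since they are not reported above.

```python
# Three Conv3d + MaxPool3d stages with kernel sizes 7, 5, 3 and 32/64/128 channels,
# followed by fully connected layers 1000-200-1 predicting (standardized) pKd.
import torch
import torch.nn as nn

class AffinityCNN(nn.Module):
    def __init__(self, in_ch=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, 32, 7, padding=3), nn.ReLU(inplace=True), nn.BatchNorm3d(32), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 5, padding=2), nn.ReLU(inplace=True), nn.BatchNorm3d(64), nn.MaxPool3d(2),
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm3d(128), nn.MaxPool3d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 2 * 7 * 7, 1000), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(1000, 200), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(200, 1),
        )

    def forward(self, x):                                 # x: (batch, 10, 21, 61, 61)
        return self.regressor(self.features(x))

pkd = AffinityCNN()(torch.randn(2, 10, 21, 61, 61))       # -> (2, 1)
```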
For normalization and regularization, Batch Normalization was used after the convolutional layers, and Dropout (with a dropout probability of 0,3) after the first two fully connected layers. AdamW was chosen as the optimizer because of its efficiency and the possibility of adding L2 regularization (the weight_decay parameter). As a result of hyperparameter selection, a learning rate of 0,0001 and weight_decay = 0,001 were used. The MSELoss loss function, suitable for the regression problem, was used to calculate the error. During optimization, the batch size, which affects the efficiency of Batch Normalization and the productivity of training, was also selected; based on the size of the training sample, batch size = 32 was chosen. Due to the limited amount of training data, data transformations were applied during training to improve the quality of prediction. Before being fed to the algorithm, the array was independently mirrored along each of the axes (x, y, z) with a probability of 0,5. This technique retains all the features important for prediction but prevents memorization of the locations of specific atoms. In addition, also with a probability of 0,5, the designations of the atoms of the two proteins were swapped (–1 was replaced by 1 and vice versa). These methods made it possible to delay the onset of overfitting. The target value was standardized by subtracting the mean and dividing by the variance, which is necessary to stabilize the values of the loss function.
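A sketch of the augmentation and optimization setup described above; AffinityCNN refers to the sketch given earlier, and the helper names are illustrative.

```python
# Random mirror flips along each spatial axis, random swap of the protein labels
# (-1 <-> 1), AdamW with weight decay, and MSE loss on the standardized pKd target.
import torch

def augment(x):
    # x: (batch, channels, 21, 61, 61)
    for dim in (2, 3, 4):                      # independent mirroring along z, y, x
        if torch.rand(1).item() < 0.5:
            x = torch.flip(x, dims=(dim,))
    if torch.rand(1).item() < 0.5:             # swap the designations of the two proteins
        x = -x
    return x

model = AffinityCNN()                          # the sketch defined above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)
loss_fn = torch.nn.MSELoss()

def train_step(batch, target):
    opt.zero_grad()
    loss = loss_fn(model(augment(batch)).squeeze(1), target)
    loss.backward()
    opt.step()
    return loss.item()
```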
5 Algorithm Testing and Analysis
After training the algorithm for 30 epochs, the best model was saved and tested on the test set. Pearson's correlation and RMSE, which reflects the average error in absolute values, were chosen as quality metrics. On the validation dataset the Pearson correlation is 0,63 and RMSE = 1,48; on the test set the Pearson correlation is 0,61 (p-value = 4e-14) and RMSE = 1,52. The largest error is typical for complexes with extreme pKd values, which is due to their underrepresentation in the training data set (Fig. 3). The obtained metric values surpass the existing methods for predicting the dissociation constant in protein-protein complexes based on the analysis of amino acid sequences [9] and are comparable in prediction quality to methods for complexes consisting of two molecules [10], which indicates the effectiveness of the predictive algorithm, given that it was also tested on complexes in which binding occurs between several molecules. With this predictive quality, the resulting model can be used in the early stages of drug development processes that focus on screening and optimizing protein/peptide binding agents for a given protein target. Even one mutation in a protein can change the binding constant by a factor of 100–1000 or more, and this method will reduce the number of experiments required, since molecules with a much lower affinity than necessary can be screened out in advance. An analysis was also made of the importance of the input channels, which in this task play the role of features. The algorithm was trained with L2 regularization, so the significant features can be assessed by examining the weight distributions associated with the convolutional filters of the first layer (Fig. 4).
Fig. 3. Results of predicting the pKd value on the test dataset
Fig. 4. Range of weights for each input channel (feature). Outliers are not shown
During training, weights tend to spread out and form wider ranges as weights with higher absolute values convey more information to the deeper layers of the network. Given the presence of L2 regularization, only the critical weights were likely to have such high absolute values.
The widest range is typical for channels containing strong acceptors and donors of hydrogen bonds and a channel with aromatic atoms; therefore, the model focuses most of its prediction on these features, which is consistent with the known patterns of molecular binding. Aromatic compounds form stacking interactions that can enhance the bond between proteins, and hydrogen bonds underlie the formation of all levels of protein organization, starting from the secondary structure, which makes this type of contacts one of the key ones in intermolecular interactions as well.
6 Conclusions
A method for predicting the dissociation constant in protein-protein complexes based on a convolutional neural network has been proposed and implemented. The algorithm makes it possible to evaluate the binding between proteins in complexes from their spatial structures with a sufficiently high quality for a heterogeneous set of test complexes. Further research will include the expansion of the training data set by molecular dynamics methods to improve the efficiency of prediction. The trained algorithm is also planned to be used to evaluate binding in experimentally obtained protein-protein structures.
Funding. The work was financially supported by the Non-commercial Foundation for the Advancement of Science and Education INTELLECT.
References
1. Shi, T.L., Li, Y.X., Cai, Y.D., Chou, K.C.: Computational methods for protein-protein interaction and their application. Current Protein Peptide Sci. 6(5), 443–449 (2005)
2. Novoseletsky, V.N., Volyntseva, A.D., Shaitan, K.V., Kirpichnikov, M.P., Feofanov, A.V.: Modeling of the binding of peptide blockers to voltage-gated potassium channels: approaches and evidence. Acta Naturae 8(2), 35–46 (2016)
3. Kastritis, P.L., Bonvin, A.M.: On the binding affinity of macromolecular interactions: daring to ask why proteins interact. J. R. Soc. Interface 10, 20120835 (2013)
4. Stepniewska-Dziubinska, M.M., Zielenkiewicz, P., Siedlecki, P.: Development and evaluation of a deep learning model for protein-ligand binding affinity prediction (2017)
5. Jiménez, J., Škalič, M., Martínez-Rosell, G., Fabritiis, G.D.: KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model. 58(2), 287–296 (2018)
6. Li, Y., Rezaei, M.A., Li, C., Li, X., Wu, D.O.: DeepAtom: a framework for protein-ligand binding affinity prediction. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 303–310 (2019)
7. Zhang, H., Liao, L., Saravanan, K.M., Yin, P., Wei, Y.: DeepBindRG: a deep learning based method for estimating effective protein–ligand affinity. PeerJ 7, e7362 (2019)
8. Asim, M.N., Ibrahim, M.A., Malik, M.I., Dengel, A., Ahmed, S.: ADH-PPI: an attention-based deep hybrid model for protein-protein interaction prediction. Iscience 25(10), 105169 (2022)
9. Abbasi, W.A., Yaseen, A., Hassan, F.U., et al.: ISLAND: in-silico proteins binding affinity prediction using sequence information. BioData Mining 13, 1–13 (2020)
10. Romero-Molina, S., et al.: PPI-Affinity: a web tool for the prediction and optimization of protein-peptide and protein-protein binding affinity. J. Proteome Res. 21, 1829–1841 (2022) 11. RCSB Protein Data Bank. https://www.rcsb.org/ 12. PDBbind-CN Database. http://www.pdbbind.org.cn/index.php
Domain Adaptation of Spacecraft Data in Neural Network Prediction of Geomagnetic Dst Index

Elvir Z. Karimov1,2(B), Vladimir R. Shirokiy2, Oleg G. Barinov2, and Irina N. Myagkova2

1 Physical Department, Lomonosov Moscow State University, Moscow, Russia
[email protected]
2 Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University, Moscow, Russia
Abstract. This study focuses on improving the neural network prediction of geomagnetic indices, in particular the Dst index, in a scenario where the input data are collected by two spacecraft (SC) with different data availability. One of the SC is approaching the end of its operational lifespan, while the other lacks sufficient data history for constructing a high-quality neural network prediction. To perform the transition between the data of the two SC effectively, domain adaptation methods are needed. The study evaluates and compares various data translation techniques and optimizes the parameters for each translated feature to minimize domain discrepancies. The findings highlight the enhancement of the forecast when domain adaptation methods are employed and relevant features are selected, surpassing the results obtained using untranslated data.
Keywords: time series prediction · domain adaptation · feature selection
1 Introduction
Forecasting space weather phenomena, such as geomagnetic disturbances, is an important practical task whose relevance will only increase [1, 2]. Strong geomagnetic disturbances can disrupt radio communications and the operation of electrical networks, power lines, or pipelines [3]. Special geomagnetic indices are used to describe the intensity of a geomagnetic disturbance. Currently, the Dst index (Disturbance Storm Time index) [4] is often used to analyze the state of the Earth's magnetosphere. The Dst index is an estimate of the axisymmetric component of the disturbed magnetic field relative to the geomagnetic dipole, and it is determined from measurements of the magnetic field at four ground-based near-equatorial stations. The magnitude of the magnetic field disturbance is determined at each station, and the Dst index is defined as the longitude-averaged perturbation brought to the equator. Since the Earth's magnetosphere is an open multicomponent dynamic system in which it is difficult to describe the ongoing processes, at the moment there are no universal physical models that allow prediction of geomagnetic disturbances. For this reason, we
use machine learning methods – artificial neural networks (ANN). The main source of geomagnetic disturbances are processes occurring in the Sun, which, in turn, cause variations in the parameters of the interplanetary magnetic field (IMF) and solar wind (SW) in the heliosphere [5]. Therefore, when predicting the Dst index, measurements of IMF and SW parameters obtained at the L1 Lagrange point are used as input data. The dynamics of such a system is described by a multidimensional time series (TS). The use of ANN suggests that the TS should be long, stationary or at least quasistationary, possibly obtained from a single source. This can be a problem if the data source is an experiment on board a spacecraft. Due to technical reasons (degradation and failure of the measuring equipment on board the spacecraft), it may be necessary to switch from the data of one spacecraft to the data of another one. Such a transition can reduce the quality of the forecast, since measuring instruments of different spacecraft may have different characteristics. To solve this problem, we use domain adaptation methods. Domain adaptation is the mapping of data from the source domain to the target domain. Using domain adaptation, it is possible to generate target domain data based on the source domain data. Within this study, we transfer data from the domain of the new spacecraft to the domain of the old spacecraft, thus allowing effective use of the data of the new spacecraft. This may make it possible to use the measurement results of the same physical quantities made by different spacecraft together.
2 Problem Statement
2.1 Used Parameters
As already mentioned, the state of the Earth's magnetosphere and the impact on it from the Sun are determined by a number of parameters, the main ones being the parameters of the SW and IMF. Since the Dst index is determined, as mentioned above, by the conditions in the interplanetary environment [5], the following parameters with a time resolution of one hour are used in this study to transfer data from one domain to another:
• IMF values – the z component (Bz) in the GSM system and the absolute value (|B|);
• Parameters of the SW plasma: SW speed and SW proton density.
Besides the values of the SW and IMF parameters at the current time, their history is also taken into account.
2.2 Spacecraft
Two spacecraft in halo orbits around the Lagrange point L1 of the Sun-Earth system are considered. This arrangement is convenient because the data depend mostly on time rather than on the spatial position of the spacecraft. Also, in such an orbit it is possible to ensure the long-term operation of the spacecraft without significant fuel consumption. At the moment, there are several spacecraft in halo orbits around the L1 point. For the current forecast, we use data from the ACE spacecraft (Advanced Composition Explorer)
[6]. It was launched in 1997, and a large amount of data has accumulated from it – about 220,000 hourly averages. Another spacecraft is DSCOVR (Deep Space Climate ObserVeR) [7], launched in 2015. To test and train the domain adaptation methods, we use data from the two spacecraft for their common period of operation, amounting to about 50,000 hourly averages.
2.3 Work Purpose
No later than 2026, the ACE spacecraft is planned to be decommissioned due to the depletion of its fuel reserves. Also, this spacecraft has serious problems with real-time data transmission: the online data have quite large gaps, which make online forecasting impossible. Thus, the task of switching to DSCOVR spacecraft data is relevant. The data from the DSCOVR spacecraft are not yet sufficient to train an equally high-quality ANN model for predicting geomagnetic disturbances. This is clearly seen from Fig. 1. The presented results have been obtained with two different ANN models, each trained on the data of the corresponding spacecraft and applied to independent data of the same spacecraft taken during the common time period (the test set data described below in the Results section).
Fig. 1. Comparison of the quality of the Dst-index forecast by models, trained on data from different spacecraft (RMSE on independent data).
Direct application of a neural network trained on data from ACE spacecraft to the data from DSCOVR spacecraft will not lead to an improvement in the result, since the data differ much from each other. However, domain adaptation may significantly reduce the difference (Fig. 2). Differences in measurements of various spacecraft can occur for various reasons, for example, differences in the operation of detectors, and their different locations. Also, the equipment of the old spacecraft could degrade during its stay in orbit.
Fig. 2. An example of the difference between the SW speed and density measurements of the ACE and DSCOVR spacecraft, and the results of domain adaptation.
3 Description of the Experiment
3.1 Data Preprocessing
Linear interpolation of gaps of up to 12 missing hourly values in a row was applied to all data; larger gaps were removed. Delay embedding of the TS was also used to account for the history of the parameters; its purpose is to include information about several preceding TS values in each pattern. To determine how far back the history of each parameter should be taken into account (the embedding depth), we use the autocorrelation function: once its value drops by a factor of e, we stop taking further history into account. The embedding depth values obtained for the parameters were: Bz – 3 h, |B| – 14 h, SW density – 10 h, SW speed – 55 h. For the SW speed, instead of the entire 55-h history, we use only those delays that correspond to the Fibonacci numbers (1, 2, 3, 5, 8, 13, 21, 34, 55 h).
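A sketch of this preprocessing step, assuming NumPy/pandas data structures; function names and the omni_data frame are hypothetical.

```python
# Estimate the embedding depth as the lag at which the autocorrelation function
# first drops by a factor of e, then build lagged copies of the series.
import numpy as np
import pandas as pd

def embedding_depth(series, max_lag=100):
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    acf0 = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        acf = np.dot(x[:-lag], x[lag:]) / acf0
        if acf < 1.0 / np.e:               # e-fold drop of the autocorrelation
            return lag
    return max_lag

def delay_embed(df, column, lags):
    out = pd.DataFrame({column: df[column]})
    for lag in lags:
        out[f"{column}_lag{lag}"] = df[column].shift(lag)
    return out

# SW speed: only Fibonacci-numbered delays are kept, as described above
fib_lags = [1, 2, 3, 5, 8, 13, 21, 34, 55]
# features = delay_embed(omni_data, "sw_speed", fib_lags)   # 'omni_data' is hypothetical
```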
3.2 Data Translation Methods
In this study, data were transferred from the DSCOVR domain to the ACE domain. The following machine learning algorithms were used to transform the parameters: Linear Regression (LR), a MultiLayer Perceptron (MLP) ANN, and Gradient Boosting (GB) over decision trees. Three data translation schemes were used to transfer data from one domain to the other. The first is "one to one", in which only the parameter being converted is translated into the corresponding parameter of the other domain. The second is "all to one", in which the values of all the parameters used in the original domain are supplied to the input of the converting algorithm in order to obtain the value of the target parameter in the new domain. The third is "feature selection", in which an optimal set of input parameters that allows the most efficient conversion between domains is determined for each target parameter. When using MLP, normalization was always applied to both the input and output data for all three variants of data translation. The CORAL (CORrelation ALignment) [8] domain adaptation method was also used, with only the "one to one" way of translating data.
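For reference, a minimal sketch of the CORAL transform [8]: the source-domain (DSCOVR) features are whitened and then re-colored with the target-domain (ACE) covariance. Array names are illustrative.

```python
import numpy as np

def _sym_power(a, p):
    # matrix power of a symmetric positive-definite matrix via eigendecomposition
    w, v = np.linalg.eigh(a)
    return (v * (w ** p)) @ v.T

def coral(source, target):
    # source, target: (n_samples, n_features) arrays over the same feature set
    cs = np.cov(source, rowvar=False) + np.eye(source.shape[1])   # regularized source covariance
    ct = np.cov(target, rowvar=False) + np.eye(target.shape[1])   # regularized target covariance
    return source @ _sym_power(cs, -0.5) @ _sym_power(ct, 0.5)    # whiten, then re-color

# translated_dscovr = coral(dscovr_features, ace_features)        # arrays are hypothetical
```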
4 Results
The data for translation were divided into training and test sets in the ratio of 85:15, respectively. The total set of TS data patterns recorded during the simultaneous operation of the two spacecraft, from 27.07.2016 to 07.01.2023, amounted to 50,600 hourly averages: the training set from 27.07.2016 to 24.02.2022 – 43,000 hourly averages, and the test set from 24.02.2022 to 07.01.2023 – 7,600 hourly averages. The statistical results of data translation from the DSCOVR domain to the ACE domain are shown in Fig. 3, and an example of data translation on a fragment of the TS is presented in Fig. 2. In some cases, the algorithm failed to reduce the deviation. This could happen due to a large number of input parameters that were not significant for the given transformation and only introduced extra error. It can be seen that the linear algorithm turned out to be the best one for Bz and |B|. This may indicate that the two spacecraft measure these quantities similarly, differing only by an offset and/or a change in scale. This is especially noticeable because the best result is given by the "one to one" mapping. Nonlinear algorithms turned out to be optimal for the SW parameters, which indicates the presence of some nonlinear dependence between the values measured by the different spacecraft. In this case, the "all to one" translation performs better than "one to one", which may indicate a strong relationship of these quantities with other system parameters. Note that for the SW density, we obtained the largest reduction in the distance between the domains. GB and MLP showed almost the same result for SW density, but the error of GB is slightly smaller. In the case of SW speed, gradient boosting turned out to be the best. Also, by a complete search over combinations of physical quantities, optimal sets of
Fig. 3. The result of converting each of the parameters from the DSCOVR domain to the ACE domain – root mean squared error between the translated and the target parameters. The rightmost column corresponds to the initial difference between the domains.
input parameters were obtained for each of the transformed features for all algorithms (LR/MLP/GB). The following parameters turned out to be significant:
• For Bz: only Bz for all algorithms;
• For |B|: |B| (LR) / Bz, |B| (MLP and GB);
• For SW density: |B|, SW density and speed (LR) / SW density and speed (MLP and GB);
• For SW speed: Bz, |B|, SW speed and density (LR) / SW density and speed (MLP) / Bz, SW density and speed (GB).
For each parameter, all the features corresponding to the various values of the embedding depth were either used or excluded together. It should also be stressed that the listed parameters are significant for data translation between the domains, not for prediction of the Dst index.
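A sketch of such an exhaustive subset search for one translated parameter, shown here with linear regression as the converting algorithm; function and variable names are illustrative, and a quantity is toggled together with all of its delayed copies, as described above.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def best_feature_subset(X_train, y_train, X_val, y_val, feature_groups):
    # feature_groups: dict name -> list of column indices (a quantity and all its delays)
    best_rmse, best_subset = np.inf, None
    names = list(feature_groups)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            cols = sum((feature_groups[n] for n in subset), [])
            model = LinearRegression().fit(X_train[:, cols], y_train)
            rmse = mean_squared_error(y_val, model.predict(X_val[:, cols])) ** 0.5
            if rmse < best_rmse:
                best_rmse, best_subset = rmse, subset
    return best_subset, best_rmse
```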
The selection of significant parameters further reduces the differences between data in different domains. In the case when the selection of significant parameters shows the same result as the “one to one” case, it means that there is only one significant parameter left – the translated parameter itself. To test the effectiveness of the domain adaptation, the entire test set was converted from the DSCOVR domain to the ACE domain, each parameter converted by the method optimal for it. Next, MLP neural networks were applied to the transformed data arrays to predict the value of the Dst index one hour ahead (five networks with identical parameters, differing by initialization of weights). The networks were trained on a full array of data from the ACE spacecraft (all available data since October 1997 were used, except for the data included in the test set). Figure 4 shows the results of applying networks to 4 variants of test data obtained in different ways within the same time interval: untransformed data from the ACE and DSCOVR spacecraft, data obtained by adapting from the DSCOVR domain to the ACE domain (DSCOVR → ACE) by the optimal method for each parameter, as described above, and the CORAL method.
Fig. 4. The result of the Dst index forecast for different data variants on the test set.
It can be seen that the data from the training domain (ACE) yield the lowest error. When the same networks are applied to data from the DSCOVR domain, the expected increase of error and variance occurs. The domain adaptation methods were able to reduce the difference between the data and improve the quality of the Dst index forecast. The CORAL algorithm proved to be worse than the optimal methods determined in this study, slightly improving the quality of the forecast without reducing the variance.
5 Conclusions
The use of domain adaptation made it possible to improve the quality of the Dst index forecast. The described methods helped to reduce the prediction error and its variance on independent data when data from another domain are used.
The problem of transferring forecasting systems from the data of one spacecraft to the data of another spacecraft measuring the same quantities can be partially solved by converting data from the domain of the "new" spacecraft to the domain of the "old" one. The efficiency of such a transformation will increase as longer data arrays with parallel measurements from both spacecraft become available. This technique can also be used for other pairs of spacecraft where a similar problem arises, for example, the spacecraft of the GOES series.
Funding. This study has been performed at the expense of the Russian Science Foundation, grant no. 23-21-00237, https://rscf.ru/en/project/23-21-00237/.
References
1. Lazutin, L.L.: Global and polar magnetic storms. MSU (2012). (in Russian)
2. McGranaghan, R.M., Camporeale, E., Georgoulis, M., Anastasiadis, A.: Space weather research in the digital age and across the full data lifecycle: introduction to the topical issue. J. Space Weather Space Clim. 11, 50 (2021). https://doi.org/10.1051/swsc/2021037
3. Belakhovsky, V.B., Pilipenko, V.A., Sakharov, Y.A., Selivanov, V.N.: Growth of geomagnetic-induced currents during coronal mass ejection and corotating solar wind streams of geomagnetic storms in 2021. News of the RAS. Phys. Ser. 87(2), 271–277 (2023). EDN: AITPFO (in Russian). https://doi.org/10.31857/S0367676522700478
4. Geomagnetic Data Service. https://wdc.kugi.kyoto-u.ac.jp/dstdir/
5. Akasofu, S.-I., Chapman, S.S.: Solar-Terrestrial Physics, p. 889. Clarendon Press, Oxford (1972)
6. Real-Time Solar Wind Data. https://www.nasa.gov/ace/
7. DSCOVR Space Weather Data Portal. https://solarsystem.nasa.gov/missions/DSCOVR/indepth/
8. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: AAAI (2016)
LQR Approach to Aircraft Control Based on the Adaptive Critic Design

Maxim I. Chulin, Yury V. Tiumentsev, and Ruslan A. Zarubin(B)

Moscow Aviation Institute, National Research University, Moscow, Russia
{vip.chulin,rs2018h}@mail.ru
Abstract. Motion control of modern and advanced aircraft has to be provided under conditions of considerable and diverse uncertainty in the values of their parameters and characteristics, flight regimes, and environmental influences. The aircraft control system must be able to adapt to these changes by promptly adjusting the control laws used. The tools of adaptive control theory allow us to satisfy this requirement. In this case, it is very desirable not only to provide the created system with the property of adaptivity, but also to do so in an optimal way. An effective way of implementing this kind of adaptivity concept is an approach based on machine learning methods and tools, including technologies based on reinforcement learning. One of the approaches to the synthesis of control laws for dynamical systems that is widely used at present is the LQR (Linear Quadratic Regulator) technique. A significant limitation of this approach is the lack of adaptivity in the resulting control law, which prevents it from being used under conditions of incomplete and inaccurate knowledge of the properties of the control object and the environment in which it operates. To overcome this limitation, it was proposed to modify the standard variant of LQR based on approximate dynamic programming, a special case of which is the Adaptive Critic Design (ACD). For the ACD-LQR combination, the problem of longitudinal angular motion control of a maneuverable aircraft is solved. The results obtained demonstrate the capabilities of this approach to controlling the motion of an aircraft under uncertainty conditions.
Keywords: aircraft · motion control · linear quadratic regulator · machine learning · reinforcement learning · approximate dynamic programming · adaptive critic design
1 Introduction
Motion control of modern and advanced aircraft has to be provided under conditions of considerable and diverse uncertainty in the values of their parameters and characteristics, flight regimes, and environmental influences. In addition, a variety of abnormal situations, such as equipment failures and structural damage, may arise during flight. The control system of the aircraft must be able to
adapt to these changes by promptly adjusting the parameters and/or structure of the control laws used. The adaptive control theory [1,2] makes it possible to meet this requirement. In this case, it is highly desirable not only to endow the created system with the property of adaptivity, but also to do it in the best possible way in a given sense, that is, optimally. One of the effective ways to implement this kind of adaptivity concept is the approach based on machine learning methods and tools, including technologies based on reinforcement learning (RL) [3] and artificial neural networks (ANN) [4]. In the general case, the aircraft as a control object belongs to the class of continuous-time nonlinear dynamical systems. Reinforcement learning combined with artificial neural network techniques has proven to be a quite effective working tool for the synthesis of optimal and adaptive control laws for objects of this class [5,6]. One of the methods for developing optimal control laws for nonlinear systems is dynamic programming, which is also a mathematical basis for reinforcement learning as applied to the class of problems under consideration. However, it is well known that this approach suffers from the so-called "curse of dimensionality", due to which it requires a large amount of memory and computing resources, which makes it practically unsuitable for solving real-world problems. The attempt to eliminate this serious drawback led to the development of Approximate Dynamic Programming (ADP) [7-12]. The proposed ADP approach relies significantly on the ability of neural networks to approximate nonlinear functions [4]. Based on the results obtained in solving this problem, the ACD approach, which is based on the concept of an adaptive critic and exists in a large number of varieties [13-15], has been introduced and is actively developed up to the present time. An important special case of such a nonlinear control scheme based on the ACD approach is the ACD-LQR-NDI scheme, which is a generalization of the traditional LQR scheme [16], that is, the scheme for optimal control of a linear dynamical system by a quadratic criterion. In this generalization, the linearization of the source nonlinear system is not based on the Taylor expansion approach, which is tied to a single mode of dynamical system operation. In order to cover the whole area of such modes, one has to apply Gain Scheduling (GS) [17,18], that is, to synthesize a set of LQR controllers and organize switching between them depending on the current flight mode. The alternative is based on the NDI (Nonlinear Dynamic Inversion) algorithm [19,20], which performs exact linearization over the entire domain of operating modes. The second difference is that, instead of the standard LQR synthesis based on traditional dynamic programming methods, an ACD algorithm tuned by machine learning techniques is implemented. In this variant there is no need for the GS approach, because NDI gives a linear model suitable for use over the whole area of nonlinear system operating modes. One possible example of the application of the ACD-LQR approach to applied problems is the control of the longitudinal angular motion of an aircraft under uncertainty due to incomplete and inaccurate knowledge of the
properties of the control object and the conditions under which it operates. The results of numerical experiments given below demonstrate the effectiveness of the ACD-LQR approach and allow us to evaluate its potential in solving real-world problems.
2 Dynamic Programming and Curse of Dimensionality

2.1 Hamilton-Jacobi-Bellman Equation for Continuous-Time Systems
Consider the problem of optimal control of a continuous-time nonlinear dynamical system:

$$\dot{x}(t) = F[x(t), u(t), t], \quad t \ge t_0,\; t \in \mathbb{R}^1, \qquad (1)$$

where x ∈ Rⁿ is the state vector, u ∈ Rᵐ is the control vector, and F(·) is a nonlinear function specifying the dynamics of the system. The efficiency of the system (1) is described by the following relation:

$$J(x(t)) = \int_t^{\infty} U(x(\tau), u(\tau))\,d\tau, \qquad (2)$$

where U is the utility function. The optimal transition cost for the system (1)-(2),

$$J^*(x_0) = \min_{u} J(x_0, u(t)), \qquad (3)$$

satisfies the HJB equation

$$-\frac{\partial J^*(x(t))}{\partial t} = \min_{u \in U}\left\{ U(x(t), u(t), t) + \left(\frac{\partial J^*(x(t))}{\partial x(t)}\right)^{T} F(x(t), u(t), t) \right\} = U(x(t), u^*(t), t) + \left(\frac{\partial J^*(x(t))}{\partial x(t)}\right)^{T} F(x(t), u^*(t), t). \qquad (4)$$

2.2 Curse of Dimensionality in Dynamic Programming

The Hamilton-Jacobi-Bellman equation (4) implements the Bellman optimality principle for continuous-time systems. If the system is linear and the criterion of its efficiency is quadratic, then we come to the standard LQR problem, the solution of which is reduced to the solution of the Riccati equation, for which there are well-developed algorithms. If the system is nonlinear or the criterion is nonquadratic, the problem of synthesizing a control law becomes sharply more complicated. Here we have to solve the HJB equation (4), which is a nonlinear partial differential equation. A similar situation arises in the case of a discrete-time nonlinear system, when it becomes necessary to solve a nonlinear difference equation of the form

$$J^*(x(k)) = \min_{u(k)}\left\{ U(x(k), u(k)) + \gamma J^*(x(k+1)) \right\}. \qquad (5)$$
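To make the role of Eq. (5) and the associated "curse of dimensionality" concrete, the following sketch (ours, not from the paper) applies plain value iteration to Eq. (5) on a small finite state-action grid; the dynamics, the utility U and the discount factor are arbitrary placeholders. The tabulated cost-to-go J grows with the number of grid points, which is exactly the cost that ADP avoids by approximating J with a neural network.

```python
# Illustrative sketch: value iteration for the discrete Bellman equation (5)
# on a small finite state/action grid with placeholder dynamics and utility.
import numpy as np

n_states, n_actions, gamma = 50, 5, 0.95
rng = np.random.default_rng(0)

# next_state[x, u] and U[x, u] play the roles of x(k+1) = F(x(k), u(k)) and U(x(k), u(k))
next_state = rng.integers(0, n_states, size=(n_states, n_actions))
U = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

J = np.zeros(n_states)                                    # approximation of J*(x)
for _ in range(1000):
    J_new = np.min(U + gamma * J[next_state], axis=1)     # Eq. (5)
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new

policy = np.argmin(U + gamma * J[next_state], axis=1)     # greedy u*(x)
```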
3 The Standard LQR Problem and Its Limitations
The standard LQR problem, for which there are well-developed algorithms for solution, is formulated as follows. The control object is described by a linear differential equation of the form:

$$\dot{x} + ax = bu, \qquad (6)$$

where a, b are matrices of object coefficients of dimension (n × n) and (n × m), respectively; they generally depend on time t (i.e. the object in the general case is non-stationary); x, u are the n-dimensional vector of states and the m-dimensional vector of control, respectively. The control optimality criterion for this object is a quadratic functional that must be minimized:

$$J(x, u) = \underbrace{\tfrac{1}{2}\, x^{T}(t_2)\,\rho\, x(t_2)}_{\text{terminal part}} + \underbrace{\tfrac{1}{2}\int_{t_1}^{t_2} x^{T}\beta\, x\, dt}_{\text{state penalty}} + \underbrace{\tfrac{1}{2}\int_{t_1}^{t_2} u^{T} k^{-2} u\, dt}_{\text{control penalty}}. \qquad (7)$$

The functional J(x, u) allows us to quantify the degree of deviation of the real motion from the specified, for example, program motion. In fact, these are requirements for the quality of transients, as well as for the consumption of control resources. In this functional ρ, β are given coefficient matrices of dimension (n × n); k² is a given nonsingular (m × m) matrix, and k⁻² is its inverse. In the case of a linear nonstationary control object (6) and a quadratic functional (7), the optimal control that minimizes this functional is defined by the expression

$$u = -k^2 b^{T} A x, \qquad (8)$$

where A is the solution to the matrix Riccati differential equation

$$\dot{A} - Aa - a^{T}A - A b k^2 b^{T} A = -\beta \qquad (9)$$

under the boundary condition A(t₂) = ρ. The standard LQR problem is characterized by the following limitations:
1. The linearized model of the control object used to synthesize the control law in the standard LQR problem is "tied" to some fixed mode of operation of this object.
2. The control law in the standard LQR problem is not adaptive, that is, it cannot be adjusted online.
The first of these limitations can be overcome by using Gain Scheduling (GS). A more radical option is to use feedback linearization to linearize the nonlinear control object. This option, in the form of Nonlinear Dynamic Inversion (NDI), allows an exact (rather than approximate, as in the case of a Taylor series expansion) linearization of the object for the whole range of its modes of operation.
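For reference, the standard (non-adaptive) LQR synthesis described above can be carried out in a few lines for the time-invariant, infinite-horizon case. The sketch below uses SciPy's Riccati solver with the sign convention of Eq. (6); the matrices a, b and the weights are purely illustrative placeholders, not data from the paper.

```python
# Minimal sketch of standard LQR synthesis, assuming time-invariant a, b.
import numpy as np
from scipy.linalg import solve_continuous_are

a = np.array([[1.417, -1.0],
              [-2.86, 1.183]])          # placeholder object coefficients
b = np.array([[0.0], [-3.157]])         # placeholder control effectiveness
beta = np.eye(2)                        # state penalty matrix beta in Eq. (7)
k2 = np.array([[1.0]])                  # k^2, so the control penalty weight is k^(-2)

A_sys, B_sys = -a, b                    # rewrite Eq. (6) as dx/dt = A_sys x + B_sys u
A_ric = solve_continuous_are(A_sys, B_sys, beta, np.linalg.inv(k2))
K = k2 @ b.T @ A_ric                    # so that u = -K x, cf. Eq. (8)
print("feedback gain K =", K)
```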
The second limitation cannot be bypassed within the standard LQR problem. It means that under conditions of significant uncertainties (for example, due to failures and damages) the control law will lose its performance without the possibility of its operative recovery. Thus, the control laws constructed using traditional linear methods (including standard LQR) are not adaptive, which prevents their use in situations with high levels of uncertainty. This uncertainty is due to the incomplete and inaccurate knowledge of the properties of the control object (nonlinear, in the general case) and the environment in which it operates, as well as to possible failures and damages that change the dynamic properties of this object. Consequently, there is a need for methods that are better able to cope with nonlinear systems and can provide control that remains workable in the presence of the above uncertainties. One possible approach to implementing adaptive and optimal control of nonlinear systems under uncertainty is reinforcement learning, which is, along with artificial neural networks, one of the areas of machine learning. This field is currently under active development, and the best effect is achieved when RL methods are used in close interaction with ANN methods.
4 Reinforcement Learning and Approximate Dynamic Programming

4.1 Adaptive Critic Design in Approximate Dynamic Programming
As we know [3], the general scheme of reinforcement learning for some RL-system S_RL can be represented in the form shown in Fig. 1a. For this system, we introduce the notion of a policy π, defined as the mapping π : S → A. Let at some point in time t the system S_RL (the agent, in RL terminology) be in the state s_t ∈ S. It perceives a reward signal r_t and takes an action a_t ∈ A determined by the policy π, that is, a_t = π(s_t). As a result, S_RL moves to some next state s_{t+1} = F(s_t, a_t), obtaining a reward signal r_{t+1} = r(s_t, a_t, s_{t+1}) ∈ R. Figure 1b shows a variant of the general RL scheme as applied to the problem of controlling dynamical systems. The mathematical basis of reinforcement learning as applied to control systems is the Bellman optimality principle and dynamic programming based on it. For this reason, the aforementioned "curse of dimensionality" has been a serious obstacle to solving real-world applied problems. An approach to overcoming these difficulties was suggested by Paul Werbos [21], who formulated the idea of approximate dynamic programming (ADP) and showed some variants of realization of this idea. According to this idea, an element called the "critic" is introduced into the system, which approximates the optimality criterion of the dynamic programming problem. On this basis, approximate solutions of the HJB equation are found, and feedforward neural networks are involved. In some cases this approach is also called Adaptive Dynamic Programming, referring to the adaptivity of the neural networks used to implement this approach. A part of the ADP approach is a class of methods called Adaptive Critic Design (ACD) that are successfully used to form adaptive optimal control laws
Fig. 1. Generalized reinforcement learning scheme
Fig. 2. General structure of the ACD-algorithm for adaptive control of dynamical systems
for various types of dynamical systems, including aircraft. The general scheme of the system implementing the ACD approach is shown in Fig. 2. In Fig. 2 the following designations are taken. The control object is described by a non-linear differential equation of the form:

$$\dot{x} = f(x(t)) + g(x(t))\,u(t), \qquad (10)$$
where x = (x1 , . . . , xn ) ∈ X are states and u = (u1 , . . . , um ) ∈ U are controls of the considered system.
The criterion of control efficiency is defined as a functional of the following form:

$$J(x(t), u(t)) = \int_t^{\infty} F(x(\tau), u(\tau))\,d\tau. \qquad (11)$$

In this criterion, F(x, u) is an estimate of the utility of the control u in the system state x, which in the problems of the class under consideration is defined as

$$F(x(t), u(t)) = Q(x(t)) + u^{T}(t)\,R\,u(t). \qquad (12)$$

Given (11) and (12), the control goal implemented by the ACD algorithm is to obtain the optimal adaptive control law u* ∈ U with feedback minimizing the criterion J(x, u), that is,

$$J^*(x) = \min_{u \in U} \int_t^{\infty} F(x(\tau), u(\tau))\,d\tau. \qquad (13)$$
As noted above, the optimization tool for control u is Bellman’s optimality principle. Accordingly, the components of the ACD-scheme shown in Fig. 2 implement the following functions: the critic for the moment of time t gives an estimate J(x, u) of the efficiency of the current variant of the control law; the agent implements the current control law and corrects it according to the estimate of the value of criterion J(x, u) obtained from the critic; the control object is the considered dynamic system with regard to influences on it by the environment. In most cases, the critic, the control law and the control object model are implemented as multilayer neural networks.
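Although the paper implements the critic and the control law as multilayer neural networks trained according to the ACD scheme in Fig. 2, no code is listed there. The following is a minimal, generic sketch of such a critic/actor loop written only for illustration: the plant drift f, the constant input matrix B, the utility weights, the network sizes and the discount factor are placeholder assumptions, and the updates are a simple temporal-difference variant rather than the authors' exact algorithm.

```python
# Schematic ACD-style loop: critic approximates the cost-to-go J(x), the actor
# (control law) is adjusted to decrease the critic's estimate. All values are placeholders.
import torch
import torch.nn as nn

n_x, n_u, dt, gamma = 2, 1, 0.02, 0.99
B = torch.tensor([[0.0], [-3.157]])           # constant g(x) assumed for this sketch

def f(x):                                     # placeholder drift of the plant, Eq. (10)
    return torch.stack([x[:, 1], -2.86 * x[:, 0] - 1.183 * x[:, 1]], dim=1)

def step(x, u):                               # one Euler step of the plant
    return x + dt * (f(x) + u @ B.T)

critic = nn.Sequential(nn.Linear(n_x, 32), nn.Tanh(), nn.Linear(32, 1))
actor = nn.Sequential(nn.Linear(n_x, 32), nn.Tanh(), nn.Linear(32, n_u))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
R = 0.1 * torch.eye(n_u)

def utility(x, u):                            # F(x, u) = Q(x) + u^T R u, cf. Eq. (12)
    return (x ** 2).sum(1, keepdim=True) + ((u @ R) * u).sum(1, keepdim=True)

for it in range(2000):
    x = 4.0 * torch.rand(64, n_x) - 2.0       # states sampled from the operating region

    # Critic update: J(x) ~ utility*dt + gamma*J(x') (temporal-difference target).
    with torch.no_grad():
        u = actor(x)
        target = utility(x, u) * dt + gamma * critic(step(x, u))
    loss_c = ((critic(x) - target) ** 2).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Actor update: adjust the control law to decrease the estimated cost-to-go.
    u = actor(x)
    loss_a = (utility(x, u) * dt + gamma * critic(step(x, u))).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```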
5 ACD-LQR Approach to Aircraft Control
As noted above, the LQR approach is a well-established tool for the formation of optimal control laws for dynamical systems based on the Bellman optimality principle. However, a significant limitation of this approach is the lack of adaptivity in the produced control law, which prevents its use in conditions of incomplete and inaccurate knowledge of the properties of the control object and the environment in which it operates. To overcome this limitation, it was proposed [22-25] to modify the standard version of the LQR approach using the ACD method, the related equations for which are given above. In this case, instead of solving the Riccati equation (9), required in the standard LQR, the so-called ACD-LQR agent is introduced, which can adjust to the changing properties of the control object, that is, it provides the adaptivity of the formed system. The generation of the ACD agent, i.e. the dynamical system control law, is carried out in accordance with the principles stated above. According to these principles, the agent seeks to minimize the cost of controlling the system without prior knowledge of the matrices a and b in (6). The control law u(t) in terms of the RL approach is defined as the policy π determined by the state x(t), i.e. π(x(t)) = u(t). The goal is to find the optimal policy, that is, one that
minimizes costs and maximizes reward. The required Q-function corresponding to the cost function J in the LQR method is given as follows:

$$Q(x(t), u(t)) = E\left[\, \sum_{t'=t}^{N} \gamma^{t'} r_{t'+1} \;\middle|\; x_0 = x(t),\ u_0 = u(t) \right]. \qquad (14)$$
It should be noted that the Q-function in the continuous-time form, expressed in the form of an integral, corresponds to the problem in question, but in practical applications, a discretized approximation of this function, shown in (14), is usually used.
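How such an LQR agent can be tuned from data alone, without prior knowledge of the matrices a and b, can be illustrated with the classical Q-learning formulation for LQR in the spirit of [22]. The sketch below is our own illustration, not the authors' code: the plant matrices are used only to simulate transitions, the agent estimates a quadratic Q-function of the kind (14) by least squares and improves its feedback gain from it; all numerical values are placeholders.

```python
# Q-learning / policy iteration for a discrete-time LQR agent (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.97, 0.02], [0.05, 0.96]])    # "unknown" plant, placeholder values
B = np.array([[0.0], [-0.06]])
Qw, Rw, gamma = np.eye(2), np.eye(1), 0.99
n, m = 2, 1

def quad_basis(z):
    """Upper-triangular products of z, so that z^T H z = theta . quad_basis(z)."""
    outer = np.outer(z, z)
    iu = np.triu_indices(len(z))
    scale = np.where(iu[0] == iu[1], 1.0, 2.0)      # off-diagonal products appear twice
    return scale * outer[iu]

K = np.zeros((m, n))                                # initial (non-optimal) feedback gain
for it in range(10):                                # policy iteration
    Phi, cost = [], []
    for _ in range(300):                            # exploratory transitions
        x = rng.normal(size=n)
        u = -K @ x + 0.5 * rng.normal(size=m)       # exploration noise
        x1 = A @ x + B @ u                          # the agent only sees (x, u, x1, cost)
        z, z1 = np.concatenate([x, u]), np.concatenate([x1, -K @ x1])
        Phi.append(quad_basis(z) - gamma * quad_basis(z1))
        cost.append(x @ Qw @ x + u @ Rw @ u)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(cost), rcond=None)

    H = np.zeros((n + m, n + m))                    # rebuild the symmetric matrix of Q(x, u)
    H[np.triu_indices(n + m)] = theta
    H = H + np.triu(H, 1).T
    K = np.linalg.solve(H[n:, n:], H[n:, :n])       # policy improvement: K = H_uu^{-1} H_ux
print("learned feedback gain K =", K)
```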
Fig. 3. The LQR-controller performs a desired pitch angle of 5 deg; flight mode h = 1000 m, M = 0.44
As an example of application of the ACD-LQR approach, in comparison with the standard LQR, let us consider the problem of longitudinal angular motion control of a maneuverable aircraft. As a control object, for which the control using the ACD-LQR agent was formed, we chose the F-16 maneuverable aircraft, quite often used for testing various kinds of flight control algorithms. The short-period longitudinal motion of the aircraft is considered here. This kind of motion is described traditionally by means of a system of ordinary differential equations (ODE) which can be written
Fig. 4. Trimming angle of attack stabilization for a nonlinear system with LQR Agent; flight mode h = 1000 m, M = 0.44
for example in the form [27,28]:

$$\begin{aligned} \dot{\theta} &= q - \frac{\bar{q}S}{mV}\,C_L(\alpha, q, \varphi) + \frac{g}{V},\\ \dot{q} &= \frac{\bar{q}Sc}{J_y}\,C_m(\alpha, q, \varphi),\\ T^2\ddot{\varphi} &= -2T\zeta\dot{\varphi} - \varphi + \varphi_{act}, \end{aligned} \qquad (15)$$

where θ is pitch angle, deg; q is pitch angular velocity, deg/sec; α is angle of attack, deg; ϕ is deflection angle of elevator, deg; CL is lift coefficient; Cm is pitching moment coefficient; m is mass of aircraft, kg; V is airspeed, m/sec; q̄ = ρV²/2 is airplane dynamic pressure; ρ is mass air density, kg/m³; g is acceleration of gravity, m/sec²; S is wing area of aircraft, m²; c is mean aerodynamic chord, m; Jy is pitching moment inertia, kg·m². Dimensionless coefficients CL and Cm are nonlinear functions of angle of attack; T, ζ are time constant and relative damping factor for the elevator actuator; ϕact is command signal value for the elevator actuator, limited by ±25°. In the model (15), variables θ, q, ϕ and ϕ̇ are aircraft states, variable ϕact is aircraft control.

Fig. 5. Pitch step reference signal for linear system with LQR Agent; flight mode h = 1000 m, M = 0.44

Fig. 6. Pitch step reference signal for nonlinear system with LQR Agent; flight mode h = 1000 m, M = 0.44

Fig. 7. Response of an aircraft with LQR Agent to a step change in the desired pitch angle at different values of the time constant Tact for the elevator actuator; flight mode h = 1000 m, M = 0.44

The simulation results below have been obtained for the F-16 maneuverable aircraft as a control object. The values of parameters and characteristics of this aircraft required for the simulation are taken from [26]. All experiments were performed for the flight mode characterized by altitude h = 1000 m and Mach number M = 0.44. Figure 3 shows the response of the control object when transitioning to a specified pitch angle of 5 deg using the LQR controller. For the other results below, the LQR Agent is used as the controller. Figure 4 shows the stabilization processes of the trimming angle of attack for the nonlinear system. These processes provide the transition of the aircraft from various initial angles of attack to the horizontal flight mode. Figures 5 and 6 demonstrate the aircraft's performance of a stepped pitch reference signal. The results of the LQR Agent capability evaluation under uncertainty are shown in Fig. 7. Here we simulate inaccurate knowledge of the parameter Tact, which is the time constant characterizing the rate of operation for the elevator actuator. A change in the value of this parameter can also be interpreted as a situation where, due to a failure, the actuator speed has changed, which may be critical in terms of flight safety. All of these simulation results show that the LQR Agent used as a controller for the pitch control problem is in general quite satisfactory. What draws attention is an insufficiently vigorous response to control inputs (transition time about 4 sec in all cases considered for small pitch angular velocities), which, generally speaking, is not typical for a maneuverable airplane. This situation can be changed by appropriate selection of weight coefficients in the criteria (7) and (12).
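For readers who wish to reproduce the qualitative behavior of a model of the form (15), the following minimal simulation sketch may be helpful. It is not the authors' code: the aerodynamic coefficients CL and Cm are replaced by rough linear placeholders instead of the F-16 tables of [26], α is crudely identified with θ for this level-flight illustration, and all numerical parameters are assumptions.

```python
# Euler integration of a short-period model of the form (15), with actuator saturation.
import numpy as np

m, V, S, c, Jy, g = 9300.0, 150.0, 27.9, 3.45, 75674.0, 9.81   # placeholder values
rho = 1.112                                                     # air density near h = 1000 m
qbar = 0.5 * rho * V**2
T, zeta, phi_limit = 0.05, 0.7, np.deg2rad(25.0)

def CL(alpha, q, phi):       # placeholder linear aerodynamics (not the F-16 tables)
    return 3.6 * alpha + 0.4 * phi

def Cm(alpha, q, phi):
    return -0.4 * alpha - 0.3 * q * c / (2 * V) - 0.6 * phi

def derivatives(x, phi_cmd):
    theta, q, phi, phi_dot = x
    alpha = theta                          # simplification used only in this sketch
    theta_dot = q - qbar * S / (m * V) * CL(alpha, q, phi) + g / V
    q_dot = qbar * S * c / Jy * Cm(alpha, q, phi)
    phi_ddot = (-2 * T * zeta * phi_dot - phi + np.clip(phi_cmd, -phi_limit, phi_limit)) / T**2
    return np.array([theta_dot, q_dot, phi_dot, phi_ddot])

dt, x = 0.001, np.zeros(4)                 # states in radians here
for k in range(int(5.0 / dt)):             # 5 s of flight with a constant elevator command
    x = x + dt * derivatives(x, phi_cmd=np.deg2rad(-2.0))   # Euler step
print("theta[deg] =", np.rad2deg(x[0]), "q[deg/s] =", np.rad2deg(x[1]))
```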
6 Conclusions
An important applied problem, which is to control the motion of aircraft under conditions of incomplete and inaccurate knowledge of the properties of the control object and the environment in which it operates, is poorly solvable by traditional methods. At the same time, as follows from the results presented above, these methods can be significantly enhanced by combining them with machine learning techniques. One such option is ACD-LQR, which combines the capabilities of LQR methods with reinforcement learning and neural network techniques. The presented simulation results allow us to evaluate the capabilities of this approach for solving problems of the considered type. Further development of this approach involves a modification of LQR based on the method of nonlinear dynamic inversion.

Funding. The paper was prepared under the Program for the Development of the World-Class Research Center "Supersonic" in 2020-2025, funded by the Russian Ministry of Science and Higher Education (Agreement dated April 20, 2022, No. 075-15-2022-309).
References 1. Tao, G.: Adaptive Control Design and Analysis. Wiley, Hoboken (2003) 2. Astolfi, A., Karagiannis, D., Ortega, R.: Nonlinear and Adaptive Control with Applications. Springer, Berlin (2008). https://doi.org/10.1007/978-1-84800-066-7 3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn., p. 548. The MIT Press, Cambridge (2018) 4. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Upper Saddle River (2006)
5. Kamalapurkar, R., Walters, P., Rosenfeld, J., Dixon, W.: Reinforcement Learning for Optimal Feedback Control. Springer, Cham (2018). https://doi.org/10.1007/ 978-3-319-78384-0 6. Wei, Q., Song, R., Li, B., Lin, X.: Self-Learning Optimal Control of Nonlinear Systems: Adaptive Dynamic Programming Approach. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-4080-1 7. Powell, W.B.: Approximate Dynamic Programming: Solving the Curse of Dimensionality, 2nd edn. Wiley, Hoboken (2011) 8. Lewis, F.L., Liu, D. (eds.): Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, Hoboken (2013) 9. Liu, D., Xue, S., Zhao, B., Luo, B., Wei, Q.: Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans. Syst. Man Cybern. 51(1), 142–160 (2021) 10. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ. Syst. Mag. 9(3), 32–50 (2009) 11. Liu, D., Wei, Q., Wang, D., Yang, X., Li, H.: Adaptive Dynamic Programming with Applications in Optimal Control. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-50815-3 12. Song, R., Wei, Q., Li, Q.: Adaptive Dynamic Programming: Single and Multiple Controllers. Springer, Singapore (2019) 13. Ferrari, S., Stengel, R.F.: Online adaptive critic flight control. J. Guidance Control Dyn. 27(5), 777–786 (2004) 14. Wang, D., He, H., Liu, D.: Adaptive critic nonlinear robust control: a survey. IEEE Trans. Cybern. 47(10), 1–22 (2017) 15. Wang, D., Mu, C.: Adaptive Critic Control with Robust Stabilization for Uncertain Nonlinear Systems. Springer, Singapore (2019). https://doi.org/10.1007/978-98113-1253-3 16. Lewis, F.L., Vrabie, D.L., Syrmos, V.L.: Optimal Control, 3rd edn., p. 550. Wiley, Hoboken (2012) 17. Rugh, W.J., Shamma, J.S.: Research on gain scheduling: survey paper. Automatica 36(10), 1401–1425 (2000) 18. Leith, D.J., Leithead, W.E.: Survey of gain scheduling analysis and design. Int. J. Control 73(11), 1001–1025 (2000) 19. Enns, D., Bugajski, D., Hendrick, R., Stein, G.: Dynamic inversion: an evolving methodology for flight control design. Int. J. Control 59(1), 71–91 (1994) 20. Looye, G.: Design of robust autopilot control laws with nonlinear dynamic inversion. Automatisierungstechnik 49(12), 523–531 (2001) 21. Werbos, P.J.: A menu of designs for reinforcement learning over time. In: Miller, W.T., Sutton, R.S., Werbos, P.J. (eds.) Neural Networks for Control, pp. 67–95. MIT Press, Cambridge (1990) 22. Bradtke, S.J.: Reinforcement learning applied to linear quadratic regulation. In: Proceedings of NIPS-1992, pp. 295–302 (1992) 23. Faradonbeh, M.K.S., Tewari, A., Michailidis, G.: On adaptive linear-quadratic regulators. Automatica 117, 1–13 (2020) 24. Lee, J.Y., Park, J.B., Choi, Y.H.: Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems. Automatica 48, 2850–2859 (2012) 25. Lee, J.Y., Park, J.B., Choi, Y.H.: On integral generalized policy iteration for continuous-time linear quadratic regulations. Automatica 50, 475–489 (2014)
26. Nguyen, L.T., Ogburn, M.E., Gilbert, W.P., Kibler, K.S., Brown, P.W., Deal, P.L.: Simulator study of stall/post-stall characteristics of a fighter airplane with relaxed longitudinal static stability. NASA TP-1538, 223 p (1979) 27. Stevens, B.L., Lewis, F.L., Johnson, E.N.: Aircraft Control and Simulation: Dynamics, Controls Design and Autonomous Systems, 3rd edn., p. 764. Wiley, Hoboken (2016) 28. Cook, M.V.: Flight Dynamics Principles, 2nd edn., p. 496. Elsevier, Amsterdam (2007)
SNAC Approach to Aircraft Motion Control Yury V. Tiumentsev and Roman A. Tshay(B) Moscow Aviation Institute, National Research University, Moscow, Russia [email protected]
Abstract. Dynamic programming, as a basis for reinforcement learning, is a well-known method for synthesizing control laws for dynamical systems. However, this approach suffers from the so-called "curse of dimensionality," due to which it is of limited suitability for solving real-world problems, in particular, aircraft flight control problems. One way to overcome this drawback is to use Approximate Dynamic Programming (ADP), which combines reinforcement learning and feed-forward neural networks. As one of the ADP variants, the ACD (Adaptive Critic Design) approach has been introduced and actively continues to develop. It is based on the concept of an adaptive critic and exists in a large number of varieties. One of these varieties is called SNAC (Single Network Adaptive Critic). The specific feature of the SNAC approach is the use of a single neural network of a critic to be trained, which reduces the consumption of resources for forming the required control law. In this case the absence of an actor network as a part of the SNAC system is compensated by using a special kind of optimization algorithm. This paper analyzes the essence of the SNAC approach, as well as the features of its implementation as applied to the control of a nonlinear dynamical system under uncertainty conditions. The capabilities of this approach are demonstrated on the example of an applied problem, in which the control law of longitudinal angular motion of a passenger aircraft is synthesized. The results allow us to evaluate the effectiveness of the SNAC approach, as well as to identify its elements that require further research and development.
Keywords: aircraft · motion control · machine learning · reinforcement learning · approximate dynamic programming · adaptive critic design · SNAC approach
1 Introduction
Dynamic programming, as the basis of Reinforcement Learning (RL), is a well-known technique for synthesizing control laws of dynamical systems (DS). However, this approach is known to suffer from the so-called "curse of dimensionality"
[1], due to which it requires very large amounts of memory and computational resources. For this reason, it is of limited suitability for solving real-world applied problems, in particular, aircraft flight control problems. To overcome this disadvantage, an approach known as Approximate Dynamic Programming (ADP) [2-7] was proposed. The abbreviation ADP in some cases is also interpreted as Adaptive Dynamic Programming, referring to the possibilities of online adjustment of the obtained solutions. It should be noted that the ADP approach relies heavily on the properties of feedforward neural networks as a tool for approximating nonlinear functions [8]. As one of the variants of ADP, the ACD approach was introduced and is actively under development up to the present time. It is based on the concept of an adaptive critic and exists in a large number of varieties [9-12]. One such variety is called SNAC (Single Network Adaptive Critic) [13-17]. A prominent feature of the SNAC approach is the use of a single neural network to be trained, in contrast to more traditional ACD schemes, which include two networks (actor and critic) and therefore consume more resources for the formation of the required control law. The absence of the second network in a SNAC system is compensated by the use of a special kind of optimization algorithm [21]. An attractive feature of the SNAC approach, as one of the varieties of ACD technology, is that it can be applied together with the NDI approach [19,20], that is, the method of nonlinear dynamic inversion, which provides an exact linearization of the source nonlinear system in its whole range of operational modes. One of the weaknesses of the NDI approach is that, in order to obtain a control law with its help, it is necessary to have at our disposal an accurate model of the dynamical system as a control object. If, for any reason, the dynamic properties of the control object have changed, the control law synthesized for the object in its nominal state ceases to be adequate to the considered task. That is, the NDI scheme is not adaptive, which requires taking actions to compensate for this disadvantage. The use of the NDI scheme together with the SNAC algorithm is one of the possible options for solving this problem. In this variant the SNAC algorithm provides adjustment of the NDI controller parameters to compensate for changed dynamical system properties. In the next sections, the essence of the SNAC approach is considered, as well as the features of its implementation as applied to the control of a nonlinear dynamic system under uncertainty conditions. The capabilities of this approach are demonstrated on the example of a real-world applied problem, in which the control law of longitudinal angular motion of a passenger aircraft is synthesized. The results obtained allow us to evaluate the effectiveness of the SNAC approach, as well as to identify its elements that require further research and development.
2 SNAC Scheme of ADP Approach for Control of Aircraft Motion

2.1 Control Object Dynamics for ADP Approach
The nonlinear dynamics of the system under consideration is described, as usual in the case of ADP schemes, by a model in the state space:

$$\dot{x} = f(x) + g(x)u, \quad y = h(x), \qquad (1)$$

where x ∈ Rⁿ, u ∈ Rᵐ and y ∈ Rᵖ are states, controls and outputs of the dynamical system, respectively. Without loss of generality in the considered problem, we can assume that the regulation goal is to obtain a zero output of the DS with minimum control resource consumption within an infinite time horizon. This can be achieved by minimizing the following cost function:

$$J = \int_0^{\infty} \left( y^{T} Q y + u^{T} R u \right) dt = \int_0^{\infty} \left( h^{T}(x)\,Q\,h(x) + u^{T} R u \right) dt. \qquad (2)$$
Here Q ≥ 0 is a positive semi-definite matrix of weights of the states of the dynamical system and R > 0 is a positive-definite matrix of weights of its controls.

2.2 Necessary Optimality Conditions for the ADP Approach
When we work with ADP schemes in solving applied problems, the model with discrete time is more useful. It can be obtained, for example, from the model (1) using the Euler scheme with the sampling step Δt:

$$x_{k+1} = x_k + \Delta t\,[\,f(x_k) + g(x_k)\,u_k\,] = F_k(x_k, u_k), \quad y_k = h(x_k), \qquad (3)$$

where the index k = 1, 2, . . . denotes the value of the corresponding variable at time t_k. The discrete representation of the criterion (2) can be obtained in a similar way:

$$J = \frac{1}{2}\sum_{k=1}^{\infty} \left( y_k^{T} Q y_k + u_k^{T} R u_k \right)\Delta t. \qquad (4)$$
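A minimal sketch of the discretization step (3) in code form is given below; the functions f, g, h and all numbers are placeholders introduced only for illustration, not data from the paper.

```python
# One Euler step of the discrete-time model (3): x_{k+1} = x_k + dt*(f(x_k) + g(x_k) u_k).
import numpy as np

def f(x):  return np.array([x[1], -2.86 * x[0] - 1.183 * x[1]])   # placeholder drift
def g(x):  return np.array([[0.0], [-3.157]])                     # placeholder input map
def h(x):  return x[:1]                                           # placeholder output map

def F_k(x, u, dt=0.01):
    """One step of the discrete-time model (3)."""
    return x + dt * (f(x) + g(x) @ u)

x, u = np.array([0.1, 0.0]), np.array([0.05])
for k in range(100):
    x = F_k(x, u)
print("x_100 =", x, "y_100 =", h(x))
```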
In the terms introduced in [22], a utility function can be derived from Eq. (3), which for time t_k has the form:

$$\psi_k = \frac{1}{2}\left( y_k^{T} Q y_k + u_k^{T} R u_k \right)\Delta t = \frac{1}{2}\left( h^{T}(x_k)\,Q\,h(x_k) + u_k^{T} R u_k \right)\Delta t. \qquad (5)$$
Then, according to [22], we can write the equation defining the desired optimal control:

$$\frac{\partial \psi_k}{\partial u_k} + \left( \frac{\partial F_k}{\partial u_k} \right)^{T} \lambda_{k+1} = 0. \qquad (6)$$

By substituting F_k from (3) and ψ_k from (5), this equation can be simplified:

$$R u_k + [g(x_k)]^{T} \lambda_{k+1} = 0. \qquad (7)$$

Assuming that the matrix R is positive definite (that is, R⁻¹ exists), we can express u_k from (7):

$$u_k = -R^{-1} [g(x_k)]^{T} \lambda_{k+1}. \qquad (8)$$

It follows from expression (8) that to calculate the control u_k at time t_k we need the value of the costate variable λ_{k+1} for time t_{k+1}, which is calculated by inverse recursion, starting from the final time moment. In the ADP scheme under consideration, the costate equation has the following recursive form:

$$\lambda_k = \frac{\partial \psi_k}{\partial x_k} + \left( \frac{\partial F_k}{\partial x_k} \right)^{T} \lambda_{k+1}. \qquad (9)$$

Substituting F_k from (3) and ψ_k from (5) into (9), we obtain a simpler form of this relation:

$$\lambda_k = \Delta t \left( \frac{\partial h(x_k)}{\partial x_k} \right)^{T} Q\,h(x_k) + \left( \frac{\partial F_k}{\partial x_k} \right)^{T} \lambda_{k+1}. \qquad (10)$$
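For the reader's convenience, the substitution leading from (6) to (7) can be written out explicitly. Since ψ_k = ½(h^T(x_k)Qh(x_k) + u_k^T R u_k)Δt and F_k = x_k + Δt[f(x_k) + g(x_k)u_k], we have

$$\frac{\partial \psi_k}{\partial u_k} = \Delta t\, R\, u_k, \qquad \frac{\partial F_k}{\partial u_k} = \Delta t\, g(x_k),$$

so that (6) becomes

$$\Delta t\, R\, u_k + \Delta t\, [g(x_k)]^{T} \lambda_{k+1} = 0,$$

and dividing by Δt gives exactly (7).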
2.3 SNAC Control Law Synthesis Scheme
If we compare the above expression for optimal control (8) with the optimal control law for a linear time-invariant system of the form x_{k+1} = A x_k + B u_k [21] with a quadratic criterion, we can conclude that the critic network implements the expression

$$\lambda^{t}_{k+1} = (I + S B R^{-1} B^{T})^{-1} S A x_k, \qquad (11)$$

where S is the solution of the algebraic Riccati equation. The SNAC method extends (11) to the case of nonlinear systems, using the approximation properties of feed-forward neural networks [8]. In the SNAC-based control synthesis, the relationship between the system state x_k and the costate λ_{k+1} is reproduced using a feed-forward neural network (the critic, in the terms of the ACD approach). The structural diagram shown in Fig. 1 illustrates the procedure for obtaining data to train the critic. The partially trained critic takes the state vector x_k as input and produces a costate vector λ_{k+1} as output. The resulting vector λ_{k+1} is then substituted into the relationship for computing the optimal control (7) to obtain the optimal control vector u_k. The state vector x_k and control vector u_k are then substituted into the state and costate equations given by the relations (3) and (10), respectively, to obtain the target values for the costate vector λ^t_{k+1}. The critic is then trained using data on the states x_k, the costates λ_{k+1}, and the solutions u_k obtained by solving the optimal control problem. The trained network predicts the optimal value λ_{k+1} for a given x_k. This value λ_{k+1} serves as the basis for calculating the current control u_k for time t_k.
Fig. 1. Scheme of ANN-critic training in the SNAC-approach to dynamical system control
2.4 Generation of States to Initialize the Network of the Critic
The generation of states is an important part of the training procedure for SNAC. For this purpose, it is required to define the regions S_i from which states can take values. In terms of the source object model (3), this condition takes the form S_i = {x_k | x_k ∈ X}. The critic's network must be trained with this condition in mind. It must be formulated in such a way that the elements of the sets S_i, i = 1, . . . , N_s cover as many points as possible of the state space X in which the object trajectories are expected to lie. This is a non-trivial problem, but for the class of problems under consideration it is typical to synthesize controllers that transfer an object from some arbitrary state to the origin of coordinates. For this situation, a special incremental method was proposed in [18]. According to this method, for index values i = 1, 2, . . . it is required to determine the set S_i such that S_i = {x_k : ‖x_k‖∞ ≤ c_i}, where c_i is a positive constant. At first, a small initial value c_1 is fixed and the networks computing the variables λ_{k+1} and λ_{k+2} are trained using the states from S_1. After the convergence of the computational process is achieved, a new value c_2 is chosen such that the condition c_2 > c_1 is satisfied. Training of the network continues until, for some i = N_s, the region S_{N_s} covers the required region of values of the state variables for the dynamical system under consideration.
2.5 Initializing the Network of the Critic
In order to train the critic's network, a set of initial values of the weights of this network must be generated. They are formed in a process which can be interpreted as the pre-training of the considered network, performed for the linearized version
of the system (3). This process is based on solving the linear-quadratic regulator (LQR) synthesis problem [21,24]. Using this approach, the initial weights of the network are chosen so that the resulting control system operates almost identically to the LQR controller designed with respect to a given operating point [23]. Let us represent the linearized dynamical system in a discrete-time form:

$$x_{k+1} = A_k x_k + B_k u_k, \qquad (12)$$

where A_k and B_k are generally time dependent. The discrete-time quadratic cost function is defined as follows:

$$J_i = \frac{1}{2}\, x_N^{T} S_N x_N + \frac{1}{2}\sum_{k=i}^{N-1} \left( x_k^{T} Q_k x_k + u_k^{T} R_k u_k \right). \qquad (13)$$
The function J_i in (13) is defined on the time interval [i, N]. This is necessary to determine the sequence of control actions that minimizes J_i. Then the Hamiltonian H_k at time k is defined as:

$$H_k = \frac{1}{2}\left( x_k^{T} Q_k x_k + u_k^{T} R_k u_k \right) + \lambda_{k+1}^{T}\left( A_k x_k + B_k u_k \right). \qquad (14)$$
According to the stationarity condition for the optimal control,

$$\frac{\partial H_k}{\partial u_k} = R_k u_k + B_k^{T} \lambda_{k+1} = 0, \qquad (15)$$
hence the optimal control is defined as

$$u_k = -R_k^{-1} B_k^{T} \lambda_{k+1}, \qquad (16)$$
and the costates are defined by the relation

$$\lambda_k = \frac{\partial H_k}{\partial x_k} = Q_k x_k + A_k^{T} \lambda_{k+1}. \qquad (17)$$
The critic network learns to approximate the relationship between x_k and λ_{k+1}. After estimating λ_{k+1}, the value u_k can be calculated using the control law (16). Substituting (16) into Eq. (12) leads to a relation of the form:

$$x_{k+1} = A_k x_k - B_k R_k^{-1} B_k^{T} \lambda_{k+1}. \qquad (18)$$
According to the sweep method [24], we can assume that the state x_k and the costate λ_k are linearly related for all k ≤ N:

$$\lambda_k = S_k x_k. \qquad (19)$$
Using Eqs. (12), (16) and (19), we can obtain

$$x_{k+1} = (I + B_k R_k^{-1} B_k^{T} S_{k+1})^{-1} A_k x_k, \qquad (20)$$
where S_k ∈ R^{n×n} is the Sylvester matrix obtained from the solution of the Riccati equation, defined by the relation

$$S_k = (A_k - B_k K_k)^{T} S_{k+1} (A_k - B_k K_k) + K_k^{T} R_k K_k + Q_k. \qquad (21)$$
To synthesize the controller we use the expression (21) for S_k and the gain matrix K_k obtained by solving the Riccati equation. The following relations must be satisfied:

$$u_k = -K_k x_k, \qquad (22)$$

where the time-dependent matrix K_k is defined as follows:

$$K_k = (B_k^{T} S_{k+1} B_k + R_k)^{-1} B_k^{T} S_{k+1} A_k. \qquad (23)$$
Applying Eq. (20) to (19) for time step k + 1, we obtain

$$\lambda^{t}_{k+1} = S_{k+1} x_{k+1} \qquad (24)$$

and

$$\lambda^{t}_{k+1} = S_{k+1} (I + B_k R_k^{-1} B_k^{T} S_{k+1})^{-1} A_k x_k. \qquad (25)$$

This uses λ^t_{k+1} from Eq. (25) as the target value for pre-training the network.
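The pre-training data described by Eqs. (12)-(25) can be generated with a short backward sweep. The sketch below is our own illustration with placeholder, time-invariant A, B and unit weights (it does not use the aircraft data from the paper).

```python
# Backward Riccati sweep (Eqs. (21), (23)) and costate targets from Eq. (25).
import numpy as np

n, m, N = 2, 1, 200
A = np.array([[0.97, 0.02], [0.05, 0.96]])     # placeholder A_k (time-invariant here)
B = np.array([[0.0], [-0.06]])                 # placeholder B_k
Q, R, S_N = np.eye(n), np.eye(m), np.eye(n)

S = [None] * (N + 1)
K = [None] * N
S[N] = S_N
for k in range(N - 1, -1, -1):                 # backward sweep
    K[k] = np.linalg.solve(B.T @ S[k + 1] @ B + R, B.T @ S[k + 1] @ A)   # Eq. (23)
    S[k] = (A - B @ K[k]).T @ S[k + 1] @ (A - B @ K[k]) + K[k].T @ R @ K[k] + Q  # Eq. (21)

def costate_target(x_k, k):
    """Target lambda^t_{k+1} for a state x_k at time k, Eq. (25)."""
    S1 = S[k + 1]
    return S1 @ np.linalg.solve(np.eye(n) + B @ np.linalg.inv(R) @ B.T @ S1, A @ x_k)

rng = np.random.default_rng(0)                 # (input, target) pairs for pre-training
states = rng.uniform(-1.0, 1.0, size=(16, n))
targets = np.array([costate_target(x, 0) for x in states])
print(targets.shape)                           # (16, 2)
```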
2.6 Training the Critic Network
The result of the pre-training of the critic network is the initial set of values of the weights and biases of the neural network for the linear model (12). However, in the general case the system is nonlinear and time-dependent:

$$x_{k+1} = F_k(x_k) + G_k(x_k)\,u_k. \qquad (26)$$

The cost function is quadratic, as in the pre-training process, which leads to a Hamiltonian of the form:

$$H_k = \frac{1}{2}\left( x_k^{T} Q_k x_k + u_k^{T} R_k u_k \right) + \lambda_{k+1}^{T}\left[ F_k(x_k) + G_k(x_k)\,u_k \right]. \qquad (27)$$

The necessary condition for the minimum of the Hamiltonian takes the form

$$\frac{\partial H_k}{\partial u_k} = 0. \qquad (28)$$
The procedure of differentiating the relation (27) with respect to the control u_k and solving Eq. (28) gives us an expression for the optimal control in terms of state and costate variables:

$$u_k = -R_k^{-1}\,[G_k(x_k)]^{T} \lambda_{k+1}. \qquad (29)$$

Differentiating the expression (27) with respect to the state x_k and resolving the resulting relation with respect to λ_k leads to an equation for the costate of the form:

$$\lambda_k = Q_k x_k + \left[ \frac{\partial\left( F_k(x_k) + G_k(x_k)\,u_k \right)}{\partial x_k} \right]^{T} \lambda_{k+1}. \qquad (30)$$
The purpose of the SNAC algorithm is to train the neural network of the critic to determine the vector λ_{k+1}, having the states x_k as inputs. It follows that, knowing λ_{k+1}, we can obtain the value of the optimal control from Eq. (29). Using the above relations, we obtain the following learning algorithm for the critic's network (a schematic implementation of this loop is sketched below):
1. Generate S_i. For each element x_k of S_i do the following:
   (a) Feed x_k into the critic network to get λ_{k+1} = λ^a_{k+1}.
   (b) Calculate u_k from the optimal control equation (29) for the known x_k and λ_{k+1}.
   (c) Get x_{k+1} from Eq. (26) using x_k and u_k.
   (d) For x_{k+1} and λ_{k+2}, calculate λ^t_{k+1} using the costate Eq. (30).
2. Train the critic network for all x_k in S_i; the outputs are their corresponding λ^t_{k+1}.
3. Check the convergence of the critic network (see the next section). If convergence is achieved, go to step 1 for i = i + 1; otherwise, repeat steps 1 and 2.
4. Continue steps 1-3 of this process until the condition i = N_s is satisfied.
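A schematic implementation of steps 1-4 is given below. It is our own illustration under strong simplifying assumptions: a time-invariant linear plant stands in for (26), the control is scalar, and the critic is a small MLP. It shows only the data flow x_k → λ_{k+1} → u_k → x_{k+1} → λ^t_{k+1}, not the authors' experimental setup.

```python
# Sketch of the SNAC critic training loop for a linear placeholder plant.
import torch
import torch.nn as nn

n, m = 2, 1
A = torch.tensor([[0.97, 0.02], [0.05, 0.96]])
B = torch.tensor([[0.0], [-0.06]])
Q, R_inv = torch.eye(n), torch.eye(m)

critic = nn.Sequential(nn.Linear(n, 16), nn.Tanh(), nn.Linear(16, n))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

c_i = 0.1
for i in range(10):                                     # growing regions S_i
    for epoch in range(200):
        x_k = (2 * torch.rand(64, n) - 1) * c_i         # step 1: sample S_i = {x : |x|_inf <= c_i}
        with torch.no_grad():
            lam_next = critic(x_k)                      # 1(a): lambda_{k+1}
            u_k = -(lam_next @ B) @ R_inv               # 1(b): Eq. (29) with G(x) = B
            x_next = x_k @ A.T + u_k @ B.T              # 1(c): Eq. (26) with F(x) = A x
            lam_next2 = critic(x_next)                  # lambda_{k+2}
            lam_target = x_next @ Q.T + lam_next2 @ A   # 1(d): Eq. (30)
        loss = ((critic(x_k) - lam_target) ** 2).mean() # step 2: fit the targets
        opt.zero_grad(); loss.backward(); opt.step()
    c_i *= 1.5                                          # steps 3-4: enlarge the region
```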
2.7 Verifying Convergence
To check the convergence of the training process of the critic network, we generate a new set of states S_i^c and target data, as described in the previous section. These target values are denoted as λ^t_{k+1}; the outputs from the trained network are denoted as λ^a_{k+1}. As a criterion of convergence, we use the admissible value of the relative error e_acc. The current value of this error is determined by the relation e_curr = ‖λ^t_{k+1} − λ^a_{k+1}‖ / ‖λ^t_{k+1}‖. It can be stated that convergence of the training process of the critic's network is achieved if for a given S_i^c the condition e_curr ≤ e_acc is satisfied.
3 Using the SNAC Approach to Synthesize an Aircraft Motion Control Law
We consider the capabilities of the SNAC approach for the synthesis of an aircraft motion control law on the example of longitudinal angular motion. This kind of motion is described traditionally by means of a system of ordinary differential equations (ODE) which can be written, for example, in the form [25,26]:

$$\begin{aligned} \dot{\alpha} &= q - \frac{\bar{q}S}{mV}\,C_L(\alpha, q, \delta_e) + \frac{g}{V},\\ \dot{q} &= \frac{\bar{q}Sc}{J_y}\,C_m(\alpha, q, \delta_e),\\ T^2\ddot{\delta}_e &= -2T\zeta\dot{\delta}_e - \delta_e + \delta_{e\,act}, \end{aligned} \qquad (31)$$

where α is angle of attack, deg; q is pitch angular velocity, deg/sec; δe is deflection angle of elevator, deg; CL is lift coefficient; Cm is pitching moment coefficient;
m is mass of aircraft, kg; V is airspeed, m/sec; q̄ = ρV²/2 is airplane dynamic pressure; ρ is mass air density, kg/m³; g is acceleration of gravity, m/sec²; S is wing area of aircraft, m²; c is mean aerodynamic chord, m; Jy is pitching moment inertia, kg·m². Dimensionless coefficients CL and Cm are nonlinear functions of angle of attack; T, ζ are time constant and relative damping factor for the elevator actuator; δe act is command signal value for the elevator actuator, limited by ±25°. In the model (31), variables α, q, δe and δ̇e are aircraft states, variable δe act is aircraft control.

The simulation results below were obtained for a medium-range passenger airplane as a control object. The values of parameters and characteristics of this aircraft required for modeling are taken from [27]. In particular, a linearized version of the model (31) is given there for the considered case:

$$\begin{bmatrix} \dot{\alpha} \\ \dot{q} \end{bmatrix} = \begin{bmatrix} -1.417 & 1 \\ 2.86 & -1.183 \end{bmatrix} \begin{bmatrix} \alpha \\ q \end{bmatrix} + \begin{bmatrix} 0 \\ -3.157 \end{bmatrix} \delta_e. \qquad (32)$$

The dynamics of the elevator actuator is described by a differential equation of the form:

$$T^2\ddot{\delta}_e = -2T\zeta\dot{\delta}_e - \delta_e + \delta_{e\,act}. \qquad (33)$$

The symbols in (32), (33) are the same as in (31). The problem to be solved by the SNAC controller is as follows. It is required to move the aircraft from some initial state α0, ωz0 to a trimmed state with values αtrim and ωz = 0. That is, we need to switch to the level flight mode, in which the angle of attack takes the trimming value αtrim and the pitch angular velocity becomes zero. The experiments were performed for various combinations of initial conditions (see Table 1), with the initial position of the elevator corresponding to the balancing one (Figs. 2 and 3). In this example, the SNAC algorithm implements the following set of relations:

$$\begin{aligned} x_{k+1} &= x_k + \Delta t\,(A x_k + B u_k),\\ J &= \frac{1}{2}\sum_{k=1}^{N} \left( x_k^{T} Q x_k + u_k^{T} R u_k \right)\Delta t,\\ \lambda_k &= \lambda_{k+1} + \Delta t\left( Q x_k + A^{T} \lambda_{k+1} \right),\\ u_k &= -B^{T} R^{-1} \lambda_{k+1}. \end{aligned} \qquad (34)$$
SNAC Approach to Aircraft Control
429
Table 1. Values of initial states for performed experiments α, deg q, deg/sec 1 10
18
2 6
7
3 −6
−6
4 −12
−11
Another series of experiments was related to simulating a actuator failure by varying its time constant. The results of these experiments are presented in Figs. 6 and 7.
Fig. 2. Angle of attack transients for different initial conditions
The analysis of the obtained results allows us to draw the following conclusions. The controller implementing the SNAC-algorithm solves the problem with high efficiency. The transient times for angle of attack and angular velocity are within 2.5 to 3.5 s, the overshoot is small, the transients are close to aperiodic. Comparison of SNAC controller with LQR controller, which implements the optimal control law, shows that the former is not inferior to the latter, and in a number of cases it even exceeds the latter. At the same time, the SNAC controller demonstrates a good level of robustness, which allows it to successfully overcome failure situations, expressed in the degradation of the characteristics of the rudder elevator actuator. The next stage in the evolution of the SNAC approach is to replace the linearized motion model (12), which is tied to a single mode of operation of the
430
Y. V. Tiumentsev and R. A. Tshay
Fig. 3. Pitch rate transients for different initial conditions
Fig. 4. Comparison of transients by angle of attack for SNAC and LQR
SNAC Approach to Aircraft Control
Fig. 5. Comparison of pitch rate transients for SNAC and LQR
Fig. 6. Angle of attack transients at different actuator time constants
431
432
Y. V. Tiumentsev and R. A. Tshay
Fig. 7. Pitch rate transients at different actuator time constants
control object, with an NDI subsystem, which will provide accurate linearization for this object in the whole range of its modes.
4
Conclusions
There is an important applied problem, which is to control the motion of aircraft under conditions of incomplete and inaccurate knowledge about the properties of the control object and the environment in which it operates. This problem is poorly handled by traditional methods. At the same time, these methods can be significantly strengthened by combining them with machine learning methods. One possibility is approximate dynamic programming, which combines reinforcement learning and feed-forward neural networks. A significant number of ADP variants are based on the concept of the adaptive critic and exist in a large number of varieties. One of such varieties is SNAC, the specific feature of which is the implementation of the critic as a feed-forward neural network, while the actor has the form of an optimization algorithm of a special kind. This approach reduces the resource consumption for the formation of the required control law. Capabilities of this approach are demonstrated on the example of a real applied problem, in which the control law of longitudinal angular motion of a passenger aircraft is synthesized. The results allow us to evaluate the effectiveness of the SNAC approach, as well as to identify its elements that require further research and development. As these results demonstrate, the controller implementing the SNAC algorithm solves the problem with high efficiency. The response times of the pitch angle and angle velocity transients as well as the overshooting are small, the transients are
SNAC Approach to Aircraft Control
433
close to aperiodic. We can also see that the SNAC controller is not inferior to the LQR controller which implements the optimum control law. At the same time, the SNAC-regulator shows a good level of robustness, allowing it to successfully cope with failure situations. It can also be seen that the LQR-regulator implementing the optimal control law is not superior to the SNAC-regulator. At the same time, the SNAC controller demonstrates a good level of robustness, which allows it to successfully cope with failure situations. The next stage in the development of the SNAC approach is to replace the linearized motion model (12), which is tied to a single operating mode of the control object, with an NDI subsystem, which will provide accurate linearization for this object in the whole range of its modes. Funding. The paper was prepared under the Program for the Development of the World-Class Research Center “Supersonic” in 2020–2025, funded by the Russian Ministry of Science and Higher Education (Agreement dated April 20, 2022, No. 075-152022-309).
References 1. Powell, W.B.: Approximate Dynamic Programming: Solving the Curse of Dimensionality, 2nd edn. Wiley, Hoboken (2011) 2. Lewis, F.L., Liu, D. (eds.): Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, Hoboken (2013) 3. Liu, D., Xue, S., Zhao, B., Luo, B., Wei, Q.: Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans. Syst. Man Cybern. 51(1), 142–160 (2021) 4. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ. Syst. Mag. 9(3), 32–50 (2009) 5. Liu, D., Wei, Q., Wang, D., Yang, X., Li, H.: Adaptive Dynamic Programming with Applications in Optimal Control. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-50815-3 6. Song, R., Wei, Q., Li, Q.: Adaptive Dynamic Programming: Single and Multiple Controllers. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-1712-5 7. Wei, Q., Song, R., Li, B., Lin, X.: Self-Learning Optimal Control of Nonlinear Systems: Adaptive Dynamic Programming Approach. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-4080-1 8. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Upper Saddle River (2006) 9. Werbos, P.J.: A menu of designs for reinforcement learning over time. In: Miller, W.T., Sutton, R.S., Werbos, P.J. (eds.) Neural Networks for Control, pp. 67–95. MIT Press, Cambridge (1990) 10. Ferrari, S., Stengel, R.F.: Online adaptive critic flight control. J. Guidance Control Dyn. 27(5), 777–786 (2004) 11. Wang, D., He, H., Liu, D.: Adaptive critic nonlinear robust control: a survey. IEEE Trans. Cybern. 47(10), 1–22 (2017) 12. Wang, D., Mu, C.: Adaptive Critic Control with Robust Stabilization for Uncertain Nonlinear Systems. Springer, Singapore (2019). https://doi.org/10.1007/978-98113-1253-3
434
Y. V. Tiumentsev and R. A. Tshay
13. Lakshmikanth, G.S., Padhi, R., Watkins, J.M., Steck, J.E.: Adaptive flight-control design using neural-network-aided optimal nonlinear dynamic inversion. J. Aerosp. Inf. Syst. 11(11), 785–806 (2014) 14. Lakshmikanth, G.S., Padhi, R., Watkins, J.M., Steck, J.E.: Single network adaptive critic aided dynamic inversion for optimal regulation and command tracking with online adaptation for enhanced robustness. Optimal Control Appl. Methods 35, 479–500 (2014) 15. Steck, J.E., Lakshmikanth, G.S., Watkins, J.M.: Adaptive critic optimization of dynamic inverse control. In: 2012 AIAA Infotech and Aerospace Conference, Garden Grove, California, USA, p. 2408, 21 p. AIAA (2012) 16. Heyer, S.: Reinforcement learning for flight control: learning to fly the PH-LAB. M.S. thesis, Delft University of Technology, 126 p (2019) 17. Teirlinck, C.: Reinforcement learning for flight control: hybrid offline-online learning for robust and adaptive fault-tolerance. M.S. thesis, Delft University of Technology, 153 p (2022) 18. Padhi, R., Unikrishnan, N., Wang, X., Balakrishnan, S.N.: A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Netw. 19, 1648–1660 (2006) 19. Enns, D., Bugajski, D., Hendrick, R., Stein, G.: Dynamic inversion: an evolving methodology for flight control design. Int. J. Control 59(1), 71–91 (1994) 20. Looye, G.: Design of robust autopilot control laws with nonlinear dynamic inversion. Automatisierungstechnik 49(12), 523–531 (2001) 21. Lewis, F.L., Vrabie, D.L., Syrmos, V.L.: Optimal Control, 3rd edn., p. 550. Wiley, Hoboken (2012) 22. Werbos, P.J.: Approximate dynamic programming for real-time control and neural modeling. In: Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, Van Nostrand Reinhold, New York, USA (1992) 23. Nobleheart, W.G., Geethalakshmi, S.L., Chakravarthy, A., Steck, J.: Single network adaptive critic (SNAC) architecture for optimal tracking control of a morphing aircraft during a pull-up maneuver. In: AIAA 2013, p. 5004, 18 p (2013) 24. Bryson, A.E., Ho, Y.-C.: Applied Optimal Control: Optimization, Estimation and Control (1975) 25. Stevens, B.L., Lewis, F.L., Johnson, E.N.: Aircraft Control and Simulation: Dynamics, Controls Design and Autonomous Systems, 3rd edn., p. 764. Wiley, Hoboken (2016) 26. Cook, M.V.: Flight Dynamics Principles, 2nd edn., p. 496. Elsevier, Amsterdam (2007) 27. Blakelock, J.H.: Automatic Control of Aircraft and Missiles. Wiley, New York (1965)
Generating Generalized Abstracts Using a Hybrid Intelligent Information System for Analysis of Judicial Practice of Arbitration Courts Maria O. Taran, Georgiy I. Revunkov, and Yuriy E. Gapanyuk(B) Bauman Moscow State Technical University, Moscow, Russia [email protected]
Abstract. The article is devoted to the description of the operation of the hybrid intelligent information system for analysis of judicial practice of arbitration courts. The structure of a generalized abstract of a judicial act is considered. The essential factor is that lawyers perceive a verbatim quote much better, whether it be a sentence or a whole paragraph, than a generated text; in the latter case, they have no confidence that the essence is reflected correctly. Existing solutions to the problem of abstract generation of a judicial act are briefly reviewed. The proposed solution is based on the hybrid intelligent information system for analysis of judicial practice of arbitration courts. The main elements of the system are the subsystem of subconsciousness (SbS) and the subsystem of consciousness (SbC). The role of the environment is performed by the texts of judicial acts that can be submitted to the system. Depending on the number of input documents, a judicial act (one document) and judicial practice (several documents) can be distinguished. The subsystem of subconsciousness includes the following modules: the preprocessing module, the feature extraction module, the module of clustering and grouping of judicial acts, the paragraph classification module. The subsystem of consciousness includes the following modules: the module for compiling a summary of a judicial act, the judicial practice analysis module, the report generation module. The results of the experiments are described in the corresponding section. Keywords: Arbitration Court · Judicial Act · Text Summarization · Text Mining · Text Clustering · Text Analytics · Deep Learning · Hybrid Intelligent Information System for Analysis of Judicial Practice of Arbitration Courts
1 Introduction The analysis of judicial practice can be viewed from several angles. First, these are statistical indicators, such as the number of cases considered over a period, the number of satisfied claims, the number of decisions overturned in higher instances, and similar indicators. Secondly, these are reviews of judicial practice. The official review of judicial practice is published several times a year by the Supreme Court of the Russian Federation, © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2023, SCI 1120, pp. 435–444, 2023. https://doi.org/10.1007/978-3-031-44865-2_46
436
M. O. Taran et al.
most often quarterly. These are quite voluminous documents that can contain about one hundred and fifty pages. They contain examples from short versions of judicial acts, grouped on certain issues. Several examples may be given, as well as references to other judicial acts. Thirdly, this is the formation of a legal position and the generalization of judicial practice on certain issues. Most often, based on the results of the generalization, an analytical or any other reference is created. It is the second and third options for the analysis of judicial practice that are currently being performed manually by employees of different organizations. This is a very time-consuming process, for the automation of which the use of hybrid intelligent information system for analysis of judicial practice of arbitration courts [1] is proposed. This article will consider the process of generating generalized abstracts using this system.
2 The Structure of a Generalized Abstract In accordance with GOST R 7.0.99-2018, an abstract is an accurate and concise summary of the content of the primary document in text form. At the same time, it is not allowed to convey the conclusions and opinions of the author of the abstract, to make critical remarks, or to add other similar points. It is assumed that scientific articles, certain studies and other similar materials are most often abstracted; therefore, in the above GOST, the structure of the abstract is given specifically for such texts. For example, it is proposed to indicate the methodology of the work, the results of the work, factual data, important findings, etc. Of course, such a structure cannot be fully applied to legal non-scientific texts. At the same time, lawyers, like scientists, need to constantly read a large volume of new texts, since law enforcement practice does not stand still and over time certain norms may not only be abolished but also interpreted quite differently. Judicial acts on average take from three to fifteen pages, and for individual topics the average volume reaches twenty to thirty pages. In order to draw up any legal position, it is necessary to familiarize oneself with a significant number of documents. Currently, this entails very large time costs or, even worse, the study of only a minimal number of judicial acts. Therefore, without familiarization with brief versions of judicial acts, it is quite difficult to keep one's knowledge and expertise up to date. A brief overview of a judicial act is a monographic specialized abstract containing the main provisions and conclusions of the court, which can be useful not only for participants in a particular dispute but also for other persons. Reading such an overview allows specialists to get acquainted only with the main motives of the court, without secondary information, which significantly reduces the time they spend. If desired or necessary, one can always study the full version of the original document. The approximate structure of the summary of a judicial act is:
• Short name of the document, case number, court instance;
• The decision taken;
• The main conclusions and motives of the court;
• Regulatory legal acts referred to by the court.
Unlike for other types of texts, when preparing an abstract of a legal document it is very important not to miss the nuances of the presentation, since the slightest change in the order of words or their replacement with synonyms can change the meaning, which is highly undesirable. Lawyers perceive a verbatim quote, whether a sentence or a whole paragraph, much better than a generated text: in the latter case, they have no confidence that the essence is reflected correctly. The authors of [2] conducted a small study on the quality of their summarization system and drew the following conclusions:
• Abstracts should contain sentences from different sections of the judicial act.
• The amount of duplicate and redundant information should be reduced.
A universal abstract can serve for reference only. In practice, abstracts in which lawyers can choose the types of information they need will be more useful. For example, some specialists need facts, others are interested only in the result, and still others are interested in the arguments of the parties.
3 Existing Solutions to the Problem In [3], statistical features, keywords, and conditional random fields (CRF) are used to summarize documents; on their basis, sentences are scored and a final summary is compiled. In [4], the key role is played by a rule-based knowledge base compiled using the Ripple Down Rules (RDR) methodology; the methods were tested on decisions of Australian courts. In [5], sentences are scored for each topic determined by the Latent Dirichlet Allocation (LDA) method on the basis of the entire corpus of documents. In [6], a graph-based method is proposed for identifying important sentences. In [7], it is proposed to use a bidirectional LSTM network to classify the rhetorical role of a sentence; the methods were tested on judicial decisions of the Supreme Court of India. In [8], the authors explore various natural language processing methods to extract key information and facts from Russian court decisions related to administrative offenses. The paper [9] proposes an extractive summarization method based on integer linear programming (ILP); it is also proposed to use the recommendations of experts to create abstracts. The authors compared it with several modern general-purpose summarization systems, including those based on BERT [10], and concluded that such systems are less suitable for this problem than specialized ones. The authors of [7] later investigated fifteen summarization algorithms in [11] and concluded that algorithms based on BERT and RNN performed best. In [12], they also explore various LLMs for summarizing Indian case law and conclude that "abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization". The authors of [13] investigated BERT-based models applied to a certain type of cases among judicial acts from Indonesia.
4 The Proposed Solution to the Problem The creation of a brief overview of a judicial act is one of the reports that can be generated by the "hybrid intelligent information system for analysis of judicial practice of arbitration courts" (referred to below as the "System"). This system is based on the hybrid intelligent information system (HIIS) methodology [14]. The operation of the main modules of the System was considered in detail in [1, 15, 16]; therefore, in this section we focus more on the operation of the System as a whole.
Fig. 1. The architecture of the System.
The main function of the System is to map input documents into their abbreviated versions (abstracts). An additional function is the generation of statistical reports on judicial practice. In the process of researching and creating the System, a method was proposed for summarizing a judicial act, i.e., creating its brief overview, based on the extraction of paragraphs from the source text. The main steps of this method (a schematic sketch of the pipeline is given after the list) are:
• Identification of useful parts of the document.
• Identification of important paragraphs.
• Definition of document details.
• Drafting a summary of the judicial act.
This method is implemented in the System by several modules. The basis of the method of summarizing a judicial act is the identification of paragraphs that are important in terms of their applicability in practice. The importance of a particular paragraph is subjective and is determined by the task that the specialist faces, so it is preferable to give the specialist the opportunity to choose for the abstract the information that suits him at a given time. If a specialist is interested only in the positions of the court, the evidence, or references to legal acts, then this information should be shown to him in the report. To do this, the paragraphs in the document must carry a label or class that shows what main idea the paragraph conveys. The structure of the System is shown in Fig. 1. The preliminary structure of the System was proposed in [1]; in this article we consider an improved version. The main elements of the System are the subsystem of subconsciousness (SbS) and the subsystem of consciousness (SbC). The role of the environment is played by the texts of judicial acts submitted to the System. Depending on the number of input documents, a judicial act (one document) and judicial practice (several documents) can be distinguished. The subsystem of subconsciousness performs the main tasks related to interaction with the environment: for example, it converts source documents, extracts attributes, classifies paragraphs, and groups documents within judicial practice. Basically, all these tasks are solved using soft computing, which provides the intelligent component of the System. The subsystem of consciousness, on the contrary, performs the task of creating a particular report based mostly on rules and classical algorithms. Here the tasks of summarizing a judicial act and of analyzing judicial practice are solved. The SbC is also responsible for monitoring the data coming from the SbS and for interacting with the user. Since the results of the SbC's work are in fact recommendations that require additional reflection and actions from the user-lawyer, we can say that it performs some functions of a decision support system (DSS). It should be noted that the rules for generating an abstract or report can be probabilistic in nature. The classical OOP model is used as the data model in the SbC, which allows proven tools and approaches to the development, storage and analysis of data to be used. The subsystem of subconsciousness includes the following modules:
• The preprocessing module.
• The feature extraction module.
• The module of clustering and grouping of judicial acts.
• The paragraph classification module.
The preprocessing module implements the main functions of text preprocessing: text cleaning, tokenization, word normalization, and PoS tagging. The feature extraction module implements functions for extracting features from the text and its paragraphs. The module of clustering and grouping of judicial acts is described in detail in [16]. It is used only in the analysis of judicial practice, i.e., if more than one document has been uploaded to the System. In accordance with the proposed method, the main functions of this module are: extracting features from judicial acts, clustering documents, and grouping documents.
The input of the module is a set of judicial acts, and the output is several sets of grouped documents. A separate group is formed by documents that were not included in any other group. The paragraph classification module, described in detail in [15], is responsible for the automatic labeling of paragraphs for further use in summarizing a judicial act. Identifying important paragraphs in a document is a multi-class classification problem. In total, the module can assign nine main paragraph classes and one additional class for ambiguous predictions. The latter is assigned when it is not possible to unambiguously determine membership in any class; in effect, it catches paragraphs that have not previously been encountered in the training sample or that do not contain the key elements needed for a reliable classification. This class requires mandatory manual markup. Because ensembles (algorithmic compositions) of algorithms can demonstrate better generalization ability than individual algorithms, an ensemble of several algorithms is used in the module. There is a sufficient number of well-studied methods for building an algorithmic composition, for example, AdaBoost. Some of them create the composition automatically, others require manual construction. In this research, the second option was chosen: each of the basic algorithms was trained to solve the classification problem, and the parameters of the corrective operation combining them were then selected. We use a hybrid approach for paragraph classification, namely several independent models and algorithms, to improve the quality of the final classification. This allows even paragraph classes with a small number of examples to be predicted. The following basic steps are used to classify paragraphs. In the first step, the source text is converted into a convenient format; for example, extra spaces, empty paragraphs, etc. are removed (the preprocessing module). In the second step, the motivational part is divided into paragraphs; all tables are deleted, and lists are turned into a single paragraph (the feature extraction module). In the third step, additional features are extracted from the paragraphs, such as dates, amounts, rules of law, etc. Then each paragraph is encoded using classical TF-IDF. For individual models, additional feature extraction is used, for example, template (keyword) extraction, embedding layers, or tokenizers for BERT. Some of these features are extracted by the feature extraction module, while others are produced by specific models belonging to the paragraph classification module. In the fourth step, all paragraphs are classified by the paragraph classification module. Paragraph marks are placed in the source text; depending on the parameters, these can be either special words or highlighting. The result is exported as a dictionary for use in other system modules; it is also possible to export the marked-up text to a text file. The subsystem of consciousness includes the following modules:
• The module for compiling a summary of a judicial act.
• The judicial practice analysis module.
• The report generation module.
The module for compiling a summary of a judicial act is designed to create a short version (abstract) of a judicial act, depending on the data required by a specialist.
The document and the necessary types of paragraphs are supplied as input to the module, and the output is an abbreviated text of the document. The judicial practice analysis module works only for judicial practice as a whole, not for each document separately. It includes the computation of statistics, as well as the preparation of an overview and summary of judicial practice. The report generation module is responsible for the format and presentation of the corresponding report, which can be shown to a specialist or transferred to another system. The interaction between the SbC and the SbS is carried out directly, without a third-party data storage; the storage of intermediate data is organized separately in each module. The user interacts only with the SbC, since the SbS does not provide any directly useful results for the user.
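As a rough illustration of the paragraph classification step described above, the following sketch combines a TF-IDF representation with a manually assembled composition of base classifiers via soft voting. The base models, their weights, the confidence threshold and the "ambiguous" fallback are illustrative assumptions; the actual module additionally uses keyword templates, embeddings and BERT tokenizers, which are omitted here:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

AMBIGUOUS = "ambiguous"           # extra class for uncertain predictions
CONFIDENCE_THRESHOLD = 0.5        # assumed value, not taken from the paper

def build_classifier():
    """TF-IDF encoding followed by a soft-voting composition of base models."""
    tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
    composition = VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("svc", CalibratedClassifierCV(LinearSVC())),  # calibration gives predict_proba
        ],
        voting="soft",
    )
    return make_pipeline(tfidf, composition)

def classify_paragraphs(model, paragraphs):
    """Return a label per paragraph; low-confidence cases become 'ambiguous'."""
    proba = model.predict_proba(paragraphs)
    labels = model.classes_[np.argmax(proba, axis=1)]
    confident = proba.max(axis=1) >= CONFIDENCE_THRESHOLD
    return [lab if ok else AMBIGUOUS for lab, ok in zip(labels, confident)]
```

The model returned by build_classifier() is fitted on labeled paragraphs in the usual scikit-learn way (model.fit(paragraph_texts, labels)) before classify_paragraphs() is applied.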
5 Experiments As part of this study, experiments were conducted to compare the paragraph classification module with modern neural network models. Some of these models had been trained, among other things, for the generation of abstracts. However, the requirements for legal documents do not allow full use of the abstracts that such neural networks can generate; therefore, they were used only to solve the paragraph classification problem. Because of the increased hardware requirements at the model training step, the models were trained only for the classification problem: an output layer was added to classify paragraphs, and the rest of the model weights were frozen (fine-tuning). The default parameters predefined in the Hugging Face sequence classification models were used. Testing was carried out on the small dataset described in [16]. The comparison was carried out for the neural networks ruBert, ruRoberta, ruGPT3, and FRED-T5. All of them were pre-trained for the Russian language by the SberDevices team and posted on Hugging Face [17]. The results of the experiments are presented in Table 1.

Table 1. Models comparison for paragraph classification

Models                            Precision   Recall   F1-score
ruBert (fine-tuned)               0.45        0.67     0.54
ruGPT3 (fine-tuned)               0.56        0.70     0.59
ruRoberta (fine-tuned)            0.50        0.32     0.37
FRED-T5 (fine-tuned)              0.45        0.67     0.54
ruBert (train all weights)        0.82        0.84     0.82
Text Fragment Extraction Module   0.91        0.82     0.86
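A minimal sketch of the frozen-backbone setup used for the baselines in Table 1 (the model identifier is an assumed example of the ai-forever checkpoints from [17], the label count and training arguments are illustrative, and the parameter-name filter is written for BERT-style heads):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "ai-forever/ruBert-base"   # assumed Hugging Face id from [17]
NUM_CLASSES = 10                         # nine main classes + "ambiguous"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES)

# Freeze everything except the freshly added classification head.
# Note: GPT-style models name the head "score" instead of "classifier".
for name, param in model.named_parameters():
    if not name.startswith("classifier"):
        param.requires_grad = False

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=256)

args = TrainingArguments(output_dir="clf_out", num_train_epochs=3,
                         per_device_train_batch_size=16)

# train_ds / eval_ds are assumed datasets with "text" and "label" columns:
# train_ds = train_ds.map(tokenize, batched=True)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds.map(tokenize, batched=True))
# trainer.train()
```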
It can be seen from Table 1 that the proposed paragraph classification module showed better quality than the neural network models. Most likely, by increasing
the number of examples in the dataset, as well as by changing the number of layers and their parameters, better quality indicators could be achieved. Another option for improvement could be to train the entire model for the classification problem, and not just its output layer. This hypothesis is partly confirmed by the result of the ruBert (train all weights) model, which was trained with unfrozen weights: the table shows that its indicators approach those of the paragraph classification module, falling only slightly short of it. To check the quality of summarizing a judicial act, a small dataset was compiled based on official reviews of judicial practice issued by the Supreme Court of the Russian Federation. The dataset contains about 200 examples; it is worth noting that examples from these documents were not used for the paragraph classification. For classifying paragraphs in these calculations, only the paragraph classification module was used, since the tested neural models did not show the required generalization ability. Rouge-1, Rouge-2, and Rouge-4 were used to measure the quality of summarization. The results of measuring the quality of the summarization module are shown in Table 2.

Table 2. Indicators of the quality of summarizing a judicial act

          Precision   Recall   F1-score
Rouge-1   0.81        0.65     0.71
Rouge-2   0.72        0.56     0.63
Rouge-4   0.66        0.51     0.58
The quality of summarization can be improved by selecting the parameters that control the types of paragraphs and the number of paragraphs of each type. However, it is worth noting that abstracts are generated with parameters chosen by the users themselves, which allows individual abstracts to be obtained depending on the current tasks of the lawyer.
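ROUGE precision/recall/F1 values of the kind reported in Table 2 can be computed, for instance, with the rouge-score package; the sketch below averages the scores over a set of reference/generated abstract pairs (variable names are illustrative, support for "rouge4" and the suitability of the default English-oriented tokenizer for Russian text are assumptions to verify):

```python
import numpy as np
from rouge_score import rouge_scorer

def evaluate_summaries(references, generated):
    """Average ROUGE-1/2/4 precision, recall and F1 over a dataset."""
    # Note: the default tokenizer is designed for English; for Russian legal
    # texts a custom tokenizer may be required.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge4"],
                                      use_stemmer=False)
    totals = {k: [] for k in ("rouge1", "rouge2", "rouge4")}
    for ref, gen in zip(references, generated):
        scores = scorer.score(ref, gen)
        for key, s in scores.items():
            totals[key].append((s.precision, s.recall, s.fmeasure))
    return {key: tuple(np.mean(vals, axis=0)) for key, vals in totals.items()}

# Example usage with toy strings instead of real abstracts:
# metrics = evaluate_summaries(["reference abstract"], ["generated abstract"])
# print(metrics["rouge1"])  # (precision, recall, f1)
```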
6 Conclusions An abstract is an accurate and concise summary of the content of the primary document in text form. Lawyers, like scientists, need to constantly read a large volume of new texts. When preparing an abstract, it is very important not to miss the nuances of the presentation, since the slightest change in the order of words or their replacement with synonyms can change the meaning, which is highly undesirable. The hybrid intelligent information system for analysis of judicial practice of arbitration courts, based on the hybrid intelligent information systems methodology, is proposed as a solution to this problem. The subsystem of subconsciousness (SbS) performs the main tasks related to interaction with the environment. It includes the following modules: the preprocessing module,
the feature extraction module, the module of clustering and grouping of judicial acts, and the paragraph classification module. The subsystem of consciousness (SbC) performs the task of creating a particular report based on rules and classical algorithms. It includes the following modules: the module for compiling a summary of a judicial act, the judicial practice analysis module, and the report generation module. The results of the experiments show that, in general, the system fulfills its purpose, although its quality still needs further improvement.
References
1. Taran, M.O., Revunkov, G.I., Gapanyuk, Y.E.: The text fragment extraction module of the hybrid intelligent information system for analysis of judicial practice of arbitration courts. In: Kryzhanovsky B., Dunin-Barkowski W., Redko V., Tiumentsev Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research IV. NEUROINFORMATICS 2020. SCI, vol. 925, pp. 242–248. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-60577-3_28
2. Polsley, S., Jhunjhunwala, P., Huang, R.: CaseSummarizer: a system for automated summarization of legal texts. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, pp. 258–262 (2016)
3. Saravanan, M., Ravindran, B., Raman, S.: Automatic identification of rhetorical roles using conditional random fields for legal document summarization. In: Proceedings of the Third International Joint Conference on Natural Language Processing, vol. I (2008)
4. Galgani, F., Compton, P., Hoffmann, A.: Combining different summarization techniques for legal text. In: Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pp. 115–123 (2012)
5. Kumar, V.R., Raghuveer, K.: Legal document summarization using latent dirichlet allocation. Int. J. Comput. Sci. Telecommun. 3(7), 114–117 (2012)
6. Kim, M.Y., Xu, Y., Goebel, R.: Summarization of legal texts with high cohesion and automatic compression rate. In: Motomura, Y., Butler, A., Bekki, D. (eds.) New Frontiers in Artificial Intelligence. JSAI-isAI 2012. LNCS, vol. 7856, pp. 190–204. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39931-2_14
7. Bhattacharya, P., Paul, S., Ghosh, K., Ghosh, S., Wyner, A.Z.: Identification of Rhetorical Roles of Sentences in Indian Legal Judgments. International Conference on Legal Knowledge and Information Systems. arXiv:1911.05405 (2019)
8. Metsker, O., Trofimov, E., Grechishcheva, S.: Natural language processing of Russian court decisions for digital indicators mapping for oversight process control efficiency: disobeying a police officer case. In: Chugunov, A., Khodachek, I., Misnikov, Y., Trutnev, D. (eds.) Electronic Governance and Open Society: Challenges in Eurasia. EGOSE 2019. CCIS, vol. 1135, pp. 295–307. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39296-3_22
9. Bhattacharya, P., Poddar, S., Rudra, K., Ghosh, K., Ghosh, S.: Incorporating domain knowledge for extractive summarization of legal case documents. arXiv:2106.15876 (2021)
10. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2019)
11. Deroy, A., Ghosh, K., Ghosh, S.: Ensemble methods for improving extractive summarization of legal case judgements. Artif. Intell. Law (2023). https://doi.org/10.1007/s10506-023-09349-8
12. Deroy, A., Ghosh, K., Ghosh, S.: How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization? arXiv:2306.01248 (2023)
13. Wicaksono, G.W., Al asqalani, S.F., Azhar, Y., Hidayah, N.P., Andreawana, A.: Automatic summarization of court decision documents over narcotic cases using BERT. JOIV: Int. J. Inform. Vis. (2023). https://doi.org/10.30630/joiv.7.2.1811
14. Chernenkiy, V., Gapanyuk, Y., Terekhov, V., Revunkov, G., Kaganov, Y.: The hybrid intelligent information system approach as the basis for cognitive architecture. Procedia Comput. Sci. 145, 143–152 (2018)
15. Taran, M.O., Revunkov, G.I., Gapanyuk, Y.E.: Generating a summary of a court act based on an improved text fragment extraction module. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y., Klimov, V.V. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research V. NEUROINFORMATICS 2021. SCI, vol. 1008, pp. 292–298. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-91581-0_39
16. Taran, M.O., Revunkov, G.I., Gapanyuk, Y.E.: Creating a brief review of judicial practice using clustering methods. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research VI. NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 466–475. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_48
17. ai-forever. https://huggingface.co/ai-forever. Accessed 15 May 2023
Integration of Data from Various Physical Methods in Solving Inverse Problems of Spectroscopy by Machine Learning Methods Artem Guskov1,2(B), Igor Isaev1,3, Sergey Burikov1,2, Tatiana Dolenko1,2, Kirill Laptinskiy1, and Sergey Dolenko1
1 D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State University,
Moscow, Russia [email protected], [email protected] 2 Faculty of Physics, M.V. Lomonosov Moscow State University, Moscow, Russia 3 Kotelnikov Institute of Radio Engineering and Electronics, Russian Academy of Sciences, Moscow, Russia
Abstract. This article presents the results of solving an inverse problem in spectroscopy using integration of optical spectroscopy methods. The studied inverse problem is determining the concentrations of heavy metal ions in multicomponent solutions by Raman spectra, infrared spectra and optical absorption spectra. It is shown that the joint use of data from various physical methods makes it possible to reduce the error of spectroscopic determination of concentrations. If the integrated methods differ significantly in accuracy, then their integration is not effective. These effects are observed using various machine learning methods: random forest, gradient boosting and artificial neural networks – multilayer perceptrons. A series of experiments with solutions based on river water is also performed to estimate the variability of the fluorescence of natural waters in Moscow. A significant increase in the error level relative to solutions prepared in distilled water is observed. This indicates the need to develop new methods to improve the quality of the solution of the investigated problem for diagnostics of real river waters. Keywords: inverse problem · heavy metal ions · optical spectroscopy methods · integration of physical methods · neural networks · gradient boosting · random forest · natural waters
1 Introduction Currently, machine learning (ML) methods are actively used to solve various problems of technological production, ecology, and biomedicine associated with the processing of specific types of data, for example, optical spectroscopy data [1, 2]. Spectroscopic methods are the most promising ones for continuous monitoring of the ionic composition of technological and natural aqueous media, wastewater from industrial enterprises, and food drinks, because optical methods provide remote and rapid diagnostics of multicomponent media and allow monitoring the situation in real time [3].
One of the serious advantages of ML methods is their ability to solve multiparameter inverse problems, which enables their successful application to inverse problems of optical spectroscopy of multicomponent aqueous media, in particular, the problems of identifying many individual components, as well as determining the concentration of each of them in the presence of the others [4]. For example, Li et al. created a portable device for analyzing the composition of multicomponent mixtures by their IR absorption spectra using perceptrons with a single hidden layer. The authors were able to determine the concentrations of glucose, polyethylene glycol, and bovine serum albumin in three-component mixtures with high accuracy [5]. Various ML methods are actively used in the food industry for qualitative and quantitative analysis, using optical spectroscopy, of such chemically complex objects as coffee, wine, beer and juices. Thus, the authors of [6] showed that convolutional neural networks can be effectively used to classify vibrational spectroscopic data and identify important spectral regions. In that study, the following identification problems were solved: determination of the regions of wine production, the variety of grains from which coffee was made, the regions of olive oil production, and the type of fruit from which juice was made. The authors of [7] developed approaches to determine the concentration of alcohol from the IR absorption spectra of wines and the concentration of sucrose in orange juice from reflection spectra in the near IR range. Optical absorption and IR absorption spectroscopy in combination with artificial neural networks (multilayer perceptrons) have been used as a non-destructive method for quantitative determination of ethanol, glucose, glycerol, tartaric, malic, acetic, and lactic acid in aqueous solutions [8]. Unfortunately, the trained neural networks have not been applied to the spectra of real wines. However, it should be noted that, despite the rather numerous publications devoted to solving multiparameter inverse problems of determining the composition of multicomponent media using optical spectroscopy and ML, a reliable and noise-resilient universal method for diagnosing complex media able to provide high accuracy in determining the desired parameters has not yet been developed. In this study, we consider the inverse problem of simultaneously determining the concentrations of 8 ions in aqueous solutions by spectral data of three optical methods at once (Raman spectroscopy, infrared (IR) spectroscopy, and spectroscopy of optical absorption (OA)) using ML methods. It was assumed that various optical methods could provide different information about the studied object, which could be used simultaneously by ML methods, thus reducing the error of solving the studied inverse problem. In our preceding study [9], it was found that this is not always the case. In the present study, we investigate the efficiency of various combinations of data of the three spectroscopy types used by various ML methods, and also test the applicability of the approach under real river water conditions on solutions prepared using water taken from four different rivers within the Moscow megapolis.
2 Description of Physical Experiment 2.1 Preparation of Solutions In this study, aqueous solutions of the following salts were investigated: Zn(NO3)2, ZnSO4, Cu(NO3)2, CuSO4, LiNO3, Fe(NO3)3, NiSO4, Ni(NO3)2, (NH4)2SO4, NH4NO3. The concentration of the Zn2+, Cu2+, Li+, Fe3+, Ni2+, NH4+ cations varied in the range from 0 to 1 M. The concentrations of the SO42−, NO3− anions were calculated from the concentrations of the corresponding cations. The total concentration of cations in the solution did not exceed 2 M, because otherwise the contents of the solution precipitated. In order to "decouple" the concentration of cations from the concentration of anions, solutions were formed using both nitrate and sulfate ions whenever possible. A solution could simultaneously contain from 2 to 8 ions, produced by dissociation of 1 to 10 salts in their various possible combinations. Different types of water were used to prepare the aqueous solutions of the studied salts. Depending on the type of water, the samples were divided into the following series:
• Basic series – solutions were prepared with distilled water
• «Golden» series – water for the preparation of solutions was taken from the Moscow River near Ostrovnaya Street in the west of Moscow
• «Silver» series – samples were prepared with water taken from the Yauza, Bitsa, and Setun rivers within the territory of Moscow
Water samples from the rivers were taken on different days and at different points. Solutions based on river water were prepared with the same salt concentrations as in the basic series. Also, one sample without salts was prepared for each of the rivers. Preparation of solutions based on river water was necessary to conduct an experiment evaluating the fluorescence variability of natural waters in Moscow. Raman spectra, IR spectra and OA spectra were measured for each sample. 2.2 Raman Spectroscopy Excitation of the Raman signal was performed by a continuous YAG laser (wavelength 532 nm, laser power 500 mW). Spectra were recorded using a system consisting of a monochromator (Acton 2500i, grating 900 grooves/mm) and a CCD camera (1024*256 Syncerity, Horiba Jobin Yvon). Spectra were recorded in 3 spectral ranges: 300–1770 cm−1, 1347–2855 cm−1, 2775–4019 cm−1. A special procedure was used to combine the three ranges into a single spectrum. The signal accumulation time was 10 s (10 cycles of 1 s each). The first and third bands for some of the obtained spectra are presented in Fig. 1. One can see the bands of vibrations of complex ions (NO3−, SO42−) (Fig. 1, left) and the broad band of valence vibrations of water molecules (Fig. 1, right). 2.3 Infrared Spectroscopy Spectra of infrared absorption were obtained using a Bruker Invenio R spectrometer equipped with an ATR unit.
Fig. 1. Raman spectra of water and aqueous solutions of salts: 1 – dist. water; 2 – 0.22 M LiNO3; 3 – 0.47 M Zn(NO3)2, 0.62 M ZnSO4; 4 – 0.22 M Cu(NO3)2, 0.47 M LiNO3, 0.40 M (NH4)2SO4.
Spectra were measured in the range 400–4500 cm−1 with resolution 4 cm−1. In Fig. 2 one can see spectra of IR absorption of water and solutions of the salts Cu(NO3)2, LiNO3, (NH4)2SO4 with concentration 1 M each.
Fig. 2. Spectra of IR absorption of water and aqueous solutions of salts: 1 – Cu(NO3)2, 2 – LiNO3, 3 – (NH4)2SO4, 4 – dist. water. Concentrations of salts – 1 M.
In this figure one can also see the bands of vibrations of complex ions and the band of valence vibrations of water molecules. 2.4 Optical Absorption Spectroscopy Spectra of optical absorption were obtained using a Shimadzu UV-1800 spectrophotometer in the spectral region 190–1100 nm with an increment of 1 nm. Some of the obtained spectra can be seen in Fig. 3.
Fig. 3. Spectra of optical absorption of aqueous solutions of salts: 1 – CuSO4, 2 – Fe(NO3)3, 3 – NiSO4, 4 – Zn(NO3)2. Concentrations of salts – 1 M.
As one can note, the spectra have a number of features. Thus, in the spectra of aqueous solutions of nitrates, a peak in the region of 300 nm is observed, corresponding to the absorption of the nitrate anion. Absorption bands of nickel cations are observed in the wavelength ranges 350–450 nm, 530–850 nm and 900–1100 nm. A wide peak of absorption of copper cations is observed in the region 600–1000 nm. Aqueous solution of iron nitrate is characterized by intense absorption of iron cations in the wavelength range 200–600 nm.
3 Application of Machine Learning Methods 3.1 Integration of Physical Methods The integration of physical methods was carried out by simultaneously supplying the data of two or three types of spectroscopy to the input of the ML algorithms. Thus, the results for the following arrays of input data were compared (a minimal sketch of how such arrays are assembled is given after the list):
• Raman spectroscopy data only
• Infrared spectroscopy data only
• Absorption spectroscopy data only
• Integration of physical methods – simultaneous use of data from two or three types of spectroscopy
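In practice, such integration amounts to concatenating the per-sample spectral vectors along the feature axis; the sketch below assumes the spectra are stored as NumPy arrays with the dimensions given in Sect. 3.2 (the array names and random placeholder data are illustrative):

```python
import numpy as np

def build_inputs(raman=None, ir=None, oa=None):
    """Concatenate the selected spectroscopy arrays into one feature matrix.

    Each argument has shape (n_samples, n_channels):
    Raman ~ 2598 channels, IR ~ 2126, OA ~ 911 (see Sect. 3.2).
    """
    blocks = [x for x in (raman, ir, oa) if x is not None]
    if not blocks:
        raise ValueError("at least one spectroscopy type must be given")
    return np.hstack(blocks)

# Example with random placeholders instead of measured spectra:
n = 5
raman = np.random.rand(n, 2598)
ir = np.random.rand(n, 2126)
oa = np.random.rand(n, 911)
print(build_inputs(raman, ir).shape)       # (5, 4724) – Raman + IR
print(build_inputs(raman, ir, oa).shape)   # (5, 5635) – all three methods
```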
3.2 Dataset The dataset for training the ML models consisted of the Raman spectra, IR spectra, and OA spectra of the solutions. The input features were the intensities in the spectrum channels. For the Raman spectra, the dimension of the array of input features was 2598 values, for the IR spectra – 2126 values, for the OA spectra – 911 values, for the integration of two physical methods – 4724, 3509 or 3037 values, and for the integration of all three physical methods – 5635 values.
The determined parameters were the concentrations of eight ions: Zn2+, Cu2+, Li+, Fe3+, Ni2+, NH4+, SO42−, NO3−. Thus, 8 parameters were used as output features. The original dataset, which was used to carry out the study, contained 3760 patterns for the basic series, 400 patterns for the «golden» series, and three «silver» series with 200 patterns each. ML models were trained on the basic series and then applied to the data of all series. The dataset of the basic series was divided into training, validation and test sets, which contained 2660, 700 and 400 patterns, respectively. The training set was used to train the models and the test set for the final estimation of the quality of the algorithms on independent data. Depending on the model, the validation set could be used to select the moment to stop training or other parameters of the training algorithm.
3.3 Machine Learning Algorithms For each determined parameter, a separate model with a single output was constructed; we call this approach autonomous determination of parameters. This procedure is necessary to reduce the output dimension of the considered inverse problem. In this study, Python implementations of the following ML algorithms were used (hyperparameters were selected by grid search around their default values; a sketch of how models with these settings can be instantiated is given after the list):
• The most widespread neural network (NN) architecture – the multilayer perceptron (MLP) [10]. An MLP architecture with three hidden layers was used, consisting of 64 + 32 + 16 neurons. The activation function was sigmoidal for the hidden layers and linear for the output one; the learning rate for the hidden and output layers – 0.01; the momentum – 0.9. To prevent overfitting of the NN, the early stopping method was used: training stopped after exceeding 500 epochs after the minimum of the mean squared error on the validation set. The applicability of the MLP for solving the IP under investigation is due to the well-known fact that it is a universal approximator.
• Gradient boosting (GB) over decision trees (DT) [11]. GB training was carried out with the following parameters: number of DT – 500; DT depth – 3; learning rate – 0.1; fraction of features to consider when looking for the best split in the nodes of the DT – 50% of the total number of features; fraction of samples to be used for fitting the individual DT – 80% of the total number of samples; minimum number of objects in a leaf – 1; minimum number of objects required to split a DT node – 2.
• Random Forest (RF) [12]. The training parameters of RF: number of DT – 200; DT depth – 10; fraction of features to consider when looking for the best split in the nodes of the DT – 50%; minimum number of objects in a leaf – 1; minimum number of objects required to split a DT node – 2. The training of each DT was performed on a random subset of patterns of the training set (bootstrap sample).
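A minimal sketch of how the three models with the hyperparameters listed above could be set up with Keras [10] and scikit-learn [11, 12]; the loop over the eight ions illustrates the autonomous determination of parameters, all data arrays are placeholders, and the per-layer learning rate of the original MLP is approximated here by a single global learning rate:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from tensorflow import keras

def make_mlp(n_inputs):
    """MLP 64+32+16, sigmoid hidden layers, linear output, SGD with momentum."""
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        keras.layers.Dense(64, activation="sigmoid"),
        keras.layers.Dense(32, activation="sigmoid"),
        keras.layers.Dense(16, activation="sigmoid"),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                  loss="mse")
    return model

def make_gb():
    return GradientBoostingRegressor(n_estimators=500, max_depth=3,
                                     learning_rate=0.1, max_features=0.5,
                                     subsample=0.8, min_samples_leaf=1,
                                     min_samples_split=2)

def make_rf():
    return RandomForestRegressor(n_estimators=200, max_depth=10, max_features=0.5,
                                 min_samples_leaf=1, min_samples_split=2,
                                 bootstrap=True)

# Autonomous determination: one single-output model per ion.
ions = ["Zn2+", "Cu2+", "Li+", "Fe3+", "Ni2+", "NH4+", "SO42-", "NO3-"]
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=500,
                                           restore_best_weights=True)
# X_train, y_train, X_val, y_val are placeholders; y_* has one column per ion.
# for i, ion in enumerate(ions):
#     mlp = make_mlp(X_train.shape[1])
#     mlp.fit(X_train, y_train[:, i], validation_data=(X_val, y_val[:, i]),
#             epochs=10000, callbacks=[early_stop], verbose=0)
```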
4 Results of Computational Experiments and Their Discussion At the first stage of the computational experiment (Fig. 4), the dependence of the solution quality shown by the ML algorithms (NN, GB and RF) on the used array of input data was investigated for each determined parameter on the basic series. The mean absolute error (MAE) was used as the criterion for evaluating the solution. Columns of a particular color correspond to a particular array of input data.
To obtain a solution that does not depend on random factors (for the NN, such a factor is the random initialization of the weights), three models each of NN, GB, and RF were trained with different initializations (random seed values). The presented results were obtained by averaging the statistics of these three models. When the spectroscopic methods were used individually, in most cases the models trained on IR spectra had the lowest error level, with the exception of the ions that have distinct characteristic bands in the absorption spectra, i.e., Cu2+ and Ni2+. The integration of physical methods demonstrated results comparable to or better than the individual use of the spectroscopic methods. An improvement in solution quality from integrating physical methods was observed when the methods used had close error levels. The NN outperformed GB and RF in most cases in terms of solution quality. The exception is the Cu2+ ion, for which GB and RF showed a substantially lower error level when using data arrays containing OA spectra.
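A small sketch of this averaging over random initializations with the MAE criterion (the model factory and the data arrays are placeholders; for the NN, the seed would instead control the weight initialization):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def averaged_mae(make_model, X_train, y_train, X_test, y_test, seeds=(0, 1, 2)):
    """Train one model per random seed and average the test MAE."""
    maes = []
    for seed in seeds:
        model = make_model(random_state=seed)  # hypothetical factory, e.g. RF/GB
        model.fit(X_train, y_train)
        maes.append(mean_absolute_error(y_test, model.predict(X_test)))
    return float(np.mean(maes))
```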
Fig. 4. Columns – solution quality (MAE, M) for various determined parameters. Rows – results for various ML algorithms (NN, GB, RF). Different colors correspond to different arrays of input data.
Fig. 5. Columns – best solution quality (MAE, M) of NN for various determined parameters. Rows – results for solutions prepared in different types of water. Different colors correspond to different arrays of input data (same as in Fig. 4).
At the second stage of the computational experiment (Fig. 5), the quality of the NN solution on the basic, «golden» and «silver» series was studied for each determined parameter. The results corresponding to the arrays of input data on which the NN had the lowest error values are displayed. For spectra of solutions prepared in river water, the integration of physical methods in most cases demonstrated a lower error level than the separate use of the spectroscopic methods. The exception is the NO3− ion, for which the use of IR spectra was preferable in all cases. The difference between the basic, «golden» and «silver» series is revealed in the array of input data on which the NN shows the lowest error level. In the basic series, the integration of the three physical methods or the integration of Raman and IR spectroscopy data gave the best solution quality for all determined parameters. However, in the «golden» and «silver» series, the joint use of IR and OA spectroscopy data, Raman and OA spectroscopy data, or data from all three physical methods proved preferable. In addition, in some cases the lowest error level was obtained when spectroscopic methods (IR or OA spectroscopy data) were used individually. Also, when real natural water was used, there was a degradation of the solution quality relative to the results obtained on the basic series. The reason for this is the intensity and variability of river water fluorescence. At the same time, for some of the ions the solution quality remained adequate for all series. It should also be noted that in the case of the «golden» and «silver» series, the integration of physical methods made it possible in many cases to reduce the losses in solution quality relative to the basic series, compared to the individual use of spectra.
5 Conclusion In this study, the inverse problem of spectroscopic determination of the concentrations of heavy metal ions in solutions by their Raman spectra, infrared spectra and optical absorption spectra, as well as by their simultaneous use, was solved. The influence of the integration of physical methods on the magnitude of the error in solving the problem was studied. The following conclusions can be drawn from the results of the study:
• Joint use of various physical methods in most cases improved the solution quality of the problem relative to their individual use. The observed effect is demonstrated for different machine learning algorithms (neural networks, gradient boosting, and random forest). The integration of physical methods also positively affects the error of the inverse problem solution in the case of using real river water.
• Degradation of the results at joint use of spectroscopic methods is observed if one of the methods works significantly better than the other. This agrees with the effect previously observed on another dataset for the pair of "optical absorption – Raman" methods [9].
• It is necessary to develop new methods to improve the quality of the problem solution in the diagnostics of real waters, since for many ions there is a significant increase in the error level relative to the basic series.
Funding. This study has been performed at the expense of the Russian Science Foundation, grant no. 19-11-00333, https://rscf.ru/en/project/19-11-00333/.
References
1. Mitchell, S., Potash, E., Barocas, S., D'Amour, A., Lum, K.: Algorithmic fairness: choices, assumptions, and definitions. Annu. Rev. Stat. Appl. 8, 141–163 (2021)
2. Chen, Z., Khaireddin, Y., Swan, A.K.: Identifying the charge density and dielectric environment of graphene using Raman spectroscopy and deep learning. Analyst 147(9), 1824–1832 (2022)
3. Sarmanova, O., et al.: Machine learning algorithms to control concentrations of carbon nanocomplexes in a biological medium via optical absorption spectroscopy: how to choose and what to expect? Appl. Opt. 60(27), 8291–8298 (2021)
4. Dolenko, S.A., Burikov, S.A., Dolenko, T.A., Persiantsev, I.G.: Adaptive methods for solving inverse problems in laser Raman spectroscopy of multi-component solutions. Pattern Recognit Image Anal. 22, 550–557 (2012)
5. Li, Z., et al.: Smart ring resonator–based sensor for multicomponent chemical analysis via machine learning. Photonics Res. 9(2), B38–B44 (2021)
6. Acquarelli, J., van Laarhoven, T., Gerretzen, J., Tran, T.N., Buydens, L.M.C., Marchiori, E.: Convolutional neural networks for vibrational spectroscopic data analysis. Anal. Chim. Acta 954, 22–31 (2017)
7. Martelo-Vidal, M.J., Vázquez, M.: Application of artificial neural networks coupled to UV–VIS–NIR spectroscopy for the rapid quantification of wine compounds in aqueous mixtures. CyTA – J. Food 13(1), 32–39 (2014)
8. Malek, S., Melgani, F., Bazi, Y.: One-dimensional convolutional neural networks for spectroscopic signal regression. J. Chemom. 32(5), e2977–1–17 (2017)
9. Guskov, A., Laptinskiy, K., Burikov, S., Isaev, I.: Integration of data and algorithms in solving inverse problems of spectroscopy of solutions by machine learning methods. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) Advances in Neural Computation, Machine Learning, and Cognitive Research VI. NEUROINFORMATICS 2022. SCI, vol. 1064, pp. 395–405. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-19032-2_41
10. Keras: Deep Learning for Humans. https://keras.io/. Accessed 19 June 2023
11. Gradient Boosting Regressor in scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html. Accessed 19 June 2023
12. Random Forest Regressor in scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html. Accessed 19 June 2023
The Use of a priori Information in the Neural Network Solution of the Inverse Problem of Exploration Geophysics Igor Isaev1,2(B), Ivan Obornev1,3, Eugeny Obornev3, Eugeny Rodionov3, Mikhail Shimelevich3, and Sergey Dolenko1 1 D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State University,
Moscow, Russia [email protected], [email protected] 2 Kotelnikov Institute of Radio Engineering and Electronics, Russian Academy of Sciences, Moscow, Russia 3 S.Ordjonikidze Russian State Geological Prospecting University, Moscow, Russia
Abstract. This study is devoted to solving inverse problems of exploration geophysics, which consist in reconstructing the spatial distribution of the properties of the medium in the thickness of the earth from the geophysical fields measured on its surface. We consider the methods of gravimetry, magnetometry, and magnetotelluric sounding, as well as their integration, i.e. simultaneous use of data from several geophysical methods to solve the inverse problem. To implement such integration, in our previous studies we have proposed a parameterization scheme that describes a layered geophysical model with fixed layer properties, in which the determined parameters were the positions of the boundaries between the layers. In the present study, this parameterization scheme is complicated so that the properties of the layers vary from pattern to pattern in the data set. To improve the quality of neural network solution of the described inverse problem, we consider an approach based on the use of a priori information about the physical properties of the layers, in which this information is used directly as additional input features for the neural network. Keywords: Inverse Problems · Exploration Geophysics · A priori Information · Integration of Geophysical Methods · Joint Inversion · Neural Network
1 Introduction This study is devoted to solving the inverse problems (IP) of exploration geophysics (EG). The general statement of such an IP consists in reconstructing the distribution of some physical parameters of the medium in the thickness of the earth crust from the physical fields measured on its surface. The purpose is to study the structure of the near-surface layer of the earth in prospecting for mineral resources. Here we consider the IP of gravimetry (G), magnetometry (M) and magnetotelluric sounding (MT), which consist in reconstructing the spatial distribution of the density, magnetization, and electrical
resistivity of the medium in the earth crust from the values of the gravitational, magnetic and magnetotelluric fields, respectively. All these IPs are ill-posed, which generally leads to a low quality of the solution and high sensitivity to noise in the input data. A general approach to reducing the ill-posedness of IPs is to use additional information. Here, both the response of the system to other types of external influences and a priori knowledge about the system under consideration can be used as such additional information. The first approach is known as the integration of several physical methods, or joint inversion, and it can be used with traditional numerical solution methods [1, 2], with machine learning methods [3–7], and when they are used together [8, 9]. In the second approach, a priori knowledge about the system can be either taken into account at the stage of creating a training data set or directly embedded in the machine learning model. When forming a training set, the a priori information can be assumptions about the structure of the geological section, implemented either by using parameterization schemes with a rigidly specified spatial structure (the so-called "class-generating models") [10–12] or by using narrow models of the media that describe a certain class of geological sections, for example, a horizontally layered model [6, 7, 9, 13]. The injection of a priori information into machine learning methods can be carried out through the use of physical equations within the model itself, for example, in physics-informed neural networks [14–16]. Another approach is to feed a priori information directly as additional input features to a neural network. A priori information can also be used in other ways, for example, in preprocessing or feature selection [17]. The object of the authors' investigations is the integration of geophysical methods. To implement such integration, in our previous studies a parameterization scheme was proposed that describes a layered geophysical model with fixed layer properties. In the present study, this parameterization scheme is complicated so that the properties of the layers vary among the patterns in the data set. The purpose of this study is to consider an approach to improving the quality of the neural network solution of the described IP, based on the use of a priori information about the properties of the layers, in which this information is fed directly as input features to the neural network.
2 Physical Statement of the Problem 2.1 Parameterization Schemes In order to implement the integration of various geophysical methods when solving an IP, it is necessary that the determined parameters of each of the methods be the same. This approach corresponds to the geometric formulation of the problem, which consists in the determination of the boundaries of geophysical objects. In particular, in this study we consider two parameterization schemes in which the parameters describe the boundaries of the geological layers of a layered medium. Both parameterization schemes were four-layer two-dimensional models (Fig. 1) corresponding to a section of the Norilsk region. The first layer modeled the basalt layer, the second and the fourth ones – terrigenous carbonate deposits of the Tunguska series, and the third one – gabbro-dolerites with massive copper-nickel-platinum ores.
The dimensions of the section were 15 km wide and 3 km deep. The physical field measurement step was 0.5 km – a total of 31 measurement points along the profile. The discreteness of changing the boundaries of the geological layers along the profile was 1 km – a total of 15 depth values for each layer. In this problem, the values of the depths of the lower boundaries of the three upper layers were determined. The discreteness of changing the depth values was 0.02 km. For the first parameterization scheme, each layer was characterized by fixed values of density, magnetization, and resistivity, which did not change within the layer and which were the same across the entire data set. The physical characteristics of the second and the fourth layers were the same. For the second parameterization scheme, the physical properties of the layers also did not change within a layer, but varied randomly from pattern to pattern within ±10% of the values of the first parameterization scheme. The physical characteristics of the fourth layer were the same as in the first parameterization scheme. The values of the physical and spatial characteristics of the layers are shown in Table 1. For the physical properties, the single value specified in the upper line for each layer corresponds to the first parameterization scheme; the range specified in the lower line for each layer corresponds to the second parameterization scheme.

Table 1. Physical and spatial properties of the layers.

Layer  Description                       Density σ,    Magnetization μ,  Resistivity ρ,  Lower bound       Thickness,
                                         kg/m3         A/m               Ω·m             min–max, km       min–max, km
1      Basalt                            2800          3.0               2000            1.00–1.48         1.00–1.48
                                         2520–3080     2.7–3.3           1800–2200
2      Terrigenous carbonate deposits    2550          0.5               100             1.80–1.98         0.32–0.98
       of the Tunguska series            2295–2805     0.45–0.55         90–110
3      Gabbro-dolerites with massive     3000          0.9               1000            2.20–1.98         0.22–0.48
       copper-nickel-platinum ores       2700–3300     0.81–0.99         900–1100
4      Terrigenous carbonate deposits    2550          0.5               100             –                 –
       of the Tunguska series
2.2 Data The original datasets were formed in the following way. Initially, for each pattern, the layer depth values were randomly set within the ranges shown for the spatial properties in Table 1. Then, fixed values of the physical parameters were set for the first parameterization scheme, and random values from the ranges shown in Table 1 for the physical properties were set for the second parameterization scheme. Further, by solving the direct problem with finite difference methods, the values of the geophysical fields were calculated for each of the selected geophysical methods. Thus, the patterns of both parameterization schemes coincided in terms of their spatial parameters (a schematic sketch of this data generation procedure is given below).
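A schematic sketch of this generation procedure. The finite-difference forward solver is only a stub, no ordering constraints between layer boundaries are enforced, and the per-layer ranges are those listed in Table 1 (the layer-3 lower bound printed there as "2.20–1.98" is read here as 1.98–2.20), so this illustrates the sampling logic only, not the authors' actual generator:

```python
import numpy as np

N_POINTS = 15        # depth values per layer along the profile
DEPTH_STEP = 0.02    # km, discreteness of the depth values
# Lower-bound ranges of the three determined layers, km (Table 1, assumed reading).
DEPTH_RANGES = [(1.00, 1.48), (1.80, 1.98), (1.98, 2.20)]
# Fixed physical properties of layers 1-3: (density, magnetization, resistivity).
FIXED_PROPS = np.array([[2800.0, 3.0, 2000.0],
                        [2550.0, 0.5, 100.0],
                        [3000.0, 0.9, 1000.0]])

rng = np.random.default_rng(0)

def sample_depths():
    """Random layer boundaries, quantized to the 0.02 km depth grid."""
    depths = np.empty((3, N_POINTS))
    for i, (lo, hi) in enumerate(DEPTH_RANGES):
        raw = rng.uniform(lo, hi, size=N_POINTS)
        depths[i] = np.round(raw / DEPTH_STEP) * DEPTH_STEP
    return depths

def sample_properties(fixed=True):
    """Scheme 1: fixed properties; scheme 2: random +-10% variation."""
    if fixed:
        return FIXED_PROPS.copy()
    return FIXED_PROPS * rng.uniform(0.9, 1.1, size=FIXED_PROPS.shape)

def forward_problem(depths, props):
    """Placeholder for the finite-difference forward solvers (G, M, MT fields)."""
    raise NotImplementedError

# pattern = (sample_depths(), sample_properties(fixed=False))
```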
Fig. 1. An example of the geological section within the considered parameterization schemes (top), and the corresponding components of the fields used in this study. Solid lines – the first parameterization scheme with fixed properties of the layers, dashed lines – the second parameterization scheme with unfixed properties of layers (varying from pattern to pattern).
The input dimension of the problem was:
• Gravimetry: 1 field component * 31 measurement points (pickets) = 31 features
• Magnetometry: 1 field component * 31 pickets = 31 features
• MTS: 2 field components * 1 frequency * 31 pickets = 62 features
• Each geophysical method also had three values of the physical properties of the layers, which could optionally be used as input features.
The output dimension of the problem was:
• 3 layers * 15 values of layer boundary depth = 45 parameters.
A total of 30 000 patterns were calculated. An example of a geophysical section and the corresponding fields of the first and second parameterization schemes is shown in Fig. 1.
3 Methodical Statement of the Problem 3.1 Datasets The original datasets were divided into training, validation and test subsets in such a way that patterns with the same spatial parameters fell into the same subset (a sketch of such a grouped split is given below). The ratio of the numbers of patterns in the subsets was 70:20:10, and their sizes were 21,000, 6,000, and 3,000 patterns, respectively. 3.2 Reducing the Dimensionality of the Problem To reduce the output dimensionality, the so-called autonomous parameter determination [6] was used, where a separate neural network with one output was trained for each determined parameter. For the gravimetry and magnetometry problems, no reduction of the input dimension was carried out. For the magnetotelluric problem, the reduction was performed by taking only two field components at the same frequency, which provides an acceptable quality of the solution [17]. 3.3 Neural Networks All neural networks (NN) used in this study were trained in the same way. The type of NN used was the multilayer perceptron, which is a universal approximator. The architecture used had a single hidden layer with 32 neurons. To reduce the influence of the weight initialization on NN training, 5 networks were used for each case under consideration, and the statistical indicators of their application were averaged. To prevent overtraining, early stopping on the validation dataset was used: training stopped after 500 epochs with no improvement of the mean squared error (MSE) on the validation set.
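A sketch of the grouped 70:20:10 split mentioned in Sect. 3.1, keeping all patterns that share the same spatial parameters in one subset (scikit-learn is an assumed tool choice here, and spatial_id is a hypothetical NumPy array identifying the spatial configuration of each pattern):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(X, y, spatial_id, seed=0):
    """70/20/10 train/validation/test split by spatial-configuration groups."""
    gss = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=seed)
    train_idx, rest_idx = next(gss.split(X, y, groups=spatial_id))

    # Split the remaining 30% into validation (20%) and test (10%).
    gss2 = GroupShuffleSplit(n_splits=1, train_size=2 / 3, random_state=seed)
    val_rel, test_rel = next(gss2.split(X[rest_idx], y[rest_idx],
                                        groups=spatial_id[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```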
3.4 Use of Additional a Priori Information The training of the NN only on the geophysical field values was considered as the reference. The use of a priori information consisted in adding the values of the physical properties of the layers as additional input features – in total, three features for each geophysical method. 3.5 Integrating Geophysical Methods and the Input Dimension When integrating geophysical methods, the data of two or three geophysical methods were simultaneously fed to the input of the NN. For individual use of data from the gravimetry and magnetometry methods, the NN input received 31 (31 + 3) features, for individual use of MTS data – 62 (62 + 3) features, for simultaneous use of data from two geophysical methods – 62 (62 + 6) or 93 (93 + 6) features, and for simultaneous use of data from all three methods – 124 (124 + 9) features.
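A sketch of how the input feature matrix with optional a priori information can be assembled, together with the relative error used in Sect. 4 (RMSE normalized by the range of the determined parameter); all array names are illustrative:

```python
import numpy as np

def assemble_inputs(field_blocks, layer_props=None):
    """Concatenate field data of the selected methods and, optionally,
    the three physical layer properties per method, into one input matrix.

    field_blocks: list of arrays of shape (n_samples, n_features),
    e.g. gravimetry (31), magnetometry (31), MTS (62).
    layer_props:  optional list of arrays of shape (n_samples, 3).
    """
    blocks = list(field_blocks)
    if layer_props is not None:
        blocks.extend(layer_props)
    return np.hstack(blocks)

def relative_rmse(y_true, y_pred):
    """RMSE normalized by the range of the determined parameter."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (np.max(y_true) - np.min(y_true))

# Example shapes: G + M + MT with a priori properties -> 124 + 9 = 133 features.
n = 4
g, m, mt = np.ones((n, 31)), np.ones((n, 31)), np.ones((n, 62))
props = [np.ones((n, 3))] * 3
print(assemble_inputs([g, m, mt], props).shape)  # (4, 133)
```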
4 Results In this study, as an indicator of the quality of the solution, we used the relative error calculated as the root mean squared error (RMSE) normalized by the range of change of the determined parameter. The results of the NN solution of the EG IP for various input data and for different training and test data are presented in Fig. 2. The results of the NN solution of the EG IP for the more general second parameterization scheme with variable layer properties (yellow bars), as expected, turned out to be worse than for the narrower first parameterization scheme with fixed layer properties (red bars). On the other hand, the use of a narrow parameterization scheme can be considered an indirect use of a priori information about the layer properties, so in this case such use of a priori information gives a positive effect. Direct addition of information about the physical properties of the layers as additional input features for the NNs (light blue bars) makes it possible to improve the quality of the NN solution of the EG IP relative to the basic solution (yellow bars). Thus, an improvement in the quality of the solution is also observed with the direct use of a priori information in this form. Cross-application of NNs trained on one dataset to the other dataset gives the following results. When NNs were trained on the data of the narrower parameterization scheme with fixed layer properties and applied to the data of the parameterization scheme with variable layer properties (light red bars), the quality of the solution, as expected, deteriorated compared to applying the NNs to the same kind of data (red bars).
Fig. 2. The quality (relative root mean squared error) of the solution of the IP for various input data and for different train and test data. G, M, MT – individual use of gravimetry, magnetometry and MTS data; G+M, G+MT, M+MT, G+M+MT – simultaneous use of data from several geophysical methods (integration of methods). The first character in the legend denotes the training data, the second character – the test data. f – data corresponding to parameterization scheme with fixed properties of layers, u - data corresponding to parameterization scheme with unfixed (varying by patterns) properties of layers, au - training with additional a priori information about layer properties for a parameterization scheme with unfixed layer properties.
When training NNs on the data of the parameterization scheme with variable layer properties and applying to the data of the parameterization scheme with fixed layer properties (orange and blue bars), the quality of the solution is slightly improved compared to applying networks to the same data (yellow and light blue bars). This effect is observed both when training NNs only on the data of geophysical fields (orange and yellow bars), and with additional use of a priori information about the physical properties of the layers (blue and light blue bars). The reason for this effect can be the following: the parametrization scheme with fixed values of the physical properties of the layers is a subset of the parametrization scheme with variable properties of the layers, and the fixed properties of the layers are exactly in the center of the distribution of the variable properties. For all the three layers and for all combinations of training and test data, simultaneous use of data from any two geophysical methods reduces the error compared to the individual use of data from any of them. The best result is shown by the simultaneous use of data from all the three geophysical methods.
5 Conclusion

Based on the results of this study, the following conclusions can be drawn regarding the NN solution of the EG IP:

• The positive effect of integrating data from different geophysical methods, previously observed in the scheme with fixed layer properties, was confirmed on a dataset with the parameterization scheme with variable layer properties.
  – Using the data of any two geophysical methods to solve the EG IP gives a better result than using each method separately.
  – The best result was provided by simultaneous use of the data from all three geophysical methods.
• The use of a priori information in the NN solution of the EG IP has a positive effect.
  – Direct addition of information about the physical properties of the layers as input features improves the quality of the NN solution of the EG IP.
  – Indirect introduction of a priori information, through the use of the narrower parameterization scheme with fixed layer properties, gives a better NN solution of the EG IP than the more universal parameterization scheme with variable layer properties over the dataset.

The positive results of this study give grounds for further elaboration of parameterization schemes towards greater generality and universality.

Funding. This study has been performed at the expense of the Russian Science Foundation, grant no. 19-11-00333, https://rscf.ru/en/project/19-11-00333/.
References

1. Gallardo, L.A., Fontes, S.L., Meju, M.A., Buonora, M.P., De Lugao, P.P.: Robust geophysical integration through structure-coupled joint inversion and multispectral fusion of seismic reflection, magnetotelluric, magnetic, and gravity images: example from Santos Basin, offshore Brazil. Geophysics 77(5), B237–B251 (2012). https://doi.org/10.1190/geo2011-0394.1
2. Cai, H., Zhdanov, M.S.: Joint inversion of gravity and magnetotelluric data for the depth-to-basement estimation. IEEE Geosci. Remote Sens. Lett. 14(8), 1228–1232 (2017). https://doi.org/10.1109/LGRS.2017.2703845
3. Akca, İ., Günther, T., Müller-Petke, M., Başokur, A.T., Yaramanci, U.: Joint parameter estimation from magnetic resonance and vertical electric soundings using a multi-objective genetic algorithm. Geophys. Prospect. 62(2), 364–376 (2014). https://doi.org/10.1111/1365-2478.12082
4. Roux, E., et al.: Joint inversion of long-period magnetotelluric data and surface-wave dispersion curves for anisotropic structure: application to data from Central Germany. Geophys. Res. Lett. 38(5), L05304 (2011). https://doi.org/10.1029/2010GL046358
5. Yadav, A., Yadav, K., Sircar, A.: Feedforward neural network for joint inversion of geophysical data to identify geothermal sweet spots in Gandhar, Gujarat, India. Energy Geosci. 2(3), 189–200 (2021). https://doi.org/10.1016/j.engeos.2021.01.001
6. Isaev, I., Obornev, I., Obornev, E., Rodionov, E., Shimelevich, M., Dolenko, S.: Integration of geophysical methods for solving inverse problems of exploration geophysics using artificial neural networks. In: Kosterov, A., Bobrov, N., Gordeev, E., Kulakov, E., Lyskova, E., Mironova, I. (eds.) Problems of Geocosmos–2020. Springer Proceedings in Earth and Environmental Sciences, pp. 77–87. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-91467-7_7
7. Isaev, I., Obornev, I., Obornev, E., Rodionov, E., Shimelevich, M., Dolenko, S.: Multitasking learning in missing data recovery for the integration of geophysical methods in solving an inverse problem of exploration geophysics. Procedia Comput. Sci. 213, 777–784 (2022). https://doi.org/10.1016/j.procs.2022.11.134
8. Hu, Y., et al.: Deep learning-enhanced multiphysics joint inversion. In: First International Meeting for Applied Geoscience & Energy, pp. 1721–1725. Society of Exploration Geophysicists (2021). https://doi.org/10.1190/segam2021-3583667.1
9. Zhou, H., et al.: Joint inversion of magnetotelluric and seismic travel time data with intelligent interpretation of geophysical models. In: Second International Meeting for Applied Geoscience & Energy, pp. 1900–1904. Society of Exploration Geophysicists and American Association of Petroleum Geologists (2022). https://doi.org/10.1190/image2022-3751528.1
10. Spichak, V., Popova, I.: Artificial neural network inversion of magnetotelluric data in terms of three-dimensional earth macroparameters. Geophys. J. Int. 142(1), 15–26 (2000). https://doi.org/10.1046/j.1365-246x.2000.00065.x
11. Spichak, V., Fukuoka, K., Kobayashi, T., Mogi, T., Popova, I., Shima, H.: ANN reconstruction of geoelectrical parameters of the Minou fault zone by scalar CSAMT data. J. Appl. Geophys. 49(1–2), 75–90 (2002). https://doi.org/10.1016/S0926-9851(01)00100-8
12. Montahaei, M., Oskooi, B.: Magnetotelluric inversion for azimuthally anisotropic resistivities employing artificial neural networks. Acta Geophys. 62(1), 12–43 (2014). https://doi.org/10.2478/s11600-013-0164-7
13. Isaev, I., Obornev, E., Obornev, I., Shimelevich, M., Dolenko, S.: Neural network recognition of the type of parameterization scheme for magnetotelluric data. In: Kryzhanovsky, B., Dunin-Barkowski, W., Redko, V., Tiumentsev, Y. (eds.) NEUROINFORMATICS 2018. SCI, vol. 799, pp. 176–183. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01328-8_19
14. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045
15. Liu, W., Wang, H., Xi, Z., Zhang, R., Huang, X.: Physics-driven deep learning inversion with application to magnetotelluric. Remote Sens. 14(13), 3218 (2022). https://doi.org/10.3390/rs14133218
16. Guo, R., Huang, T., Li, M., Zhang, H., Eldar, Y.C.: Physics-embedded machine learning for electromagnetic data imaging: examining three types of data-driven imaging methods. IEEE Signal Process. Mag. 40(2), 18–31 (2023). https://doi.org/10.1109/MSP.2022.3198805
17. Isaev, I., et al.: Using domain knowledge for feature selection in neural network solution of the inverse problem of magnetotelluric sounding. In: Samsonovich, A.V., Gudwin, R.R., Simões, A.D.S. (eds.) BICA 2020. AISC, vol. 1310, pp. 115–126. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65596-9_15
Neural Network Theory, Concepts and Architectures
Study of Rescaling Mechanism Utilization in Binary Neural Networks
Ilia Zharikov(B) and Kirill Ovcharenko
Moscow Institute of Physics and Technology, Moscow, Russia
[email protected]
Abstract. Single Image Super Resolution (SISR) is a common on-device task for enhancing the quality of visual data. Deep Convolutional Neural Networks (DCNN) have recently shown great results in this field. However, DCNN are poorly suited to resource-limited devices, because they demand significant amounts of memory and computation. Binary neural networks (BNN) provide a promising approach to reducing computational complexity and speeding up model inference. SISR models are much more vulnerable than image classification models to performance degradation when the precision of weights is decreased, due to the complexity of a task that relies on dense pixel-level predictions; hence, they suffer a significant quality drop when binarized. This paper investigates the importance of restricting information in BNN and proposes several binary block modifications based on different rescaling mechanisms. We implement different rescaling modules in the binary block, identify the most effective rescaling block locations, and show that they increase model performance. Our modifications outperform existing BNN on benchmark datasets, quantitatively demonstrating the importance of the rescaling mechanism for increasing BNN quality.
Keywords: Binary Neural Network · Single Image Super Resolution · Binarization · Model compression

1 Introduction
Single Image Super Resolution (SISR) [22] aims to restore a High Resolution (HR) image from its corrupted Low Resolution (LR) counterpart. This task is important because of its various applications in medical imaging [5] and surveillance instruments [1]. Despite active research in this field, progress has been limited by several challenges. The main obstacle is that the desired output in SISR is much more diverse than the input, so the model is required to perform dense pixel-level prediction and is therefore bound to be more complex. Recent advances in the field of SISR owe their success to Deep Convolutional Neural Networks (DCNN), which show state-of-the-art results in a wide range of
computer vision problems, such as image classification, semantic segmentation, etc. However, these models are usually complicated and demand a lot of memory and computational resources, thus hindering their deployment on mobile devices, drones and other machines with limited GPU memory. Lately, different methods of reducing the complexity of these models have been proposed. While some papers focus on pruning [28] and knowledge distillation [27], other works introduce quantization [6] as a way to decrease the memory needed. The most extreme form of quantization is binarization. Binary Neural Networks (BNN) use only 1 bit to represent each parameter, drastically decreasing the space required to store the model. Moreover, with all parameters of the model set to {−1, 1}, most of the calculations can be conducted using XNOR and Bitcount operations [21]. This approach seems promising, as it suggests new ways to design hardware that can handle and exploit big and complex neural networks. This paper suggests several modifications of the convolutional block that help to improve BNN performance in SISR tasks. Our contributions can be summarized as follows:
– We advance the idea from [25] and [8] of restricting information from the input to increase learning productivity by implementing rescaling modules into the binary block.
– We comprehensively investigate possible rescaling blocks' positions and find the most effective modification.
– Our method shows state-of-the-art results, quantitatively proving the importance of applying rescaling in the binary block.
2 Related Work

2.1 Binary Neural Networks
It is obvious that BNN sacrifice precision and quality, as they have much less capacity and representational potential than Full-Precision (FP) networks. ReActNet [17] suggests generalized binarization and activation functions that help to shift distributions, which significantly increases the representational capacity of the binary model. The paper [3] presents a BNN architecture search framework with bimodal parameter perturbation that reduces the sharpness of the loss surface and improves the stability of the training process. LAB [7] proposes a learnable activation binarizer which allows the network to learn layer-wise binarization kernels. AdaSTE [12] suggests an adaptive variant of the original straight-through estimator that can act like a linear mapping in the backward pass. BinaryViT [13] increases the representational capability of transformer-based binary architectures by introducing CNN operations into the network. IR-Net [20] reduces information loss by balancing weights to achieve maximum information entropy in forward propagation. Building on this, IR2Net [25] proposes two essential components of the learning process: Information Restriction and Information Recovery. BNext [8] also applies an attention mechanism to obtain the
key information from the full-precision activations and smooth out the loss landscape. However, the last two papers investigate only the impact of these methods on the performance of image classification models. Another way of extracting the necessary information was proposed in [9], where a squeeze-and-excitation block is added to every transformation of a feature map, so that it can learn dependencies between channels (which are expected to concentrate on different features).

2.2 Binary Neural Networks in SISR Task
Previous works in this field propose different methods of maintaining competitive accuracy while improving efficiency. The paper [18] focuses on residual block binarization, which helps to compress a significant part of the model's parameters; however, full-precision activations keep the computational complexity of the model quite high. BBCU [24] proposed an effective binary convolutional unit that can be used in any architecture that relies on residual connections. It provides much more efficient training and inference, but oversimplifies weight binarization. Moreover, this block modification implies removal of the batch normalization block, while recent studies [23] show that models without the BatchNorm module (e.g. EDSR [16]) exhibit a performance drop during binarization, as they have diverse distributions between channels and layers. The paper [15] proposes a binarization scheme based on local means, which considers the distinctive features of the SISR problem.
3 Proposed Method

3.1 Preliminaries
In this section we define the basic binarization operations that are used to build the Binary Convolutional Block. Let X_t^f ∈ R^{H×W×C_in} and W_t^f ∈ R^{K_h×K_w×C_in×C_out} be the full-precision activations and full-precision convolution weights on the t-th layer, respectively. Here H and W denote the input feature map height and width, K_h and K_w are the height and width of the convolution kernel, C_in stands for the number of input channels and C_out is the number of output channels. Then X_t^b ∈ {−1, 1}^{H×W×C_in} and W_t^b ∈ {−1, 1}^{K_h×K_w×C_in×C_out} are the binary representations of the corresponding full-precision parameters. The binary representations can be obtained through the sign function as follows:

$$x_{i,j,k}^{b} = \mathrm{Sign}(x_{i,j,k}^{f}) = \begin{cases} +1, & x_{i,j,k}^{f} > 0 \\ -1, & x_{i,j,k}^{f} \le 0 \end{cases} \qquad (1)$$

where i ∈ [0, H) ∩ Z, j ∈ [0, W) ∩ Z, k ∈ [0, C_in) ∩ Z, and x_{i,j,k}^f ∈ X_t^f, x_{i,j,k}^b ∈ X_t^b are single full-precision and binary activations, respectively. Obviously, the derivative of the Sign function cannot be utilized in the training process, as it is impossible to propagate gradients through it. We use an approximation of the Sign derivative, which is defined as follows:

$$\mathrm{Approx}\!\left(\frac{\partial\,\mathrm{Sign}(x_{i,j,k})}{\partial x_{i,j,k}}\right) = \begin{cases} 2 + 2x_{i,j,k}, & -1 \le x_{i,j,k}^{f} < 0 \\ 2 - 2x_{i,j,k}, & 0 \le x_{i,j,k}^{f} < +1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where i ∈ [0, H) ∩ Z, j ∈ [0, W) ∩ Z, k ∈ [0, C_in) ∩ Z. We use the PReLU activation function, which is defined as follows:

$$\mathrm{PReLU}(x_{i,j,k}) = \begin{cases} x_{i,j,k}, & x_{i,j,k} \ge 0 \\ a_{k} \cdot x_{i,j,k}, & x_{i,j,k} < 0 \end{cases} \qquad (3)$$

where i ∈ [0, H) ∩ Z, j ∈ [0, W) ∩ Z, k ∈ [0, C_in) ∩ Z, x_{i,j,k} is an element of the input feature map X ∈ R^{H×W×C_in}, and a_k ∈ R is the learnable parameter controlling the negative slope. During convolution weight binarization we use the scaled Sign function:

$$w_{i,j,k,l}^{b} = \alpha_{l} \cdot \mathrm{Sign}(w_{i,j,k,l}^{f}) = \begin{cases} +\alpha_{l}, & w_{i,j,k,l}^{f} > 0 \\ -\alpha_{l}, & w_{i,j,k,l}^{f} \le 0 \end{cases} \qquad (4)$$

where i ∈ [0, H) ∩ Z, j ∈ [0, W) ∩ Z, k ∈ [0, C_in) ∩ Z, l ∈ [0, C_out) ∩ Z. Full-precision and binary weights are denoted as w_{i,j,k,l}^f ∈ W_t^f and w_{i,j,k,l}^b ∈ W_t^b, and α_l ∈ R represents the scale factor. The optimization task of finding the optimal scale factor for the binary weights can be expressed as follows:

$$\alpha_{l}^{*} = \arg\min_{\alpha} \left\| W_{t,l}^{f} - \alpha W_{t,l}^{b} \right\| \qquad (5)$$

where W_{t,l}^f ∈ R^{K_h×K_w×C_in} and W_{t,l}^b ∈ {−1, 1}^{K_h×K_w×C_in}. In this work we use the optimal value α_l = α* = ||W_{t,l}^f||_1 / n, where n is the number of elements in W_{t,l}^f.
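A compact PyTorch sketch of these primitives (an illustration, not the authors' code; the weight layout follows PyTorch's (C_out, C_in, K_h, K_w) convention rather than the (K_h, K_w, C_in, C_out) order used in the text):

```python
import torch

class BinarySign(torch.autograd.Function):
    """Sign binarization of Eq. (1) with the piecewise-linear surrogate gradient of Eq. (2)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad = torch.zeros_like(x)
        grad = torch.where((x >= -1) & (x < 0), 2 + 2 * x, grad)
        grad = torch.where((x >= 0) & (x < 1), 2 - 2 * x, grad)
        return grad_out * grad

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Scaled sign of Eq. (4) with the per-output-channel scale of Eq. (5):
    alpha_l = ||W_l||_1 / n, where n is the number of elements per filter."""
    n = w[0].numel()
    alpha = w.abs().sum(dim=(1, 2, 3), keepdim=True) / n
    sign = torch.where(w > 0, torch.ones_like(w), -torch.ones_like(w))
    return alpha * sign

# usage: xb = BinarySign.apply(x); wb = binarize_weights(conv.weight)
```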
3.2 Rescaling Mechanism
EDSR [16] showed that applying Batch Normalization has a negative impact on quality in pixel-level tasks such as SISR, but the experiments conducted in [24] show that spreading the distribution of values is necessary for BNNs. For that reason, we introduce a scale block that applies a linear transformation to the output of the binary convolution. During binarization, the model is bound to lose some representational capacity and suffer a performance decrease. Previous works [8, 25] focus on applying an attention mechanism to help the model capture the most important features and dependencies. Further advancing the idea of restricting information, we suggest rescaling modules that help the model dynamically extract the necessary features from the input. We investigate several different rescaling mechanisms (Fig. 1): a simple Squeeze-and-Excitation rescaling, a Spatial and a Channel rescaling block, denoted as SE, Spat and Chan, respectively.
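Since Fig. 1 is not reproduced here, the following PyTorch sketch shows one plausible minimal layout of the three gates; the channel reduction, kernel size and pooling choices are assumptions, and each module returns a positive sigmoid gate that the caller multiplies into the feature map:

```python
import torch
import torch.nn as nn

class SERescale(nn.Module):
    """Squeeze-and-Excitation style gate: one sigmoid value per channel from pooled features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.fc(x)                         # (N, C, 1, 1) gate

class SpatialRescale(nn.Module):
    """Spatial gate: one sigmoid value per pixel, shared across channels."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        return torch.sigmoid(self.conv(x))        # (N, 1, H, W) gate

class ChannelRescale(nn.Module):
    """Channel gate: one sigmoid value per channel from a 1x1 convolution on pooled features."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(channels, channels, 1)
    def forward(self, x):
        return torch.sigmoid(self.conv(self.pool(x)))  # (N, C, 1, 1) gate
```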
Fig. 1. Overview of the proposed rescaling blocks architectures.
It is worth mentioning that the sigmoid function produces only positive values, and therefore does not change the signs of the activations in the forward pass, while helping to equalize feature map ranges across different dimensions in the binary or residual branches, which can be crucial for the final performance [24].

3.3 Binary Convolutional Block Design
Previous works have displayed the importance of the residual connection in the binary convolutional block, especially in the SISR task, so we keep it for every binary convolution to transfer the full-precision information through the block. Moreover, BBCU [24] shows that the activation function narrows the negative part of the residual connection, thus losing negative full-precision information. On that account, we keep the idea of moving the residual connection out of the activation function. The Base block consists of two similar binary blocks, and a single residual connection is applied to the final result. The paper [11] investigates the importance of implementing two residual connections into the block and proves that they maintain a more continuous information flow. The quantitative results in Table 1 coincide with the previous research results. Thus, we based our experiments on the Base-Residual block structure, which applies residual connections to both convolutions in the block. We propose several modifications that aim to either restrict information from the previous layer or change the distribution of binary convolution inputs and outputs. The proposed binary block consists of two similar parts, connected with two residual connections. Each part can be expressed as follows:

$$X_{t+1}^{f} = \mathrm{PReLU}(O_t) \cdot R_2(O_t) \cdot R_3(X_t^{f}) \cdot R_4(X_t^{f}) + X_t^{f} \cdot R_5(X_t^{f}) \qquad (6)$$

where O_t = a_t · (X_t^b ∗ W_t^b) + b_t, X_t^b = Sign(R_1(X_t^f)) are the binary activations, a_t and b_t denote learnable scaling and shifting parameters, and ∗ is the convolution operation.
For each i ∈ {1, …, 5}, the function R_i is substituted with one of the rescaling blocks or with the identity operator to obtain a specific modification: R_i ∈ {I, SE, Spat, Chan}.

Table 1. Quantitative comparison of Base and Base-Residual blocks' results. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
Base | 31.68 (−0.33) / 0.888 (−0.004) | 28.26 (−0.21) / 0.774 (−0.004) | 27.34 (−0.13) / 0.728 (−0.004) | 25.35 (−0.35) / 0.760 (−0.012)
Base-Residual | 31.68 (−0.33) / 0.888 (−0.004) | 28.27 (−0.20) / 0.774 (−0.004) | 27.35 (−0.12) / 0.728 (−0.004) | 25.38 (−0.32) / 0.761 (−0.011)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
Base | 37.61 (−0.44) / 0.975 (−0.003) | 33.09 (−0.48) / 0.963 (−0.003) | 31.85 (−0.34) / 0.977 (−0.001) | 30.79 (−1.25) / 0.905 (−0.006)
Base-Residual | 37.60 (−0.45) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.85 (−0.34) / 0.977 (−0.001) | 30.80 (−1.24) / 0.905 (−0.006)
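One half of the block in Eq. (6) can be sketched in PyTorch as follows; this is an illustrative reading of the equation (every R_i treated as a multiplicative sigmoid gate, and a lightweight straight-through binarization standing in for Eqs. (1)–(5)), not the authors' implementation:

```python
import torch
import torch.nn as nn

def ste_sign(t: torch.Tensor) -> torch.Tensor:
    """Sign in the forward pass, identity (straight-through) gradient in the backward pass."""
    b = torch.where(t > 0, torch.ones_like(t), -torch.ones_like(t))
    return (b - t).detach() + t

class IdentityGate(nn.Module):
    """R_i = I: a multiplicative gate that is identically one (no rescaling)."""
    def forward(self, x):
        return x.new_ones(())

class BinaryBlockHalf(nn.Module):
    """One half of the block in Eq. (6); the full block stacks two such halves.
    With all gates left as IdentityGate this reduces to the Base-Residual block."""
    def __init__(self, channels, gates=None):
        super().__init__()
        gates = gates or {}
        self.R = nn.ModuleDict({f"R{i}": gates.get(f"R{i}", IdentityGate())
                                for i in range(1, 6)})
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.05)
        self.prelu = nn.PReLU(channels)
        self.a = nn.Parameter(torch.ones(1, channels, 1, 1))    # a_t
        self.b = nn.Parameter(torch.zeros(1, channels, 1, 1))   # b_t

    def forward(self, x):
        alpha = self.weight.abs().mean(dim=(1, 2, 3), keepdim=True)  # Eq. (5)
        wb = alpha * ste_sign(self.weight)                            # Eq. (4)
        xb = ste_sign(x * self.R["R1"](x))                            # RescalePre position
        o = self.a * nn.functional.conv2d(xb, wb, padding=1) + self.b
        out = self.prelu(o) * self.R["R2"](o) * self.R["R3"](x) * self.R["R4"](x)
        return out + x * self.R["R5"](x)
```

Choosing which gates are non-identity reproduces the modifications studied below; for example, a single non-identity R3 gives a Single-style block, while gates on both the convolution output and the residual branch (e.g. the SERescale module from the sketch above) give a Both-style variant.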
3.4 Complexity Analysis
We evaluate the complexity of the utilized models by the number of multiply-accumulate (MAC) operations. Each binary MAC operation is counted as 1/64 of a full-precision MAC: MACs^b = MACs^f / 64.
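As a small illustration (the combination rule below is inferred from Table 6, where the total equals MACs^f plus the binary part weighted by 1/64, e.g. 78.13 + 67.95/64 ≈ 79.19):

```python
def effective_macs(binary_macs: float, fp_macs: float) -> float:
    """Total cost with each binary MAC counted as 1/64 of a full-precision MAC."""
    return fp_macs + binary_macs / 64.0

print(effective_macs(67.95, 78.13))  # ~79.19, the Base-Residual x4 row of Table 6
```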
4 Experiments
We train models on the DIV2K [2] dataset, which contains 800 High Resolution images, and evaluate them on 4 benchmark datasets: Set5 [4], Set14 [26], B100 [19] and Urban100 [10]. Experiments are conducted using two different backbones: EDSR [16] and SRResNet [14]. Binarized EDSR is applied for the ×2 scale and is trained using patches of size 96 × 96. Binarized SRResNet is applied for the ×4 scale and is trained using patches of size 128 × 128.

4.1 Block Modification Analysis
RescalePre vs. RescalePost. First, we investigate the importance of rescaling the inputs and outputs of the binary convolution, proposing the RescalePre and RescalePost blocks. The RescalePre block is obtained from Eq. (6) by substituting R1 with one of the rescaling blocks and setting R_i = I for i ∈ {2, …, 5}, where I is the identity operator. For the RescalePost block, R2 ∈ {SE, Spat, Chan} is one of the rescaling blocks and the other R_i = I. The comparison of the RescalePre and RescalePost results presented in Table 2 is consistent with previous research on the importance of scaling the output of the binary branch. The RescalePre block does not affect the forward pass of the model
Table 2. Quantitative comparison of RescalePre and RescalePost blocks' results. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. The suffix of a modification name indicates the rescaling block that is used. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
RescalePre-Channel | 31.56 (−0.45) / 0.887 (−0.005) | 28.19 (−0.28) / 0.772 (−0.006) | 27.30 (−0.17) / 0.726 (−0.006) | 25.27 (−0.43) / 0.757 (−0.015)
RescalePre-SE | 31.55 (−0.46) / 0.886 (−0.006) | 28.17 (−0.30) / 0.772 (−0.006) | 27.28 (−0.19) / 0.726 (−0.006) | 25.24 (−0.46) / 0.755 (−0.017)
RescalePre-Spatial | 31.53 (−0.48) / 0.886 (−0.006) | 28.18 (−0.29) / 0.772 (−0.006) | 27.29 (−0.18) / 0.726 (−0.006) | 25.24 (−0.46) / 0.756 (−0.016)
RescalePost-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.47 (−0.23) / 0.764 (−0.008)
RescalePost-SE | 31.78 (−0.23) / 0.890 (−0.002) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
RescalePost-Spatial | 31.66 (−0.35) / 0.888 (−0.004) | 28.26 (−0.21) / 0.774 (−0.004) | 27.33 (−0.14) / 0.727 (−0.005) | 25.36 (−0.34) / 0.760 (−0.012)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
RescalePre-Channel | 37.49 (−0.56) / 0.974 (−0.004) | 32.98 (−0.59) / 0.963 (−0.003) | 31.77 (−0.42) / 0.977 (−0.001) | 30.55 (−1.49) / 0.904 (−0.007)
RescalePre-SE | 37.28 (−0.77) / 0.973 (−0.005) | 32.83 (−0.74) / 0.963 (−0.003) | 31.62 (−0.57) / 0.976 (−0.002) | 30.06 (−1.98) / 0.902 (−0.009)
RescalePre-Spatial | 37.61 (−0.44) / 0.975 (−0.003) | 33.06 (−0.51) / 0.963 (−0.003) | 31.82 (−0.37) / 0.977 (−0.001) | 30.69 (−1.35) / 0.905 (−0.006)
RescalePost-Channel | 37.65 (−0.40) / 0.975 (−0.003) | 33.06 (−0.51) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.73 (−1.31) / 0.905 (−0.006)
RescalePost-SE | 37.61 (−0.44) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.87 (−1.17) / 0.906 (−0.005)
RescalePost-Spatial | 37.63 (−0.42) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.87 (−0.32) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
when using the regular Sign function, having influence only on the backward pass. On the other hand, the RescalePost block prevents the information of the two branches from overlapping, similarly to the scaling in BBCU [24]. Thus, the RescalePost modification shows better performance by providing a more uniform flow of binary and full-precision information.

Single vs. RescalePost. The output of the convolution branch can contain unnecessary information that should not be propagated to the next layer. Thus, we add a rescaling block applied to the PReLU output before it is combined with the residual information, obtaining the modification referred to as the Single block. It can be expressed using Eq. (6) by taking R3 ∈ {SE, Spat, Chan} and R_i = I for i ≠ 3. The contrast between the results of the RescalePost and Single blocks presented in Table 3 displays the importance of applying rescaling based on the full-precision input. The RescalePost block performs rescaling based on the result of the binary branch itself, having no direct dependence on the input of the block; the Single block, in contrast, constructs its attention map from the input, thus restricting unnecessary information before it propagates further into the block.

Single vs. Dual. The Dual block explores the possibility of combining different rescaling mechanisms to extract various features from the binary information. Its definition is quite similar to that of the Single block, except that both R3 ∈ {SE, Spat, Chan} and R4 ∈ {SE, Spat, Chan}. The quantitative results presented in Table 4 demonstrate that applying two different rescaling mechanisms to the output simultaneously (Dual block) performs worse than applying just one mechanism (Single block). The reason is the different purpose of the rescaling blocks: spatial rescaling aims to learn inter-channel dependencies for each pixel, while channel and SE rescaling disregard
Table 3. Quantitative comparison of Single and RescalePost blocks' results. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. The suffix of a modification name indicates the rescaling block that is used. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
RescalePost-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.47 (−0.23) / 0.764 (−0.008)
RescalePost-SE | 31.78 (−0.23) / 0.890 (−0.002) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
RescalePost-Spatial | 31.66 (−0.35) / 0.888 (−0.004) | 28.26 (−0.21) / 0.774 (−0.004) | 27.33 (−0.14) / 0.727 (−0.005) | 25.36 (−0.34) / 0.760 (−0.012)
Single-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Single-SE | 31.74 (−0.27) / 0.889 (−0.003) | 28.26 (−0.21) / 0.773 (−0.005) | 27.35 (−0.12) / 0.728 (−0.004) | 25.43 (−0.27) / 0.763 (−0.009)
Single-Spatial | 31.68 (−0.33) / 0.888 (−0.004) | 28.29 (−0.18) / 0.774 (−0.004) | 27.34 (−0.13) / 0.727 (−0.005) | 25.38 (−0.32) / 0.760 (−0.012)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
RescalePost-Channel | 37.65 (−0.40) / 0.975 (−0.003) | 33.06 (−0.51) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.73 (−1.31) / 0.905 (−0.006)
RescalePost-SE | 37.61 (−0.44) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.87 (−1.17) / 0.906 (−0.005)
RescalePost-Spatial | 37.63 (−0.42) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.87 (−0.32) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
Single-Channel | 37.64 (−0.41) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.78 (−1.26) / 0.905 (−0.006)
Single-SE | 37.62 (−0.43) / 0.976 (−0.002) | 33.05 (−0.52) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.80 (−1.24) / 0.906 (−0.005)
Single-Spatial | 37.65 (−0.40) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
the spatial position. Combining these blocks can overcomplicate the information and slow down the training process.

Single vs. Both. In the baseline block, the output of the binary convolution and the activations from the previous layer are added to each other and have the same influence on the result. However, the full-precision residual connection carries information that cannot be effectively processed by the binary convolution due to its simple structure. Therefore, we need a method to extract the most important features from the previous layer. Hence, we add a rescaling block to the residual branch of the convolutional block, obtaining the most advanced modification, the Both block. The Both block is obtained from Eq. (6) by replacing R4 ∈ {SE, Spat, Chan} and R5 ∈ {SE, Spat, Chan} with various rescaling blocks and by setting R_i = I for the other i. The corresponding results are presented in Table 5. The superior performance of the Both block shows that both the residual and the binary branches require information restriction based on the full-precision input.
Table 4. Quantitative comparison of Single and Dual blocks' results. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. The suffix of a Single block name indicates the rescaling block that is used; for the Dual block, the first name refers to the block applied to the residual connection and the second to the block applied to the output of the PReLU function. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
Single-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Single-SE | 31.74 (−0.27) / 0.889 (−0.003) | 28.26 (−0.21) / 0.773 (−0.005) | 27.35 (−0.12) / 0.728 (−0.004) | 25.43 (−0.27) / 0.763 (−0.009)
Single-Spatial | 31.68 (−0.33) / 0.888 (−0.004) | 28.29 (−0.18) / 0.774 (−0.004) | 27.34 (−0.13) / 0.727 (−0.005) | 25.38 (−0.32) / 0.760 (−0.012)
Dual-Channel-Spatial | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Dual-SE-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.27 (−0.20) / 0.773 (−0.005) | 27.36 (−0.11) / 0.728 (−0.004) | 25.46 (−0.24) / 0.764 (−0.008)
Dual-SE-Spatial | 31.74 (−0.27) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
Single-Channel | 37.64 (−0.41) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.78 (−1.26) / 0.905 (−0.006)
Single-SE | 37.62 (−0.43) / 0.976 (−0.002) | 33.05 (−0.52) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.80 (−1.24) / 0.906 (−0.005)
Single-Spatial | 37.65 (−0.40) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
Dual-Channel-Spatial | 37.60 (−0.45) / 0.976 (−0.002) | 33.10 (−0.47) / 0.964 (−0.002) | 31.84 (−0.35) / 0.977 (−0.001) | 30.85 (−1.19) / 0.907 (−0.004)
Dual-SE-Channel | 37.60 (−0.45) / 0.975 (−0.003) | 33.02 (−0.55) / 0.964 (−0.002) | 31.79 (−0.40) / 0.977 (−0.001) | 30.70 (−1.34) / 0.905 (−0.006)
Dual-SE-Spatial | 37.62 (−0.43) / 0.977 (−0.001) | 33.08 (−0.49) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.83 (−1.21) / 0.906 (−0.005)
Table 5. Quantitative comparison of Single and Both blocks' results. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. The suffix of a Single block name indicates the rescaling block that is used; for the Both block, the first name refers to the block applied to the residual connection and the second to the block applied to the output of the PReLU function. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
Single-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Single-SE | 31.74 (−0.27) / 0.889 (−0.003) | 28.26 (−0.21) / 0.773 (−0.005) | 27.35 (−0.12) / 0.728 (−0.004) | 25.43 (−0.27) / 0.763 (−0.009)
Single-Spatial | 31.68 (−0.33) / 0.888 (−0.004) | 28.29 (−0.18) / 0.774 (−0.004) | 27.34 (−0.13) / 0.727 (−0.005) | 25.38 (−0.32) / 0.760 (−0.012)
Both-Channel-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.26 (−0.21) / 0.773 (−0.005) | 27.35 (−0.12) / 0.728 (−0.004) | 25.44 (−0.26) / 0.763 (−0.009)
Both-Channel-SE | 31.76 (−0.25) / 0.889 (−0.003) | 28.27 (−0.20) / 0.774 (−0.004) | 27.35 (−0.12) / 0.728 (−0.004) | 25.46 (−0.24) / 0.763 (−0.009)
Both-Channel-Spatial | 31.48 (−0.53) / 0.885 (−0.007) | 28.12 (−0.35) / 0.770 (−0.008) | 27.26 (−0.21) / 0.725 (−0.007) | 25.19 (−0.51) / 0.753 (−0.019)
Both-SE-Channel | 31.74 (−0.27) / 0.889 (−0.003) | 28.26 (−0.21) / 0.774 (−0.004) | 27.35 (−0.12) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Both-SE-SE | 31.74 (−0.27) / 0.889 (−0.003) | 28.27 (−0.20) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.764 (−0.008)
Both-SE-Spatial | 31.46 (−0.55) / 0.884 (−0.008) | 28.11 (−0.36) / 0.770 (−0.008) | 27.25 (−0.22) / 0.725 (−0.007) | 25.16 (−0.54) / 0.753 (−0.019)
Both-Spatial-Channel | 31.50 (−0.51) / 0.885 (−0.007) | 28.13 (−0.34) / 0.770 (−0.008) | 27.26 (−0.21) / 0.725 (−0.007) | 25.16 (−0.54) / 0.753 (−0.019)
Both-Spatial-SE | 31.55 (−0.46) / 0.886 (−0.006) | 28.15 (−0.32) / 0.771 (−0.007) | 27.28 (−0.19) / 0.726 (−0.006) | 25.20 (−0.50) / 0.755 (−0.017)
Both-Spatial-Spatial | 31.37 (−0.64) / 0.883 (−0.009) | 28.07 (−0.40) / 0.769 (−0.009) | 27.21 (−0.26) / 0.723 (−0.009) | 25.10 (−0.60) / 0.750 (−0.022)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
Single-Channel | 37.64 (−0.41) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.78 (−1.26) / 0.905 (−0.006)
Single-SE | 37.62 (−0.43) / 0.976 (−0.002) | 33.05 (−0.52) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.80 (−1.24) / 0.906 (−0.005)
Single-Spatial | 37.65 (−0.40) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
Both-Channel-Channel | 37.61 (−0.44) / 0.974 (−0.004) | 33.10 (−0.47) / 0.962 (−0.004) | 31.84 (−0.35) / 0.977 (−0.001) | 30.87 (−1.17) / 0.905 (−0.006)
Both-Channel-SE | 37.57 (−0.48) / 0.974 (−0.004) | 33.09 (−0.48) / 0.963 (−0.003) | 31.83 (−0.36) / 0.977 (−0.001) | 30.93 (−1.11) / 0.905 (−0.006)
Both-Channel-Spatial | 37.56 (−0.49) / 0.975 (−0.003) | 33.14 (−0.43) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 31.11 (−0.93) / 0.905 (−0.006)
Both-SE-Channel | 37.57 (−0.48) / 0.973 (−0.005) | 33.10 (−0.47) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.88 (−1.16) / 0.905 (−0.006)
Both-SE-SE | 37.62 (−0.43) / 0.975 (−0.003) | 33.12 (−0.45) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 30.98 (−1.06) / 0.906 (−0.005)
Both-SE-Spatial | 37.60 (−0.45) / 0.976 (−0.002) | 33.11 (−0.46) / 0.964 (−0.002) | 31.88 (−0.31) / 0.977 (−0.001) | 30.99 (−1.05) / 0.906 (−0.005)
Both-Spatial-Channel | 37.67 (−0.38) / 0.975 (−0.003) | 33.16 (−0.41) / 0.964 (−0.002) | 31.89 (−0.30) / 0.977 (−0.001) | 31.02 (−1.02) / 0.907 (−0.004)
Both-Spatial-SE | 37.68 (−0.37) / 0.976 (−0.002) | 33.16 (−0.41) / 0.964 (−0.002) | 31.90 (−0.29) / 0.977 (−0.001) | 31.09 (−0.95) / 0.907 (−0.004)
Both-Spatial-Spatial | 37.68 (−0.37) / 0.976 (−0.002) | 33.21 (−0.36) / 0.964 (−0.002) | 31.91 (−0.28) / 0.977 (−0.001) | 30.96 (−1.08) / 0.906 (−0.005)
Table 6. Complexity analysis of the best proposed modifications in each group. MACs are computed for an input image of size 320 × 180.

Modification | MACs^b | MACs^f | MACs
x4 scale:
SRResNet (FP) | 0 | 146.08 | 146.08
Base-Residual | 67.95 | 78.13 | 79.19
RescalePre-Channel | 67.95 | 79.29 | 80.35
RescalePost-Channel | 67.95 | 79.29 | 80.35
Single-Channel | 67.95 | 79.29 | 80.35
Dual-Channel-Spatial | 67.95 | 79.4 | 80.46
Both-SE-SE | 67.95 | 78.37 | 79.43
x2 scale:
EDSR (FP) | 0 | 79.06 | 79.06
Base-Residual | 67.95 | 11.11 | 12.17
RescalePre-Spatial | 67.95 | 11.23 | 12.29
RescalePost-Spatial | 67.95 | 11.23 | 12.29
Single-Spatial | 67.95 | 11.23 | 12.29
Dual-Channel-Spatial | 67.95 | 12.39 | 13.45
Both-Spatial-Spatial | 67.95 | 11.35 | 12.41
Table 7. Quantitative comparison of the best proposed modifications in each group. Models are evaluated using PSNR and SSIM metrics, calculated on 4 benchmark datasets: Set5, Set14, B100, Urban100. The suffix of a RescalePre, RescalePost or Single block name indicates the rescaling block that is used; for the Dual and Both blocks, the first name refers to the block applied to the residual connection and the second to the block applied to the output of the PReLU function. Each cell gives PSNR (drop vs. the full-precision backbone) / SSIM (drop).

Modification | Set5 | Set14 | B100 | Urban100
x4 scale:
SRResNet (FP) | 32.01 / 0.892 | 28.47 / 0.778 | 27.47 / 0.732 | 25.70 / 0.772
BBCU [24] | 31.72 (−0.29) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.35 (−0.12) / 0.728 (−0.004) | 25.37 (−0.33) / 0.761 (−0.011)
Base-Residual | 31.68 (−0.33) / 0.888 (−0.004) | 28.27 (−0.20) / 0.774 (−0.004) | 27.35 (−0.12) / 0.728 (−0.004) | 25.38 (−0.32) / 0.761 (−0.011)
RescalePre-Channel | 31.56 (−0.45) / 0.887 (−0.005) | 28.19 (−0.28) / 0.772 (−0.006) | 27.30 (−0.17) / 0.726 (−0.006) | 25.27 (−0.43) / 0.757 (−0.015)
RescalePost-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.47 (−0.23) / 0.764 (−0.008)
Single-Channel | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Dual-Channel-Spatial | 31.76 (−0.25) / 0.889 (−0.003) | 28.28 (−0.19) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.763 (−0.009)
Both-SE-SE | 31.74 (−0.27) / 0.889 (−0.003) | 28.27 (−0.20) / 0.774 (−0.004) | 27.36 (−0.11) / 0.728 (−0.004) | 25.45 (−0.25) / 0.764 (−0.008)
x2 scale:
EDSR (FP) | 38.05 / 0.978 | 33.57 / 0.966 | 32.19 / 0.978 | 32.04 / 0.911
BBCU [24] | 37.62 (−0.43) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.84 (−0.35) / 0.977 (−0.001) | 30.73 (−1.31) / 0.905 (−0.006)
Base-Residual | 37.60 (−0.45) / 0.976 (−0.002) | 33.07 (−0.50) / 0.963 (−0.003) | 31.85 (−0.34) / 0.977 (−0.001) | 30.80 (−1.24) / 0.905 (−0.006)
RescalePre-Spatial | 37.61 (−0.44) / 0.975 (−0.003) | 33.06 (−0.51) / 0.963 (−0.003) | 31.82 (−0.37) / 0.977 (−0.001) | 30.69 (−1.35) / 0.905 (−0.006)
RescalePost-Spatial | 37.63 (−0.42) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.87 (−0.32) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
Single-Spatial | 37.65 (−0.40) / 0.976 (−0.002) | 33.12 (−0.45) / 0.963 (−0.003) | 31.88 (−0.31) / 0.977 (−0.001) | 30.89 (−1.15) / 0.906 (−0.005)
Dual-Channel-Spatial | 37.60 (−0.45) / 0.976 (−0.002) | 33.10 (−0.47) / 0.964 (−0.002) | 31.84 (−0.35) / 0.977 (−0.001) | 30.85 (−1.19) / 0.907 (−0.004)
Both-Spatial-Spatial | 37.68 (−0.37) / 0.976 (−0.002) | 33.21 (−0.36) / 0.964 (−0.002) | 31.91 (−0.28) / 0.977 (−0.001) | 30.96 (−1.08) / 0.906 (−0.005)
5 Conclusion
This paper proposes several binary block modifications that improve the performance of Binary Neural Networks in the Single Image Super Resolution task. The best results for each rescaling structure and their computational complexity are given in Table 7 and Table 6, respectively. According to the results, the rescaling block is an essential component of the BNN, as it helps to restrict unnecessary information and provides the network with a more effective training process that leads to better final performance. The quantitative results support this hypothesis and show that binary neural networks actually suffer from an abundance of input information and perform better when it is restricted. Future research can be aimed at implementing different rescaling blocks into the BNN and at further experiments on restricting information using lightweight full-precision parts.
References 1. Aakerberg, A., Nasrollahi, K., Moeslund, T.B.: Real-world super-resolution of faceimages from surveillance cameras. IET Image Process. 16(2), 442–452 (2022) 2. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image superresolution: dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135 (2017) 3. Ahn, D., Kim, H., Kim, T., Park, E., Kim, J.J.: Searching for robust binary neural networks via bimodal parameter perturbation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2410–2419 (2023) 4. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012) 5. Dharejo, F.A., et al.: Multimodal-boost: multimodal medical image superresolution using multi-attention network with wavelet transform. IEEE/ACM Trans. Comput. Biol. Bioinform. (2022) 6. Esser, S.K., McKinstry, J.L., Bablani, D., Appuswamy, R., Modha, D.S.: Learned step size quantization. arXiv preprint arXiv:1902.08153 (2019) 7. Falkena, S., Jamali-Rad, H., van Gemert, J.: LAB: learnable activation binarizer for binary neural networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6425–6434 (2023) 8. Guo, N., Bethge, J., Meinel, C., Yang, H.: Join the high accuracy club on ImageNet with a binary neural network ticket. arXiv preprint arXiv:2211.12933 (2022) 9. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) 10. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206 (2015) 11. Lang, Z., Zhang, L., Wei, W.: E2FIF: push the limit of binarized deep imagery super-resolution using end-to-end full-precision information flow. arXiv preprint arXiv:2207.06893 (2022) 12. Le, H., Høier, R.K., Lin, C.T., Zach, C.: AdaSTE: an adaptive straight-through estimator to train binary neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 460–469 (2022)
13. Le, P.H.C., Li, X.: BinaryViT: pushing binary vision transformers towards convolutional models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4664–4673 (2023) 14. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690 (2017) 15. Li, K., et al.: Local means binary networks for image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. (2022) 16. Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 136–144 (2017) 17. Liu, Z., Shen, Z., Savvides, M., Cheng, K.-T.: ReActNet: towards precise binary neural network with generalized activation functions. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part XIV. LNCS, vol. 12359, pp. 143– 159. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6 9 18. Ma, Y., Xiong, H., Hu, Z., Ma, L.: Efficient super resolution using binarized neural network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019) 19. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, vol. 2, pp. 416–423. IEEE (2001) 20. Qin, H., et al.: Forward and backward information retention for accurate binary neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2250–2259 (2020) 21. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part IV. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0 32 22. Wang, Z., Chen, J., Hoi, S.C.: Deep learning for image super-resolution: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3365–3387 (2020) 23. Wei, R., et al.: EBSR: enhanced binary neural network for image super-resolution. arXiv preprint arXiv:2303.12270 (2023) 24. Xia, B., et al.: Basic binary convolution unit for binarized image restoration network. arXiv preprint arXiv:2210.00405 (2022) 25. Xue, P., Lu, Y., Chang, J., Wei, X., Wei, Z.: IR2Net: information restriction and information recovery for accurate binary neural networks. arXiv preprint arXiv:2210.02637 (2022) 26. Zeyde, Roman, Elad, Michael, Protter, Matan: On single image scale-up using sparse-representations. In: Boissonnat, J.-D., et al. (eds.) Curves and Surfaces 2010. LNCS, vol. 6920, pp. 711–730. Springer, Heidelberg (2012). https://doi.org/10. 1007/978-3-642-27413-8 47 27. Zhang, Y., Chen, H., Chen, X., Deng, Y., Xu, C., Wang, Y.: Data-free knowledge distillation for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7852–7861 (2021) 28. Zhang, Y., Wang, H., Qin, C., Fu, Y.: Learning efficient image super-resolution networks via structure-regularized pruning. In: International Conference on Learning Representations (2021)
Estimating the Transfer Learning Ability of Deep Neural Networks by Means of Representations
German I. Magai1 and Artem A. Soroka2(B)
1 HSE University, Moscow, Russia
2 National Research Nuclear University MEPhI, Moscow, Russia
[email protected]
Abstract. The basis of transfer learning methods is the ability of deep neural networks to use knowledge from one domain to learn in another domain. However, another important task is the analysis and explanation of the internal representations of deep neural networks models in the process of transfer learning. Some deep models are known to be better at transferring knowledge than others. In this research, we apply the Centered Kernel Alignment (CKA) method to analyze the internal representations of deep neural networks and propose a method to evaluate the ability of a neural network architecture to transfer knowledge based on the quantitative change in representations during the learning process. We introduce the Transfer Ability Score (TAs) measure to assess the ability of an architecture to effectively transfer learning. We test our approach using Vision Transformer (ViT-B/16) and CNN (ResNet, DenseNet) architectures in computer vision tasks in several datasets, including medical images. Our work is an attempt to explain the transfer learning process. Keywords: Transfer Learning · Knowledge Representation · Neural Networks
1 Introduction

Excellent results of deep learning models are mainly achieved by fine-tuning models that are pre-trained on Big Data. Knowledge transfer is one of the key approaches to achieving high performance: models learn to transfer and generalize knowledge in one data field (the target domain) using information obtained in another one (the source domain). Despite the prevalence of transfer learning techniques, we do not fully understand the role of feature representations in knowledge transfer. An important task is the explanation and interpretation of internal feature representations during the transfer of knowledge. In our work, we consider the problem of estimating the efficiency of knowledge transfer and analyze the process of transfer learning from the point of view of the similarity of feature representations. The contribution of this research can be divided into several points. First, we propose a method for evaluating the ability of a particular deep neural network (DNN) architecture to transfer knowledge to a new domain. We also compare
different DNN architectures (ViTs and CNNs) on different tasks in terms of their ability to transfer knowledge, and we explore the dynamics of the similarity of internal feature representations in the process of fine-tuning.
2 Related Work

The task of interpreting and explaining the internal representations of DNN models has been considered in many works. In the field of neural representation similarity estimation, the Centered Kernel Alignment method is the most common. CKA was first applied to representation similarity analysis in [1], where the block structure of CNN models was demonstrated; however, [2] describes the shortcomings of CKA. The paper [3] observes fundamental differences between self-attention-based architectures and CNNs in terms of similarity of representations, and [4] explores the effects of depth and width on the structure of internal representations. In [5–7] it is argued that ViT has better transfer learning performance than CNN in the medical imaging task. There are different methods for assessing knowledge transfer [8–12]: H-score assesses transferability by solving the HGR maximum correlation problem; NCE uses conditional entropy to assess transfer ability; the LEEP score is determined by the log-likelihood of the expected empirical predictor that predicts the dummy label distributions for the target data in the source label space.
3 Methodology

3.1 Deep Neural Networks and Transfer Learning

Let us define the source domain D_S = {X_S, P_S(x^S)} as a pair of a source image space X_S, from which the data x_1^S, …, x_{n_S}^S are obtained, and a marginal distribution P_S(x^S). The source task T_S of the pre-trained DNN f is the approximation of the posterior label distribution P_S(Y_S | x^S), which is used to infer the label of an input image x^S ∈ X_S. Let D_T = {X_T, P_T(x^T)} be the target domain, which includes the space X_T of target images and their marginal distribution P_T(x^T). On the domain D_T we consider the target of constructing a mapping F : X_T → Y_T based on the training task T_T, which is the construction of a training set {(x_1^T, y_1^T), …, (x_n^T, y_n^T)} of size n, where the images x_1^T, …, x_n^T, x_i^T ∈ X_T, are obtained from the distribution P_T(x^T), y_i^T ∈ Y_T is the label of the i-th image, i = 1, …, n, Y_T = {y_(1)^T, …, y_(K)^T} is the set of labels, and |Y_T| = K. One way to build the mapping F is to train a neural network (e.g. a deep CNN, RNN, or ViT) on X_T. The deep neural network DNN_θ(x_i) = y_i is a non-linear parametric function that defines a mapping from the space of examples x_i to the class labels y. The neural network is defined by a composition of functions DNN = softmax ∘ f_L ∘ … ∘ f_1, where the functions f_i, 1 ≤ i ≤ L, are called layer functions, θ is the set of parameters, softmax is the final activation function, and L is the number of layers. The design paradigms of modern DNN architectures are divided into architectures based on convolution (CNN) and on self-attention (ViT) [13]. Due to the large number of existing DNN architectures, the question arises as to whether each one is suitable for efficient transfer learning. This paper proposes a method to address this question.
3.2 Centered Kernel Alignment (CKA) Analysis of Representations

Let X be the distribution of neural activations of layer l_x of the DNN model and Y be the distribution of activations on layer l_y. In practice, X, Y ∈ R^{n×d} are batches of neural activations on layers with d neurons in response to a batch of n examples. The Hilbert–Schmidt Independence Criterion (HSIC) is a kernel method for evaluating the statistical relationship between the random variables X and Y [14]. Let {x_1, …, x_n} and {y_1, …, y_n} be samples from X and Y, let k and l be kernels, and let the kernel matrices be K_ij = k(x_i, x_j) = x_i^T x_j and L_ij = l(y_i, y_j) = y_i^T y_j. HSIC can be calculated as follows:

$$\mathrm{HSIC}(X, Y) = \frac{1}{(n-1)^2}\,\mathrm{tr}(KL) = \left\| \mathrm{cov}\!\left(X^{T}, Y^{T}\right) \right\|_{F}^{2} \qquad (1)$$

where tr is the matrix trace, cov is the covariance, and ||·||_F is the Frobenius norm. CKA(X, Y) is based on HSIC and shows how similar X and Y are to each other:

$$\mathrm{CKA}(X, Y) = \frac{\mathrm{HSIC}(X, Y)}{\sqrt{\mathrm{HSIC}(X, X)\,\mathrm{HSIC}(Y, Y)}} \qquad (2)$$
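A minimal NumPy sketch of linear-kernel CKA as used here (the (n−1)^−2 factors of Eq. (1) cancel in the ratio of Eq. (2)):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear-kernel CKA between activation matrices X (n x d1) and Y (n x d2),
    following Eqs. (1)-(2)."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic_xy = np.linalg.norm(Y.T @ X, "fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, "fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```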
For all layers of the DNN model, we calculate pairwise CKA and build a CKA matrix M, where M[i, j] is the value of CKA(X, Y) between the feature representations X and Y on layers i and j, respectively. There are other similarity measures: LinearReg [15], SVCCA [16], PWCCA [17], RTD [18], etc. In many DNN models, one can observe a block structure of internal representations in the CKA matrix, which indicates the dissimilarity of the features of the first layers to those of the last layers [1, 3].

3.3 Transfer Ability Score

An important task is to develop a method for assessing the ability of a DNN to transfer knowledge. We propose the Transferability score (TAs), a measure of the ability of a DNN to transfer knowledge to a new domain. Consider the problem of transferring knowledge by a model with architecture A_k from the source domain D_S to the target domain D_T. The adaptation to D_T can be interpreted via the evolution of the feature space on different layers. A slight change of the feature representations on different layers during fine-tuning on D_T indicates that the DNN has a high ability to transfer knowledge to the new domain. In contrast, a significant change shows that the information extracted from D_S is not enough to generalize knowledge to the new domain D_T, or that the domains are very different and a substantial change of the learned feature representations is required. A low TAs value therefore indicates a smaller change of parameters during DNN training. Let {X_m}_{m=1}^{n} = {X_1, X_2, …, X_n} be the set of representations of the model DNN_X with n layers trained on D_S, and {Y_m}_{m=1}^{n} = {Y_1, Y_2, …, Y_n} be the set of representations of the model DNN_Y with n layers fine-tuned on D_T. Let us define the CKA matrix M_1, where m^1_{ij} is the value of CKA(X_i, X_j) between the X representations on layers i and j, and the CKA matrix M_2, where m^2_{ij} is the value of CKA(X_i, Y_j) between the X and Y representations on layers i and j, respectively. Let us denote M′ = M_1 − M_2; M′ shows how much the representations on different layers have changed after fine-tuning on the target domain, and m′_{ij} is the (i, j)-th element of M′. We estimate the ability of a model with architecture A_k to transfer knowledge (the Transferability score, TAs) from the domain D_S to the domain D_T via the quantitative change of the feature space after fine-tuning, and define it as

$$\mathrm{TAs} = \sum_{i,j=1}^{n} \left| m'_{ij} \right| / n^2 .$$

The values m′_{ij} show the absolute change of the similarities of representations. The lower the Transferability score, the greater the DNN model's ability to transfer knowledge. In other words, the Transferability score for architecture A_k is the difference between the representation similarity matrices for ({x_m^s, y_m^s}_{m=1}^{n}, θ^s) and ({x_m^t, y_m^t}_{m=1}^{n}, θ^t). In addition, the M′ matrix provides a visual understanding of how much the similarity of representations on different layers of the DNN has changed after fine-tuning on data from D_T.
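A short NumPy sketch of the score (a compact linear-CKA helper is inlined; per-layer activation matrices are assumed to be given as lists of arrays of shape (n_examples, n_neurons)):

```python
import numpy as np

def _linear_cka(X, Y):
    # linear-kernel CKA as in Eqs. (1)-(2)
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

def transferability_score(source_reprs, finetuned_reprs):
    """TAs = (1/n^2) * sum_ij |M1_ij - M2_ij|: M1 compares the pre-trained model's layers
    with each other, M2 compares them with the fine-tuned model's layers; lower is better."""
    n = len(source_reprs)
    m1 = np.array([[_linear_cka(a, b) for b in source_reprs] for a in source_reprs])
    m2 = np.array([[_linear_cka(a, b) for b in finetuned_reprs] for a in source_reprs])
    return np.abs(m1 - m2).sum() / n ** 2
```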
4 Experiments

4.1 Experimental Setup

Target domain (D_T) adaptation is the use of the features learned by the deep neural network in the D_S domain to fit the distribution of the training sample in the D_T domain. We test ResNet-18, ResNet-34, ResNet-50, ResNet-101, DenseNet-121, DenseNet-161 and ViT-B/16 architecture models pre-trained on ImageNet-1k [19]. When fine-tuning a DNN, we do not freeze intermediate layers, i.e. all layers remain trainable. We analyze the ability of various DNN models to transfer knowledge to a new target domain on several datasets: EuroSAT (ESAT) [20], PatchCamelyon (PCAM) [21], the Stanford Cars dataset [22], Describable Textures (DTD) [23], and CIFAR-10 [24]. Figure 1 shows examples of images from the various datasets. For DNN training we used the Adam stochastic optimizer with lr = 5·10^−5 and batch size = 32.
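A short sketch of this setup for one of the backbones (torchvision's pre-trained ResNet-50 is used here purely for illustration; the head size and dataset wiring are placeholders):

```python
import torch
import torchvision

def build_finetune_model(num_classes: int) -> torch.nn.Module:
    """ImageNet-1k pre-trained backbone with a new classification head;
    all layers stay trainable (no freezing)."""
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_finetune_model(num_classes=10)                 # e.g. CIFAR-10
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)    # Adam, lr = 5e-5, batch size 32
```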
Fig. 1. Examples of datasets. a) ImageNet-1k, b) CIFAR10, c) Stanford cars, d) Describable Textures, e) EuroSAT, f) PatchCamelyon.
4.2 Estimation of Deep Models' Transfer Learning Abilities

Comparing Deep Models from the Transfer Learning Ability Perspective. As a result of our empirical research, we conclude that DNN architectures based on the self-attention mechanism (ViT) have a higher ability to transfer knowledge to new domains than CNN architectures. It can also be observed visually that the CKA heatmap of ViT shows
smaller changes in the feature representations than that of ResNet-50 (Figs. 2, 3). This explains why ViT is more successful at transfer learning than ResNet-50: the representations of the pre-trained ViT are of sufficient quality and do not need to be changed much. The depth of the model also affects the ability to adapt to the new domain. Figure 2 shows a comparison of CKA heatmaps for different modern DNN architectures.
Fig. 2. Comparison of CKA heatmaps of different architectures. The top line is the CKA heatmap on the ImageNet dataset, the middle line is the CKA heatmap for a DNN fine-tuned on CIFAR-10, and the bottom line is the M' map showing the difference between the heatmaps, i.e. how much the representations changed after knowledge transfer to the CIFAR-10 domain.
In the case of knowledge transfer in networks with the ResNet architecture, significant changes are observed in the deeper layers. This is because in this architecture there is a transition, from the first layers to the last, from local (general) features to class-specific ones, which is consistent with [7]. In addition, the shortcut links provide a direct connection only between neighbouring blocks along the depth. In contrast, in DenseNet-121 we observe changes in all layers, because its residual connections link distant layers throughout the depth. The TAs indicator points to more significant changes in DenseNet-121 caused by these deep residual connections, through which backpropagation reaches the early layers more easily during training. In ViT [13], the attention heads of the first and intermediate blocks extract both local and global features, while in a CNN the first layers have access only to local (general) features,
and the deeper layers to global (class-specific) features. In ViT we observe only a slight change in representations: when trained on DS, the ViT model extracts more complete information from the large dataset and generalizes better, so when adapted to DT the adaptation of the feature space is not as significant, which is consistent with its lower TAs value. Table 1 summarizes the comparison of all architectures in the experiments in terms of TAs and test accuracy.
Fig. 3. Differences in representations (the M' heatmaps) for different target domains. The top line is the ResNet-50 architecture for the EuroSAT, PatchCamelyon, Stanford Cars and Describable Textures datasets; the bottom line is ViT-B/16 for the same datasets.
The success of transfer learning depends on the similarity between DS and DT: the more similar the data, the more effective the transfer of knowledge [10]. Figure 3 shows the difference between the CKA maps of the source and fine-tuned models for different target domains DT for the ResNet-50 and ViT-B/16 architectures. The ImageNet DS partially includes information contained in DTD, CIFAR-10 and Stanford Cars, so the representations do not change as much as they do for PatchCamelyon and EuroSAT, which are very different from ImageNet. To adapt to the PatchCamelyon and EuroSAT domains, the DNN model needs to learn new feature representations, which is strongly reflected in the M' heatmaps. We capture the relationship between knowledge transferability and inter-domain difference through the Fréchet inception distance (FID) [25], a measure based on the Wasserstein distance between the last-layer feature representations of InceptionV3; the FID quantifies the difference between DT and DS. Figure 4 shows the relationship between the FID score and the Transferability score on different datasets for the ViT-B/16 and ResNet-50 architectures. It can also be seen that the ViT-B/16 architecture changes its representations less significantly than ResNet-50, which indicates that Vision Transformers are able to extract more information from DS and adapt to DT more easily. This is consistent with the higher accuracy of ViT models in knowledge transfer compared with CNN models.
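A sketch of how the FID between two feature sets could be computed; the extraction of InceptionV3 features is assumed to be done elsewhere, and ready-made implementations such as torchmetrics' FrechetInceptionDistance could be used instead:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """FID between two sets of InceptionV3 features (arrays of shape (n_samples, n_features))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                            # discard negligible imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```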
Fig. 4. Relationship between the FID score and the Transferability score.
Table 1. Empirical comparison of DNN architectures on the CIFAR-10 target domain DT.

Architecture     Test accuracy   Number of layers   TA score
ResNet-18        82.1            68                 0.2298
ResNet-34        83.2            106                0.2113
ResNet-50        86.7            151                0.1737
ResNet-101       89.2            287                0.1690
ResNet-151       93.7            372                0.1672
ViT-B/16         95.2            140                0.1428
DenseNet-121     84.3            433                0.2574
DenseNet-161     84.8            573                0.3054
Evolution of the TA Score During Learning. We show the dynamics of the Transferability score during fine-tuning to the new target domain DT. When the test accuracy stabilizes, the Transferability score also stabilizes, and ViT's TAs values remain lower than ResNet's throughout training. The fluctuations of the TAs values in the first epochs can be explained by the stochastic optimization algorithm and the different numbers of classes in DS and DT (Fig. 5). The tracking procedure can be sketched as shown below.
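A sketch of how the TAs curve in Fig. 5 could be produced during fine-tuning; `train_one_epoch` and `extract_layer_activations` are hypothetical helpers supplied by the caller (not from the paper), and `transferability_score` is the function from the CKA sketch above:

```python
import copy

def tas_dynamics(model, probe_batch, train_one_epoch, extract_layer_activations, num_epochs):
    """Record the Transferability score after each fine-tuning epoch (cf. Fig. 5).
    extract_layer_activations(model, probe_batch) is assumed to return a list of
    per-layer activation matrices computed on a fixed probe batch."""
    source_model = copy.deepcopy(model)                          # snapshot of the D_S-trained weights
    feats_src = extract_layer_activations(source_model, probe_batch)
    history = []
    for _ in range(num_epochs):
        train_one_epoch(model)                                   # one epoch of fine-tuning on D_T
        feats_ft = extract_layer_activations(model, probe_batch)
        tas, _ = transferability_score(feats_src, feats_ft)      # defined in the earlier sketch
        history.append(tas)
    return history
```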
Fig. 5. Dynamics of TAs during fine-tuning for ResNet-50 and ViT-B/16. Red line: test accuracy; blue: TAs computed over all values of the matrix M'; green: TAs computed only over the diagonal of M'.
5 Discussion and Future Work

In this paper, we address the issue of interpreting the change in the similarity of internal feature representations during transfer learning. We have proposed a method for assessing the ability of a DNN architecture to transfer knowledge from a source domain to a target domain based on the dynamics of feature representations. We also study the dynamics of internal representations during training and the influence of network depth on changes in feature similarity. Our method additionally allows one to assess visually on which layers of the DNN the representations change during knowledge transfer. Based on TAs, we conclude that the Vision Transformer (ViT) architecture exhibits better knowledge transfer capability than CNN models. Refining this approach may be useful for choosing an optimal architecture. For future research, we propose to consider knowledge transfer not only within the image modality but also across modalities, for example, using features extracted from images for audio or text classification tasks.

Acknowledgment. The work of G. Magai was supported by the HSE University Basic Research Program. The work of A. Soroka was performed in the Tensor Processors laboratory of the Mephius Full-cycle Microelectronics Design Center (NRNU MEPhI) and IVA Technologies (HiTech).
References
1. Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning, pp. 3519–3529. PMLR (2019)
2. Davari, M., Horoi, S., Natik, A., Lajoie, G., Wolf, G., Belilovsky, E.: On the inadequacy of CKA as a measure of similarity in deep learning. In: ICLR 2022 Workshop on Geometrical and Topological Representation Learning (2022)
3. Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? In: Advances in Neural Information Processing Systems, vol. 34, pp. 12116–12128 (2021)
4. Nguyen, T., Raghu, M., Kornblith, S.: Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327v2 (2021)
5. Usman, M., Zia, T., Tariq, A.: Analyzing transfer learning of vision transformers for interpreting chest radiography. J. Digit. Imaging 35(6), 1445–1462 (2022)
6. Yang, J.: Leveraging CNN and vision transformer with transfer learning to diagnose pigmented skin lesions. Highlights Sci. Eng. Technol. 39, 408–412 (2023)
7. Ayana, G., et al.: Vision-transformer-based transfer learning for mammogram classification. Diagnostics 13(2), 178 (2023)
8. Nguyen, C., Hassner, T., Seeger, M., Archambeau, C.: LEEP: a new measure to evaluate transferability of learned representations. In: International Conference on Machine Learning, pp. 7294–7305. PMLR (2020)
9. Tran, A.T., Nguyen, C.V., Hassner, T.: Transferability and hardness of supervised classification tasks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1395–1405 (2019)
10. Bao, Y., et al.: An information-theoretic approach to transferability in task transfer learning. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 2309–2313. IEEE (2019)
11. Tan, Y., Li, Y., Huang, S.L.: OTCE: a transferability metric for cross-domain cross-task representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15779–15788 (2021)
12. You, K., Liu, Y., Wang, J., Long, M.: LogME: practical assessment of pre-trained models for transfer learning. In: International Conference on Machine Learning, pp. 12133–12143. PMLR (2021)
13. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
14. Ma, W.D.K., Lewis, J.P., Kleijn, W.B.: The HSIC bottleneck: deep learning without backpropagation. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
15. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)
16. Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In: Advances in Neural Information Processing Systems (2017)
17. Morcos, A., Raghu, M., Bengio, S.: Insights on representational similarity in neural networks with canonical correlation. In: Advances in Neural Information Processing Systems (2018)
18. Barannikov, S., Trofimov, I., Balabin, N., Burnaev, E.: Representation topology divergence: a method for comparing neural network representations. In: Proceedings of the 39th International Conference on Machine Learning, vol. 162, pp. 1607–1626. PMLR (2022)
19. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
20. Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 12(7), 2217–2226 (2019)
21. Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., Welling, M.: Rotation equivariant CNNs for digital pathology. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 210–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_24
22. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2013)
23. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
24. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
25. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Russian Language Speech Generation from Facial Video Recordings Using Variational Autoencoder
Miron M. Leonov(B), Artem A. Soroka, and Alexander G. Trofimov
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russia
[email protected]
Abstract. This paper describes the use of a variational autoencoder (VAE) based generative adversarial neural network (GAN) to generate Russian language speech from facial video recordings. The proposed system uses the VAE to learn a low-dimensional representation of the input video frames and the corresponding speech signals. A discriminator is used for regularization during training. The model is trained on Russian-language speech and video recordings of a speaking person collected from video hosting sites. The results show that the proposed method can generate high quality speech signals that are close to the original speech. The system is also able to generate speech with different emotions and speaking styles, demonstrating its potential for use in speech synthesis applications. Overall, the proposed method provides a promising approach for generating speech from video recordings of a face, which could have important applications in areas such as cinematography, game development, virtual reality systems, rehabilitation, and medicine.
Keywords: Variational Autoencoder · Generative-Adversarial Network · Speech Synthesis · Multimodal Machine Learning
1 Introduction

One of the recent trends in deep learning is the development and analysis of multimodal models, i.e., models that are able to process data of different modalities such as images and text, text and audio, or time series and video [1–3]. Multimodal models are used in autonomous vehicle control systems, brain-computer interfaces, cognitive research and many other applications. Unlike deep learning models that deal with data from a single modality, multimodal learning has a number of specific challenges, in particular multimodal data embedding, aggregating multimodal embeddings, finding correspondence between data of different modalities, and knowledge transfer from one modality to another. The generation of human speech based on facial video recordings, considered in this paper, is one of the applied tasks of multimodal machine learning. This task arises in
various applications such as cinematography, game development, virtual reality systems, rehabilitation, and medicine. The interest in deep-learning-based sound generation from video is confirmed by a large number of studies. In [4], a network for exploring associations between different scenes in a video and audio is presented. Papers [5] and [6] study the relationship between visual information and sound based on unsupervised learning on unlabeled videos. The works [1, 7, 8] implement the separation of audio signals using audiovisual information. In [9], a system for tracking a moving vehicle based on audiovisual relations is presented. Other studies include audiovisual segmentation [10] and audiovisual navigation [11]. The tasks of human speech generation [12–14] and natural sound generation [2, 15] from video recordings, and the corresponding approaches and models, can be divided into two groups depending on whether the audio signal is interpreted as a reference or as a signal that contains extra information irrelevant to the task. If the audio signal is considered a reference, dense or convolutional neural network architectures are used, and the performance of the proposed method can be precisely evaluated with common metrics such as MSE or MAE. In [14], the speech generation problem is solved assuming the absence of extra audio information in the ground-truth data, which allows the authors to solve the problem without using generative or variational network architectures. If the original audio track contains information that should not appear in the generated result, generative-adversarial and transformer architectures are used effectively; in this case, the accuracy of the obtained model is evaluated empirically. In [15], the task is to describe events in a video (e.g., a dog barking), and the training data were sounds containing a certain amount of irrelevant audio information (e.g., people talking in the background). The authors proposed a generative-adversarial neural network and evaluated the results with empirical methods, one of the most widespread being an expert survey. One of the specific challenges of speech generation is the presence of untraceable stochastic variations: the same person can pronounce semantically the same sounds in different manners, using different facial expressions and intonations. To address this problem, this paper proposes to use a variational autoencoder [16], which makes it possible to represent latent states (embeddings of multimodal data) in the form of probability distributions. Due to this property, the variational autoencoder can be successfully applied to generation problems [17] and to processing multimodal data, in particular sound and images [18].
2 Methodology

2.1 Problem Statement

A model solving the problem of generating an audio recording from a video can be described as a function G : V → A, where V is the set of video recordings (time-ordered image sequences) of arbitrary duration, and A is the set of corresponding audio recordings (speech signals). It is assumed that the video recordings show the face of a speaking person in unchanged surroundings.
Generally, when working with audio information, a transition is made from audio signals to mel-spectrograms, which represent sound in the form closest to human perception [19]. In what follows, the mel-spectrogram corresponding to a video recording of a speaking person is understood as the mel-spectrogram of the speech spoken by this person. Each video fragment can be represented as a pair (v, m), where v ∈ V is a sequence of images containing only the face of the speaking person, m ∈ M is the corresponding mel-spectrogram, and M is the domain of mel-spectrograms. Let the function G be the result of sequentially processing 1-s video segments, and denote the 1-s segment generation function by g. The duration (1 s) is chosen empirically, based on the trade-off between model complexity and the quality of audio recovery. Thus, a simulated speech signal is generated by applying the function g to sequential video segments of one-second duration, which together form an audio recording of any length. Assume that the function g is the composition g = R ◦ Φ, where R is the inverse transformation of a mel-spectrogram into a 1-s audio signal and Φ is the transformation of a 1-s video recording into a mel-spectrogram. So a = R(Φ(v)), where Φ(v) is the mel-spectrogram of the video recording v ∈ V and a ∈ A is the corresponding audio signal. To approximate the function Φ : V → M, we propose to use a VAE-based generative adversarial neural network. The transformation performed by the proposed model is a composition d ◦ r ◦ eM : M → M. Here eM : M → P is a function that calculates the parameters (the vector of expected values μ and the vector of standard deviations σ, (μ, σ) ∈ P) of the distribution of the latent variables characterizing the mel-spectrogram m ∈ M; r : P → Z is a random vector generator from a distribution with given parameters, where Z is the set of random vectors of a given dimension; d : Z → M (the decoder) is a transformation of the latent variable vector into a mel-spectrogram. The function r implements the reparameterization trick: it takes a vector of random values y from the multivariate normal distribution N(0, 1), multiplies y by σ and adds μ, where (μ, σ) ∈ P. Let us define Φ̂ : V → M as the composition d ◦ r ◦ eV : V → M, where eV : V → P is the transformation of a video clip v ∈ V into the parameters (expected value μ and standard deviation σ, (μ, σ) ∈ P) of the distribution of latent variables characterizing the video clip. After training a neural network that approximates eM (the audio encoder) on a sample of mel-spectrograms, the resulting model is used to train the function eV (the video encoder) according to the Kullback-Leibler criterion:

KL(eV, eM) → min.     (1)
Thus, while the encoder eV is trained, its parameters are adjusted so that the distributions it produces become close to those produced by the encoder eM. The use of the auxiliary function eM is motivated by the surjectivity of the mapping V → M; assuming this mapping is bijective, we can predict the distributions of the latent variables z ∈ Z, which allows us to approximate the function eV. The goal of this step-by-step method is to find a neural network model capable of generating a latent representation of the mel-spectrograms m ∈ M from which the audio recordings a ∈ A can be reconstructed with high accuracy.
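A minimal PyTorch sketch of the reparameterization function r and of a Kullback-Leibler term that could implement criterion (1); the closed-form expression for diagonal Gaussian distributions is an implementation assumption rather than something stated in the paper:

```python
import torch

def reparameterize(mu, log_var):
    """The function r: sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def kl_divergence(mu_p, log_var_p, mu_q, log_var_q):
    """KL(p || q) for diagonal Gaussians; here p is produced by the audio encoder e_M
    and q by the video encoder e_V, so minimizing it aligns the two latent distributions."""
    var_p, var_q = log_var_p.exp(), log_var_q.exp()
    kl = 0.5 * (log_var_q - log_var_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return kl.sum(dim=-1).mean()
```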
Note that applying commonly used approximation quality metrics to the mel-spectrogram obtained as the output of the model Φ̂ for a video fragment v ∈ V can be biased. This is caused by the peculiarities of human speech perception: for example, a different voice pitch in an audio recording produces a significantly different spectrogram, yet it is perceived equally well by a human. In this regard, this paper proposes an empirical study (a survey) to evaluate the quality of the approximation model Φ̂, in which people evaluate the consistency of an audio recording derived from a generated mel-spectrogram with the video fragment according to a set of criteria.

2.2 Learning Algorithm

Audio Encoder and Decoder. A combination of the mean squared error (MSE) and the binary cross-entropy (BCE) of the discriminator is used as the error function for training the audio encoder eM and the decoder d:

L1 = α · (1/n) · Σ_{i=1}^{n} (mi − m̂i)² + β · log(D(m̂)),     (2)

where n is the number of elements in the mel-spectrograms, mi and m̂i are the i-th elements of the original and generated mel-spectrograms respectively, D is the discriminator function described below, and α and β are weight coefficients. The weights α and β are chosen so that in the early stages of learning the contribution of the MSE significantly exceeds the contribution of the BCE, while in the later stages the contributions of MSE and BCE are equal. This logic is driven by the need for rapid adaptation of the model to the required range of output data (the matrix describing the mel-spectrogram), which is achieved by the greater influence of the MSE in the early stages of training; subsequently, the comparable influence of the two errors avoids overfitting and ensures the robustness of the model.

Discriminator. The discriminator (the function D) is the neural network we use for regularization. It is needed because using only the MSE as the error function could lead to averaging of the generated mel-spectrograms, which would be harmful to the generalization capabilities of the model. The error function for the discriminator is the binary cross-entropy [20]:

L2 = −(y log(p) + (1 − y) log(1 − p)),     (3)

where y is the target spectrogram label (1 if the spectrogram is genuine and 0 if the spectrogram is generated by the model) and p is the discriminator score.

Video Encoder. The pre-trained audio encoder eM is used to train the video encoder eV. We relate the latent distributions of the audio encoder and the video encoder using the Kullback-Leibler divergence measure:

L3 = Σ_{i=1}^{n} p(zi) log(p(zi) / q(zi)),     (4)
where zi is the i-th element of the latent state vector z ∈ Z generated by the video encoder eV, and p and q are the probability density functions of the latent variables with the distribution parameters generated by the audio encoder eM and the video encoder eV, respectively. The training and inference scheme of the proposed VAE-based neural network model with a discriminator is shown in Fig. 1. We used residual connections in the encoder implementation to increase the robustness of the resulting model and to mitigate the vanishing gradient problem. To evaluate our method, we trained and tested three additional models (a sketch of the combined training loss is given after this list):

1. Autoencoder (AE) based generative model. In this model, Φ̂ is approximated by the autoencoder Φ̂ = d ◦ eV, where eV : V → H, d : H → M, H is the autoencoder hidden space, and the loss function L1 = (1/n) · Σ_{i=1}^{n} (mi − m̂i)² was used for training.
2. Non-regularized VAE-based generative model. The VAE model described above was trained with the L1 loss function.
3. Non-regularized VAE-based generative model with residual connections. A ResNet-like architecture [21] was used as the variational autoencoder, with L1 as the loss function.

These models help us evaluate the influence of specific elements of the model on the learning process and on the accuracy of the model as a whole.
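A PyTorch sketch of losses (2) and (3), written as a direct transcription of the formulas above; it assumes that the discriminator D outputs a probability that its input mel-spectrogram is genuine, and the α/β schedule described in Sect. 2.2 is not shown:

```python
import torch
import torch.nn.functional as F

def loss_generator(m_true, m_gen, d_score_gen, alpha, beta, eps=1e-8):
    """Loss (2): alpha * MSE between mel-spectrograms + beta * log D(m_gen)."""
    mse = F.mse_loss(m_gen, m_true)
    adv = torch.log(d_score_gen + eps).mean()   # adversarial term as written in Eq. (2)
    return alpha * mse + beta * adv

def loss_discriminator(d_score, is_genuine):
    """Loss (3): binary cross-entropy between the discriminator score p and the label y."""
    y = torch.full_like(d_score, 1.0 if is_genuine else 0.0)
    return F.binary_cross_entropy(d_score, y)
```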
Fig. 1. Training (a) and inference (b) pipelines of the proposed neural network model.
3 Experiments and Results

Since there is no publicly available dataset suitable for our requirements (video without irrelevant noise and full-face footage of a person speaking in Russian), we created our own dataset for model training and experimental research, consisting of video clips collected from video hosting sites. We selected three speakers and collected three hours of video footage for each of them. The main metrics used to evaluate the accuracy of mel-spectrogram generation were STOI [22] and PESQ [23]; MSE and MAE were also used to evaluate the approximation quality. The STOI score compares the temporal correlations between the original signal and its processed version. The signals are divided into short time intervals, called frames, and the correlation between the frames of the original and processed signals is calculated; the higher the correlation, the more understandable and intelligible the processed signal is considered [22]. The PESQ measures the difference between the original and reconstructed audio signals taking into account human perception of these signals. It is based on a model of human hearing, which considers the psychoacoustic characteristics of sound perception [23]. We trained and tested the four models described above. Each model was trained for 300 epochs on the same dataset, with a training/test data splitting ratio of 20:1. The results of the models on the test dataset are presented in Table 1. The better STOI and MSE values of the AE model are due to this model generating speechless uniform noise, which biases the averaged estimates; in fact, the model failed to minimize the error function L1, as confirmed by its low PESQ value.

Table 1. Quality metrics of speech generation from facial video recordings using different models.

Model                                           STOI↑   PESQ↑   MSE↓    MAE↓
1. AE                                           0.258   0.689   0.461   0.164
2. Non-regularized VAE                          0.109   1.041   0.547   0.162
3. Non-regularized VAE + Residual connections   0.123   1.047   0.564   0.162
4. VAE + Residual connections + Discriminator   0.127   1.057   0.549   0.161
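The two perceptual metrics can be computed with openly available Python packages; the snippet below is a sketch that assumes the pystoi and pesq packages are used and that the reference and generated waveforms are time-aligned 1-D arrays sampled at 16 kHz:

```python
from pystoi import stoi   # assumed dependency: the pystoi package
from pesq import pesq     # assumed dependency: the pesq package (ITU-T P.862 wrapper)

def speech_quality(reference, generated, sample_rate=16000):
    """STOI and PESQ for one reference/generated waveform pair (1-D float arrays)."""
    return {
        "STOI": stoi(reference, generated, sample_rate, extended=False),
        "PESQ": pesq(sample_rate, reference, generated, "wb"),  # wide-band mode for 16 kHz
    }
```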
Despite the similar MSE values reached on the training data (Fig. 2), models 3 (non-regularized VAE + residual connections) and 4 (VAE + residual connections + discriminator) show higher PESQ and STOI values on the test data. This effect is caused by the robustness of the residual neural network model and the regularization provided by the discriminator. Using the residual neural network model also accelerates the learning of the video encoder, as shown in Fig. 3. Figure 4 shows the outputs of the convolution blocks of the video encoder: the main focus is on the lips of the person in the video, which indicates that the pixel values in the lip area of the speaker have the major influence on the latent space representation.

Fig. 2. Losses of the models during training.
Fig. 3. Losses of video encoders during training.
Fig. 4. Outputs of video encoder convolution blocks.
To empirically assess the quality of the audio recordings generated by the model, we conducted a survey in a group of 10 people (men and women aged 20–27). The respondents had to watch three video clips (one for each speaker from the data set), voiced using the developed model. The participants rated the performance of the method from 0 to 5 on each of the criteria:
1. speech intelligibility;
2. sound quality;
3. synchronization of the reconstructed speech with the speaker's lip movements.

The mean values and standard deviations (SD) of the estimates are presented in Table 2.

Table 2. The results (mean and SD) of the survey according to several criteria.

# of video   intelligibility   quality      synchronization
1            2.4 ± 0.18        2.9 ± 0.15   2.8 ± 0.16
2            2.8 ± 0.21        3.0 ± 0.1    3.2 ± 0.17
3            2.7 ± 0.19        2.7 ± 0.18   3.5 ± 0.22
Thus, the overall average score given by our respondents (2.9 ± 0.17 on a scale of 0 to 5) indicates that the proposed model is capable of generating human-understandable Russian language speech reasonably well.
4 Conclusion

This paper formulates the problem of speech generation based on video recordings of a human face and proposes four neural network models to solve it, three of which are VAE-based. The experimental research showed that the best model is the variational autoencoder combined with the discriminator and a residual encoder architecture. To evaluate the quality of human perception of the generated Russian language speech, several respondents were interviewed and rated the performance of the proposed models according to several criteria. The metric that takes human perception of sound into account (PESQ = 1.057) and the scores obtained from the interviews (2.9 ± 0.17 on a scale of 0 to 5) indicate a high performance of the proposed model in generating human-understandable speech. One direction of future work is to improve the proposed neural network architecture, in particular by making the models deeper and using the transformer architecture to increase the key metrics (STOI, PESQ). Another direction is to modify the training and inference pipelines so that they can be applied to multiple speakers.

Acknowledgment. This work was performed in the laboratory "Tensor Processors" of the MEPHIUS Microelectronics Design Center (National Research Nuclear University MEPhI) and IVA Technologies (HiTech Company).
References
1. Tibebu, H., Malik, A., De Silva, V.: Text to image synthesis using stacked conditional variational autoencoders and conditional generative adversarial networks. In: Arai, K. (ed.) SAI 2022. LNNS, vol. 506, pp. 560–580. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-10461-9_38
2. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.L.: Visual to sound: generating natural sound for videos in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3550–3558 (2018)
3. Gan, C., Huang, D., Zhao, H., Tenenbaum, J.B., Torralba, A.: Music gesture for visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10478–10487 (2020)
4. Harwath, D., Torralba, A., Glass, J.: Unsupervised learning of spoken language with visual context. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
5. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617 (2017)
6. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
7. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 587–604. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_35
8. Zhao, H., Gan, C., Ma, W.C., Torralba, A.: The sound of motions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1735–1744 (2019)
9. Gan, C., Zhao, H., Chen, P., Cox, D., Torralba, A.: Self-supervised moving vehicle tracking with stereo sound. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7053–7062 (2019)
10. Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., Torralba, A.: Self-supervised audio-visual co-segmentation. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. IEEE (2019)
11. Gan, C., Zhang, Y., Wu, J., Gong, B., Tenenbaum, J.B.: Look, listen, and act: towards audio-visual embodied navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9701–9707. IEEE (2020)
12. Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 8717–8727 (2018)
13. Faisal, M., Manzoor, S.: Deep learning for lip reading using audio-visual information for Urdu language. arXiv preprint arXiv:1802.05521 (2018)
14. Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.V.: Learning individual speaking styles for accurate lip to speech synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13796–13805 (2020)
15. Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., Gan, C.: Generating visually aligned sound from videos. IEEE Trans. Image Process. 29, 8292–8302 (2020)
16. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
17. Chira, D., Haralampiev, I., Winther, O., Dittadi, A., Liévin, V.: Image super-resolution with deep variational autoencoders. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) ECCV 2022. LNCS, vol. 13802, pp. 395–411. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-25063-7_24
18. Żelaszczyk, M., Mańdziuk, J.: Audio-to-image cross-modal generation. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2022)
19. Stevens, S.S., Volkmann, J., Newman, E.B.: A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8(3), 185–190 (1937)
20. Goodfellow, I., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
22. Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217. IEEE (2010)
23. Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P.: Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2, pp. 749–752. IEEE (2001)
Author Index
A Akhmadjonov, Mumtozbek 39 Al Adel, Arij 93 Alexandrov, Igor O. 323 Alexandrov, Yuri I. 323 Antsiperov, Viacheslav E. 121 Arapov, Vasiliy V. 323 Artamonov, Igor M. 226 Artamonova, Yana N. 226 B Bakaev, Maxim A. 188 Bakhshiev, Aleksandr 346 Barinov, Oleg G. 398 Barinov, Oleg 370 Bartsev, Sergey I. 206 Beskhlebnova, Galina A. 257 Bogdanova, Elizaveta A. 389 Bredikhin, Dmitry O. 149 Bulava, Alexandra I. 323 Burikov, Sergey 445 Burlakov, Evgenii 165 C Chistyakova, Maria 239 Chulin, Maxim I. 406 D Dolenko, Sergey 279, 445, 455 Dolenko, Tatiana 445 Dorofeev, Vladislav 308 E Ekizyan, A. Kh. 157 Engel, Ekaterina A. 362 Engel, Nikita E. 362 F Fomin, Ivan
346
G Gapanyuk, Yuriy E. 435 Gerasimov, Anton K. 188 Gorkin, Alexander G. 323 Grechenko, Tatiana N. 323 Gurtovoy, Konstantin 135, 165 Guskov, Artem 445 H Huang, Yaowen
32
I Isaev, Igor 279, 445, 455 Ivanitsky, Alexander 111 Ivanov, Dmitry 111 Ivanova, Victoria V. 337 Ivanova, Viktoria 346 K Kabir, A. S. M. Humaun 54 Kaganov, Yuriy T. 61 Kanev, Anton I. 32, 380 Karchkov, Denis 13 Karimov, Elvir Z. 398 Kazantsev, Victor B. 83 Kharlamov, Alexander Alexandrovich Khoroshilov, Dmitry A. 149 Kireev, Maxim 196 Kiroy, V. N. 157 Kiselev, Mikhail 111 Klucharev, Vasily A. 149 Knyazeva, Irina 196 Kolonin, Anton 3 Konovalov, Vasily 22 Korotkov, Alexander 196 Korsakov, Anton 346 Kostulin, D. V. 157 Kotov, Vladimir B. 247, 257 Kozin, Alexey V. 188 Krasnov, Alexander 13 Kunitsyn, Dmitry E. 102
54
Kupriyanov, Gavriil 279 Kuznetsov, Denis 39 L Laptinskiy, Kirill 445 Larionov, Denis 111 Lazovskaya, Tatiana 239 Lekhnitskaya, Polina A. 173 Leonov, Miron M. 489 Levanova, Tatiana A. 293 M Magai, German I. 479 Markova, Galiya M. 206 Masharipov, Ruslan 196 Medvedeva, Tatiana 196 Minakov, Grigory 39 Moiseeva, Victoria V. 149 Monahhova, Eliana 149 Morozova, Alexandra N. 149 Moskalenko, Viktor 13 Myagkova, Irina N. 398 Myagkova, Irina 370 N Novoseletsky, Valery N.
389
O Obornev, Eugeny 455 Obornev, Ivan 455 Osipov, Grigory 13 Osipova, Zhanna A. 323 Ovcharenko, Kirill 467 P Palamarchuk, Veronika 239 Pavlov, Alexander V. 188 Pavlyukova, Elena R. 121 Podoprigorova, Natalya S. 380 Putrolaynen, Vadim V. 102 R Rabcevich, Ksenia R. 380 Razin, Vyacheslav 13
Razumov, Egor 239 Red’ko, Vladimir G. 267 Revunkov, Georgiy I. 435 Rodionov, Denis 13 Rodionov, Eugeny 455 Rybka, Roman B. 102
S Sakharova, Elizaveta K. 32 Savchenko, Grigory A. 380 Sboev, Alexander G. 102 Serenko, Alexey V. 102 Sergeeva, Anna 239 Shaitan, Konstantin V. 389 Shaposhnikov, D. G. 157 Shaposhnikov, P. D. 157 Shelomentseva, I. G. 355 Shestakova, Anna N. 149 Shikohov, Andrey N. 380 Shimelevich, Mikhail 455 Shirokiy, Vladimir R. 398 Shirokiy, Vladimir 370 Silkis, Isabella G. 179 Smirnitskaya, I. A. 314 Smirnov, Lev 13 Sofronova, Olga 214 Sokhova, Zarema B. 247, 267 Soroka, Artem A. 479, 489 Stankevich, Lev A. 141 Stasenko, Sergey V. 83, 293
T Taran, Maria O. 435 Tarasov, Andrey V. 380 Tarkhov, Dmitriy 239 Tarkov, Mikhail S. 337 Terekhov, Valery I. 32 Tirskikh, Danil 22 Tiumentsev, Yury V. 406, 420 Trofimov, Alexander G. 489 Tshay, Roman A. 420
U Ushakov, Vadim
165
V Verkhlyutov, Vitaly 135, 165 Vladimirov, Roman 370 Voronkov, Ilia Mikhailovich 54 Vvedensky, Victor 135, 165 Y Yudin, Dmitry 72
Z Zarubin, Ruslan A. 406 Zhang, Che 32 Zhang, Huzhenyu 72 Zharikov, Ilia 467 Zharikova, Dilyara 214 Zhilyakova, Liudmila 300 Zolotykh, Nikolai 13