Recognition and Perception of Images
Scrivener Publishing, 100 Cummings Center, Suite 541J, Beverly, MA 01915-6106. Publishers at Scrivener: Martin Scrivener ([email protected]) and Phillip Carmical ([email protected]).
Recognition and Perception of Images Fundamentals and Applications
Edited by
Iftikhar B. Abbasov
This edition first published 2021 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA. © 2021 Scrivener Publishing LLC. For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters: 111 River Street, Hoboken, NJ 07030, USA. For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data
ISBN 9781119750550

Cover image: Face Recognition (Scharfsinn86 | Dreamstime.com)
Cover design by Kris Hackerott

Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines

Printed in the USA

10 9 8 7 6 5 4 3 2 1
Contents

Abstract xiii
Preface xv

1 Perception of Images. Modern Trends 1
Iftikhar B. Abbasov
1.1 Visual System 1
1.1.1 Some Modern Research 1
1.1.2 Light Perception 5
1.1.3 Vertebrate Eye Anatomy 5
1.1.4 Projection Areas of the Brain 8
1.2 Eye. Types of Eye Movement 10
1.2.1 Oculomotor Muscles and Field of View 10
1.2.2 Visual Acuity 10
1.2.3 Types of Eye Movement 12
1.2.4 Effects of Masking and Aftereffects 21
1.2.5 Perception of Contour and Contrast 23
1.2.6 Mach Bands, Hermann's Grid 24
1.2.7 Light Contrast 27
1.2.8 Object Identification 27
1.2.9 Color Vision Abnormalities 28
1.3 Perception of Figures and Background 33
1.3.1 Dual Perception of the Connection "Figure-Background" 34
1.3.2 Gestalt Grouping Factors 35
1.3.3 Subjective Contours 40
1.3.4 The Dependence of Perception on the Orientation of the Figure 43
1.3.5 The Stroop Effect 45
1.4 Space Perception 46
1.4.1 Monocular Spatial Signs 46
1.4.2 Monocular Motion Parallax 48
1.4.3 Binocular Signs 48
1.4.4 Binocular Disparity and Stereopsis 49
1.5 Visual Illusions 49
1.5.1 Constancy Perception 51
1.5.2 The Development of the Process of Perception 52
1.5.3 Perception after Surgery Insight 53
1.5.4 Illusion of the Moon 54
1.5.5 Illusions of Muller-Lyer, Ponzo, Poggendorf, Zolner 55
1.5.6 Horizontal – Vertical Illusion 57
1.5.7 Illusions of Contrast 57
1.6 Conclusion 60
References 60

2 Image Recognition Based on Compositional Schemes 63
Victoria I. Barvenko and Natalia V. Krasnovskaya
2.1 Artistic Image 63
2.2 Classification of Features 69
2.3 Compositional Analysis of an Art Work 71
2.4 Classification by Shape, Position, Color 73
2.5 Classification According to the Content of the Scenes 76
2.6 Compositional Analysis in Iconography 80
2.7 Associative Mechanism of Analysis 83
2.8 Conclusions 86
References 86

3 Sensory and Project Images in the Design Practice 89
Anna A. Kuleshova
3.1 Sensory Image Nature 89
3.2 Language and Images Symbolics 96
3.3 Methods of Images Production in Ideas 102
3.4 Personality Image Projecting 106
3.5 Project Image 108
3.6 Conclusion 120
References 121

4 Associative Perception of Conceptual Models of Exhibition Spaces 125
Olga P. Medvedeva
4.1 Associative Modeling of the Exhibition Space Environment 125
4.1.1 Introduction 125
4.1.2 Conceptual and Terminological Apparatus of Conceptual Modeling and Shaping 127
4.1.3 Compositional and Planning Basis for Creating the Environment of Exhibition Spaces 128
4.1.4 Scenario Approach in the Figurative Solution of Environmental Spaces 128
4.1.5 Conceptual Approach to Creating Exhibition Spaces 129
4.1.6 Perception of the Figurative Solution of the Environment 129
4.2 Associative Modeling of Environmental Objects in Exhibition Spaces 134
4.2.1 Conceptual and Figurative Basis for the Formation of Environmental Objects 134
4.2.2 Associative and Imaginative Modeling of the Environmental Objects 134
4.2.3 Cognitive Bases of Perception of Associative-Figurative Models of Objects in Environmental Spaces 135
4.2.4 Perception of the Figurative Solution of an Environmental Object 136
4.2.5 Options of Conceptual and Figurative Modeling of Objects in Environmental Spaces 136
4.3 Conclusion 141
References 141

5 Disentanglement For Discriminative Visual Recognition 143
Xiaofeng Liu
5.1 Introduction 144
5.2 Problem Statement. Deep Metric Learning Based Disentanglement for FER 149
5.3 Adversarial Training Based Disentanglement 152
5.4 Methodology. Deep Metric Learning Based Disentanglement for FER 154
5.5 Adversarial Training Based Disentanglement 159
5.5.1 The Structure of Representations 159
5.5.2 Framework Architecture 160
5.5.3 Informative to Main-Recognition Task 160
5.5.4 Eliminating Semantic Variations 161
5.5.5 Eliminating Latent Variation 162
5.5.6 Complementary Constraint 162
5.6 Experiments and Analysis 162
5.6.1 Deep Metric Learning Based Disentanglement for FER 162
5.6.2 Adversarial Training-Based Disentanglement 169
5.7 Discussion 176
5.7.1 Independent Analysis 176
5.7.2 Equilibrium Condition 176
5.8 Conclusion 178
References 179

6 Development of the Toolkit to Process the Internet Memes Meant for the Modeling, Analysis, Monitoring and Management of Social Processes 189
Margarita G. Kozlova, Vladimir A. Lukianenko and Mariia S. Germanchuk
6.1 Introduction 190
6.2 Modeling of Internet Memes Distribution 193
6.3 Intellectualization of System for Processing the Internet Meme Data Flow 197
6.4 Implementation of Intellectual System for Recognition of Internet Meme Data Flow 207
6.5 Conclusion 216
References 217

7 The Use of the Mathematical Apparatus of Spatial Granulation in The Problems of Perception and Image Recognition 221
Sergey A. Butenkov, Vitaly V. Krivsha and Nataly S. Krivsha
7.1 Introduction 221
7.2 The Image Processing and Analysis Base Conceptions 222
7.2.1 The Main Stages of Image Processing 222
7.2.2 The Fundamentals of a New Hybrid Approach to Image Processing 223
7.2.3 How is this New Approach Different? 223
7.3 Human Visual Perception Modeling 224
7.3.1 Perceptual Classification of Digital Images 224
7.3.2 The Vague Models of Digital Images 226
7.4 Mathematic Modeling of Different Kinds of Digital Images 227
7.4.1 Images as the Special Kind of Spatial Data 228
7.4.2 Fundamentals of Topology and Digital Topology 230
7.4.3 Regularity and the Digital Topology of Regular Regions 230
7.5 Zadeh's Information Granulation Theory 232
7.6 Fundamentals of Spatial Granulation 235
7.6.1 Basic Ideas of Spatial Granulation 235
7.6.2 Abstract Vector Space 236
7.6.3 Abstract Affine Space 237
7.6.4 Cartesian Granules in an Affine Space 237
7.6.5 Granule-Based Measures in Affine Space 240
7.6.6 Fuzzy Spatial Relation Over the Granular Models 240
7.7 Entropy-Preserved Granulation of Spatial Data 241
7.8 Digital Images Granulation Algorithms 243
7.8.1 Matroids and Optimal Algorithms 244
7.8.2 Greedy Image Granulation Algorithms 244
7.9 Spatial Granulation Technique Applications 247
7.9.1 Granulation of Graphical DataBases 247
7.9.2 Automated Target Detection (ATD) Problem 250
7.9.3 Character Recognition Problem 251
7.9.4 Color Images Granulation in the Color Space 252
7.9.5 Spatial Granules Models for the Curvilinear Coordinates 253
7.9.6 Color Histogram for Color Images Segmentation 255
7.10 Conclusions 257
References 257

8 Inverse Synthetic Aperture Radars: Geometry, Signal Models and Image Reconstruction Methods 261
Andon D. Lazarov and Chavdar N. Minchev
8.1 Introduction 261
8.2 ISAR Geometry and Coordinate Transformations 263
8.2.1 3-D Geometry of ISAR Scenario 263
8.2.2 3-D to 2-D ISAR Geometry Transformation 266
8.3 2-D ISAR Signal Models and Reconstruction Algorithms 274
8.3.1 Linear Frequency Modulation Waveform 274
8.3.2 2-D LFM ISAR Signal Model - Geometric Interpretation of Signal Formation 275
8.3.3 ISAR Image Reconstruction Algorithm 277
8.3.4 Correlation - Spectral ISAR Image Reconstruction 279
8.3.5 Phase Correction Algorithm and Autofocusing 280
8.3.6 Barker Phase Code Modulation Waveform 289
8.3.7 Barker ISAR Image Reconstruction 290
8.3.8 Image Quality Criterion and Autofocusing 291
8.4 3-D ISAR Signal Models and Image Reconstruction Algorithms 296
8.4.1 Stepped Frequency Modulated ISAR Signal Model 296
8.4.2 ISAR Image Reconstruction Algorithm 298
8.4.3 Complementary Codes and Phase Code Modulated Pulse Waveforms 306
8.4.4 ISAR Complementary Phase Code Modulated Signal Modeling 309
8.4.5 ISAR Image Reconstruction Procedure 311
8.4.6 Parametric ISAR Image Reconstruction 317
8.5 Conclusions 323
Acknowledgment 324
References 324

9 Remote Sensing Imagery Spatial Resolution Enhancement 327
Sergey A. Stankevich, Iryna O. Piestova and Mykola S. Lubskyi
9.1 Introduction 328
9.2 Multiband Aerospace Imagery Informativeness 328
9.3 Equivalent Spatial Resolution of Multiband Aerospace Imagery 330
9.4 Multispectral Imagery Resolution Enhancement Based on Spectral Signatures' Identification 336
9.5 Multispectral Imagery Resolution Enhancement Using Subpixels Values Reallocation According to Land Cover Classes' Topology 341
9.6 Remote Sensing Longwave Infrared Data Spatial Resolution Enhancement 346
9.7 Issues of Objective Evaluation of Remote Sensing Imagery Actual Spatial Resolution 359
9.8 Conclusion 360
References 361

10 The Theoretical and Technological Peculiarities of Aerospace Imagery Processing and Interpretation By Means of Artificial Neural Networks 369
Oleg G. Gvozdev
10.1 Introduction 371
10.2 Peculiarities of Aerospace Imagery, Ways of its Digital Representation and Tasks Solved on It 373
10.2.1 Peculiarities of Technological Aerospace Imaging Process 375
10.2.2 Aerospace Imagery Defects 378
10.2.3 Aerospace Imagery Channel/Spectral Structure 378
10.2.4 Aerospace Imagery Spatial Resolution 380
10.2.5 Radiometric Resolution of Aerospace Imagery 381
10.2.6 Aerospace Imagery Data Volumes 382
10.2.7 Aerospace Imagery Labeling 385
10.2.8 Limited Availability of Aerospace Imagery 386
10.2.9 Semantic Features of Aerospace Imagery 386
10.2.10 The Tasks Solved by Means of Aerospace Imagery 387
10.2.11 Conclusion 388
10.3 Aerospace Imagery Preprocessing 390
10.3.1 Technological Stack of Aerospace Imagery Processing 391
10.3.2 Structuring and Accessing to Aerospace Datasets 392
10.3.3 Standardization of Measurements Representation 394
10.3.4 Handing of Random Channel/Spectral Image Structure 397
10.3.5 Ensuring of Image Sizes Necessary for Processing 398
10.3.6 Tile-Based Image Processing 399
10.3.7 Design of Training Samples from the Aerospace Imagery Sets 402
10.4 Interpretation of Aerospace Imagery by Means of Artificial Neural Networks 406
10.4.1 ANN Topologies Building Framework Used for Aerospace Imagery Processing 407
10.4.2 Object Non-Locality and Different Scales 413
10.4.3 Topology Customizing to the Different Channel/Spectral Structures of Aerospace Imagery 418
10.4.4 Integration of Aerospace Imagery with the Different Spatial Resolution 421
10.4.5 Instance Segmentation 421
10.4.6 Learning Rate Strategy 423
10.4.7 Program Interfaces Organization 424
10.4.8 Recommendations on the Framework Application 435
10.5 Conclusion 436
References 438

Index 445
Abstract

This book is dedicated to unique interdisciplinary research on imagery processing, recognition and perception. Its contents are based on the concepts of mathematical processing, compositional analysis as applied in art and design, and the psychological factors of the information perception process. Compositional analysis carried out in the course of image processing and recognition, the creation of image project solutions and the modeling of conceptual space structures are considered together with the mechanisms of their perception. The influence of the flow of Internet memes on social networks is described, as is face recognition technology robust to interference. Algorithms of perception and of accuracy improvement needed for satellite imagery recognition and for complex reflection from objects are presented with the use of artificial neural networks. The book may be of interest both to engineers and to painters. Moreover, it may be useful for students and researchers working in the field of imagery processing, recognition and perception.

Reviewers:
- Dr. Ratnadeep R. Deshmukh, Professor, Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
- Dr. Gennady V. Kupovykh, Doctor of Physical and Mathematical Sciences, Professor, Head of the Department of Higher Mathematics, Engineering-Technological Academy of Southern Federal University, Taganrog, Russia
Preface

Visual information is predominant in everyday human life. The use of recognition systems in the different fields of modern life is becoming the order of the day. These systems both ensure social security and make the living environment of the community more comfortable. They have become available to each of us, so we are no longer surprised by the face recognition capability of our smartphones.

The first chapter analyses the issues relating to imagery recognition and surveys modern tendencies. The peculiarities of image processing and perception are discussed, and the corresponding neural processes in the brain and their psychological manifestations are described. The issues of compositional analysis carried out in the course of processing and recognition of images of different purposes are envisaged in the second chapter. The stages of image composition analysis and the detection of differential characteristics on the basis of current methods are represented. The third chapter is dedicated to the study of the perception mechanism and the creation of image solutions in the course of design project activities. Color, form and composition, as elements of the artistic work, are sign-symbolic means of image-bearing expression. The fourth chapter deals with issues of associative perception of conceptual space structures. The associative simulation of environmental objects and structures is considered with due regard to the psychology of perception. The technology of extracting the representations informative for the basic recognition task is studied in the fifth chapter. A face recognition method that is stable regardless of lighting, make-up and masking is represented. The sixth chapter is dedicated to the development of a methodology and the software implementation of algorithms for the intellectualized processing of Internet meme flows and the detection of their social and political influence in social networks. New imagery perception and recognition methods, as well as algorithms based on a mathematical approach to the granulation of satellite data, are pointed out in the seventh chapter. Algorithms for large classes of data with the spectral portraits of different signals are introduced. The methods of inverse synthetic aperture radar imaging are described in the eighth chapter. The process of signal formation after complex reflection from the object is described analytically. A new method to increase the resolution of multispectral images through the use of basic spectra is offered in the ninth chapter. The radiometric signals are redistributed among the image subpixels to improve the overall physical resolution of the image. The tenth chapter deals with the theoretical and technological aspects of aerospace imagery processing and interpretation. The peculiarities of aerospace images, the methods of their digital representation and the interpretation of these images by means of current artificial neural networks are described.

The partition of visual information into constituent parts by means of artistic-compositional analysis may be used in computer vision as well as in the analysis and recognition of artistic works. The combination of artistic peculiarities with modern recognition technologies may also be implemented in test systems designed for the diagnostics of human well-being. The book may be useful for students, engineers and researchers working in the field of imagery processing, recognition and perception.

Editor
Prof. Iftikhar B. Abbasov
1 Perception of Images. Modern Trends Iftikhar B. Abbasov
Southern Federal University, Academy of Engineering and Technology, Department of Engineering Graphics and Computer Design, Taganrog, Russia
Abstract
This chapter outlines a survey of some current lines of research in the field of visual information perception. The peculiarities of the eye anatomy of vertebrates, the structural elements of the visual system, the process of sensation and the perceptual organization of visual information by the brain are analyzed. The modes of eye movements, the structure of the eyeball, and current methods and means of eye movement recording are described. Questions concerning the perception of contour and light contrast and the types of color sensing disorders are represented. The concepts of figure and ground are defined, and the factors of gestalt grouping, subjective contours and the recognition of images in space are analyzed. The peculiarities of monocular and binocular vision are pointed out. The process of perceptual organization, constancy factors, types of visual illusions and the causes of their occurrence are described.

Keywords: Visual system, sense, perception, types of eye movements, color sensing disorder, factors of gestalt grouping, monocular and binocular vision, visual illusions
1.1 Visual System

1.1.1 Some Modern Research

Let us consider some works focused on questions of visual perception. The work of [Pepperell, 2019] deals with the conceptual problems of visual perception with respect to the field of art. The author
is a painter, and some of his pictures, in the context of which the phenomenon of the perception of works of pictorial art is examined, are reproduced in the work (Figure 1.1.1). In the opinion of the author, science is based on rational descriptions, while art uses artful, aesthetic and philosophical concepts. The statement of Hermann von Helmholtz about the perception of the external environment is mentioned: the qualities of our senses are projected onto objects in the external environment, which seem to be colorful, red or green, cold or warm, with smell or taste. Even though we understand that these concepts are subjective and belong to the properties of our nervous system, art can help us understand the complicated problems of perception. In the opinion of the author, the perception of the real world and the perception of images of the real world represent an extremely knotty problem associated with the notion of rationality. The study of these problems may help us achieve progress in understanding perception in the context of art and of current scientific and intellectual breakthroughs.

The results of a study of neuroelectrical brain activity during the perception of a particular set of images are presented in the work of [Maglione et al., 2017]. Original pictures (Original set) and stylized pictures by Titian and by a contemporary painter were shown to a selected group of people with an art education. The stylized pictures were derived from the original ones by retaining either the color patches (Color set) or the contour forms (Style set) (Figure 1.1.2). A picture by the modern painter L.D. Kampan and its processed images are represented in the first row, and a picture by Tiziano Vecellio and the corresponding processed analogs in the second row. The stimuli were presented to the test subjects in random order. In the course of the experiment it was found that the Original set induced more emotions than the other sets of pictures during the first 10 seconds.
Figure 1.1.1 “Grey on orange”, “Flower”, “Drawing drawing” [Pepperell, 2019].
Figure 1.1.2 Example of the three groups of stimuli tested in the study: Original, column (a); Style, column (b); Color, column (c) [Maglione et al., 2017].
The emotions brought out by the Color and Style sets intensified within 30 seconds; for the Original set, however, the emotions held steady from the start. Evidence of a stream of cortical activity from the parietal and central regions to the prefrontal and frontal regions during the viewing of images of all data sets was detected throughout the whole experiment. This result is congruent with the idea that active perception of images, with sustained cognitive attention in the parietal and central regions, leads to the forming of judgments about their aesthetic value in the frontal region. It was found that different regions of the brain, including not only the frontal but also the motor and parietal cortices, take part in the evaluation of perceived stimuli. It should be noted that the prefrontal dorsolateral cortex is selectively activated only by stimuli considered to be attractive, while prefrontal activity in general accompanies the evaluation of both pleasant and unpleasant stimuli. The value of the research is that neuroelectric visualization may be used to obtain useful information on the evolution of the aesthetic judgment of people perceiving images, from simple stimuli to works of art.
The work of [Tikhomirov et al., 2018] is dedicated to the study of visual object agnosia for determining pathologies of the brain. Visual agnosia arises when the structures of the brain cortex responsible for the analysis and synthesis of information are damaged, leading to disruption of the perceptual process and of the recognition of visual stimuli. Contemporary views on the neuroanatomical and neurophysiological basis of the visual process are described. Medical cases of visual object agnosia and the peculiarities of neuropsychological diagnostics and post-hospital rehabilitation of patients are presented. Visual objects of different origin, such as object images, geometric figures, letters, words and faces with various emotional expressions, were used for testing. The basic test models were based on the modified methods of Wundt (Figure 1.1.3(a)), Stroop's method (Figure 1.1.3(b)) and Gottschaldt figures (Figure 1.1.3(c)). The error types, response duration and detection time were evaluated, and recommendations on the diagnostics of visual agnosia in clinical practice were worked out.
Figure 1.1.3 Examples of interactive realization of basic test models: (a) Wundt's method; (b) interference test by Stroop's method; (c) test by Gottschaldt figures [Tikhomirov et al., 2018]. (Panel titles: Find the number; Unicum; Twins; Floriculturist; Count the colors; Floriculturist: overload; Speed comparison; Polygons; Spatial comparison.)
Furthermore, we will analyze some main concepts relating to the biological evolution of the organs of visual sensing, the structural features of the human eye, and the processing and perception of visual information by the brain.
1.1.2 Light Perception

Photosensitive tissue is present even in the simplest organisms. The reaction to light is fundamentally different from the formation of a visual image: the visual structures of the simplest organisms only accumulate light using a photosensitive pigment [Shiffman, 2008]. An eye capable of forming an image appeared at later stages of evolution. The compound eye of arthropods consists of a bundle of conical elements (ommatidia, from the Greek for "small eye", Figure 1.1.4) containing a lens and a photosensitive pigment [Website istockphoto, 2020]. Each ommatidium registers the light incident from its own direction, resulting in an image composed of individual signals; the resulting image is therefore grainy. Such a compound eye is effective in detecting small changes in the visual field. The compound eye of the fly works on this principle, which is why a fly is not so easy to catch. A compound eye is effective for detecting closely spaced objects, whereas the eyes of vertebrates, by contrast, see well at a great distance. The eye of each biological species is maximally adapted to the conditions of its natural habitat and way of life.
Figure 1.1.4 Structure of the compound eye of a fly. (Labels: simple eyes; lens; pigment; visual cell; complex eye; nerve fiber.)

1.1.3 Vertebrate Eye Anatomy

Anatomically, the eyes of all vertebrates have a similar structure. From fish to mammals, the eyes of all vertebrates have a photosensitive layer, called the retina, and a lens to focus the image on the retina. Figure 1.1.5 presents a vertical section of the human eye. The eyeball is located in a depression of the skull and has a spherical shape with a diameter of about 20 mm. On the outside the eyeball is covered with the sclera, a white opaque sheath about 1 mm thick. On the front surface of the eye, the sclera passes into a transparent membrane, the cornea. The curved surface of the cornea provides the necessary refraction in the optical system of the eye. The cornea has no blood vessels; it receives nutrients from the capillaries and liquids surrounding it. Light rays are refracted by the cornea and focused by the lens onto the retina, located at the back of the eyeball. The vascular membrane (choroid) of the eye adjoins the sclera, has a thickness of 0.2 mm, and consists of the blood vessels that feed the eye. The anterior part of the choroid is a colored concentric disc called the iris. From a biological point of view, the iris contains a landscape filled with rings, dashes and specks; each person has more than 200 individual distinguishing features there. The iris is a disc-shaped, colored membrane consisting of two smooth muscles, located between the cornea and the lens [Fershild, 2004]. The pigmentation of the iris is determined by the concentration of melanin within it. The main function of the iris is to regulate the amount of light that enters the eye. In low light, the iris contracts and the pupil enlarges as a round black hole; in bright light the reverse process occurs: the iris stretches and the pupil narrows.
Figure 1.1.5 Vertical section of the human eye. (Labels: sclera; cornea; iris; pupil; lens; watery moisture (aqueous humor); vitreous humor; ciliary body; retina; choroid; central fossa; blind spot; visual nerve.)
The change of the pupil is controlled by two oppositely directed iris muscles: the sphincter and the dilator of the pupil. In an adult, the diameter of the pupil varies from 2 to 9 mm, which corresponds to a change in the area of the pupil of more than 20 times. Usually the pupil responds to changes in illumination reflexively; bright irritating light causes Whytt's reflex (described by the physiologist Robert Whytt in 1751). This reflex is used to diagnose diseases of the central nervous system, as well as to determine signs of life in a person during rescue operations. The human pupil is round, but the pupils of some species have a different shape: the pupils of the cat's eyes form a vertical slit, and nocturnal animals (including crocodiles) have such pupils. Nocturnal animals also have a reflective retinal layer called the tapetum (Latin tapete - carpet). It reflects part of the light that enters the eye, and it is this reflection from the tapetum that makes the eyes "glow" at night, as we readily notice in domestic cats. A similar effect in the human eye is known to us as the "red-eye effect" when taking pictures with a flash: in low light the pupil is dilated and, despite the speed of the pupillary reaction (about 0.25 s), it does not have time to constrict during the flash. In the "anti-red-eye" mode, cameras fire a warning flash to constrict the pupil. The lens divides the eye of vertebrates into two parts. The front part is filled with watery moisture (aqueous humor); it helps to maintain the shape of the eye and is involved in the metabolism of the cells of the cornea. The large chamber behind the lens is called the vitreous body and consists of a gelatinous protein. These transparent substances hold the lens in a certain position, and the elasticity of the eyeball ensures its resistance to mechanical damage. The refractive index of the liquids inside the eye is approximately equal to that of water; it is slightly higher for the cornea and lens. Unlike the constantly renewed aqueous humor, the vitreous body remains unchanged; opacities in it are sometimes noticeable as small "threads" floating randomly before the eyes.
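As a quick check of the twenty-fold figure, the ratio of pupil areas follows directly from the quoted diameters:

\[
\frac{A_{\max}}{A_{\min}} = \left(\frac{d_{\max}}{d_{\min}}\right)^{2} = \left(\frac{9\ \text{mm}}{2\ \text{mm}}\right)^{2} \approx 20.3
\]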
Experimental confirmation

The existence of a blind spot can be demonstrated using Figure 1.1.6. After completing a series of simple steps, you will see the visible image suddenly "disappear." Close the left eye and place the picture at a distance of about 25 cm. Focus the right eye on the cat while slowly moving the drawing closer and further away.
Figure 1.1.6 “Cage without bird”, the scheme for the detection of “blind spots”.
At a distance of approximately 15-30 cm (depending on the scale of the image) the bird will disappear and the cage will be empty. When the drawing is at this distance from the eye, the image of the bird in the cage falls on the blind spot and disappears completely. However, despite the disappearance of the bird, you will not see a white gap; instead, the area will be filled in with the bars of the cage [Abbasov, 2016].
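The quoted distance range is consistent with simple geometry. Assuming a typical blind-spot eccentricity of about 15 degrees temporal to the fovea (a textbook figure, not stated in this chapter), an image in which the bird is drawn a distance s from the fixated cat falls on the blind spot at a viewing distance of roughly

\[
d \approx \frac{s}{\tan 15^{\circ}} \approx 3.7\,s,
\]

so a separation of 4-8 cm between cat and bird gives d of about 15-30 cm.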
1.1.4 Projection Areas of the Brain

Through synaptic connections, each neuron can be connected to thousands of other neurons. Given that the human nervous system has about one hundred billion neurons, each of which is able to create thousands of synaptic connections, the total number of synaptic connections in the nervous system is on the order of a hundred trillion. To form the perception of incoming information, electrical signals from neurons enter the brain via the nerves, tracts and nuclei of the central nervous system. A nerve is a bundle of axons through which neural impulses are transmitted from one segment of the nervous system to another. Sensory information is transmitted by nerves to the central nervous system, consisting of the spinal cord and brain. In the central nervous system there are regions in which synaptic connections are formed by large groups of neurons, called nuclei. The main function of the nuclei is the processing and analysis of the received sensory information. One of the most important nuclei is the thalamus, located in the forebrain, below the center of its hemispheres [Kassan, 2011], [Gregory, 1970].
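In round numbers:

\[
10^{11}\ \text{neurons} \times 10^{3}\ \text{synapses per neuron} = 10^{14}\ \text{synapses, i.e., a hundred trillion.}
\]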
Signals from neurons enter certain parts of the cerebral cortex. The cerebral cortex (Latin cortex - tree bark) is a thin outer sheath of the brain hemispheres. Its thickness does not exceed 2 mm; owing to the convoluted shape of its surface, it occupies an area of about 1.5 square meters. The brain (Latin cerebrum, Greek ἐγκέφαλος) is the main organ of the central nervous system; in vertebrates its head end is located inside the skull. In the anatomical nomenclature of vertebrates, including humans, the brain as a whole is most often referred to as the encephalon. Highly specialized parts of the brain that are associated exclusively with particular sensory modalities are called the primary projection zones of the cortex. As shown in Figure 1.1.7, the main projection zones of all sensory systems lie within certain lobes of the cerebral cortex. A lobe is an anatomically distinguishable area of the cerebral cortex that performs a specific function. The primary projection zone for hearing is in the temporal lobe, and the primary projection zone for tactile sensations (the somatosensory cortex) is located in the parietal lobe. The primary projection zone for vision is in the occipital lobe (the striate cortex), and the olfactory bulb, located below the temporal lobe, above the sinuses, is responsible for the perception of odors.
Figure 1.1.7 Projection zones of the left hemisphere of the brain. (Labels: frontal lobe; parietal lobe; occipital lobe; temporal lobe; central sulcus; lateral sulcus; area controlling arbitrary movements; area of tactile sensitivity; the main center of speech perception; the motor center of speech; area of hearing perception; the area of visual perception; the area of sensory, visual and auditory integration; cerebellum; brain stem.)
1.2 Eye. Types of Eye Movement

1.2.1 Oculomotor Muscles and Field of View

The human eyes are located in depressions of the skull and are controlled by three pairs of muscles, called the oculomotor muscles (Figure 1.2.1). Eye movements allow us to keep images of moving objects on the retina and to follow objects without turning the head. The central fossa, the macula and the peripheral part of the retina are all involved in the process of vision, so the eye is a wide-angle optical system [Shiffman, 2008], [Kassan, 2011]. With the eyes fixed, the field of view of one eye extends 70° downwards, 60° upwards, 60° towards the nose and 90° towards the temple, while a sharp image is provided only by the area of the macula (yellow spot), within 6-8°. The eyes of many animal species are unable to move autonomously. For example, the eyes of a night owl are so large for its small skull that they almost touch each other. Because of this, the owl's eyes are motionless, and in order to receive visual information it has to turn its head.
1.2.2 Visual Acuity

Visual acuity refers to the ability to recognize small details. There are five main types of acuity: detection, localization, resolution, recognition and dynamic acuity.
Figure 1.2.1 Human oculomotor muscles. (Labels: block (trochlea); upper slanting; frontal bone; upper straight; tendon ring; visual nerve; side straight; bottom straight; upper jaw; lower oblique.)
Detection acuity is characterized by the detection of an object in the field of view; typically one must detect a small object of a certain size against a darker background. Localization acuity is the ability to tell whether two lines whose ends are in contact form one single solid line or are offset (Figure 1.2.2, left). The resolving power of the eye is the ability to perceive the boundary between discrete elements (Figure 1.2.2, right); as they come closer together, the visual stimuli merge into one. Testing of resolution acuity is usually performed using Landolt rings (Figure 1.2.3, right). To determine recognition acuity, a letter-based test is used, known as the Snellen table, which we encounter in the optometrist's office (Figure 1.2.3, left). Dynamic acuity is determined by the ability to detect a moving stimulus and track its movement; it decreases as the speed of movement of the visual stimulus increases. The recognition of a visual stimulus depends on the size of the stimulus and on the distance to it. These two parameters are combined in the concept of the visual angle: the visual angle is the size of the projection of the stimulus onto the retina. The visual angle is measured in degrees; one can also use another subjective parameter, the rule of thumb. The width of a thumb of average size at arm's length subtends a visual angle on the order of 2°. Using this rule, it is quite easy to determine the approximate visual angles of distant objects; for example, the moon or the sun subtends an angle of about 0.5°.
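The visual angle follows from object size and viewing distance by the standard formula θ = 2·arctan(size / (2·distance)); a minimal sketch reproducing the two examples above (the thumb and arm-length measurements are illustrative values):

import math

def visual_angle_deg(size, distance):
    # Visual angle (degrees) subtended by an object of a given size
    # at a given distance; both must be in the same units.
    return math.degrees(2 * math.atan(size / (2 * distance)))

# A thumb ~2 cm wide at arm's length (~57 cm): about 2 degrees.
print(visual_angle_deg(2, 57))         # ~2.0

# The Moon: diameter ~3474 km at ~384400 km: about 0.5 degrees.
print(visual_angle_deg(3474, 384400))  # ~0.52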
Figure 1.2.2 Determination of the sharpness of the resolution of the eye.
Figure 1.2.3 Snellen table and Landolt rings.
Figure 1.2.4 The stylized Landolt rings without noise and with different noise levels (0%, 10%, 20%) [Eremina, Shelepin, 2015].
The work of [Eremina, Shelepin, 2015] is dedicated to the study of visual perception in patients after cardiac surgery. The article describes the results of research into the peculiarities of visual perception and the cognitive processing of visual images by patients with ischemic heart disease. One hundred and seven middle-aged patients (about 62 years old) took part in the study. It was demonstrated that the capability to recognize fragmented images (exemplified by the Landolt rings of Figure 1.2.4) declined immediately after the surgery. Within three months, however, this capability not only recovered but surpassed the pre-surgical level; moreover, three months after surgery the operated patients coped better with the offered tasks than patients receiving non-surgical treatment.
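A hypothetical sketch of how such graded pixel noise could be imposed on a stimulus image; the exact noise model of [Eremina, Shelepin, 2015] is not specified here, and the function name and parameters are illustrative:

import numpy as np

def add_pixel_noise(image, fraction, rng=None):
    # Replace a given fraction of pixels with random gray levels,
    # mimicking the graded noise levels of Figure 1.2.4.
    rng = rng or np.random.default_rng(0)
    noisy = image.copy()
    mask = rng.random(image.shape) < fraction
    noisy[mask] = rng.integers(0, 256, size=int(mask.sum()))
    return noisy

Calling it with fraction = 0.0, 0.1 and 0.2 would reproduce the three noise levels shown in the figure.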
1.2.3 Types of Eye Movement

Our brain usually tries to avoid emptiness, monotony and an aggressive visual environment. The visual system highlights certain areas of the surrounding scene and steers the eye to examine their details. The angle of clear vision is only about 2° of the visual field, and the oculomotor muscles constantly make various movements to search out small objects. These movements direct the gaze in such a way that the visual stimulus is projected onto the central fossa, since this part of the retina is characterized by the most acute vision. Our eyes can both be fixed on an object and move continuously. Let us consider some publications in this area that describe the main types of eye movement. The article of [Butenko, 2016] presents a brief description of the types of eye movements and analyzes the methods and systems for registering oculomotor activity. The movement of the eyes is a natural component of visual perception; the eyes make micromovements even when the gaze is relatively still. Eight types of eye movements are commonly distinguished, belonging to the micro- and macromovements. The macro eye movements change the direction of the gaze and can be controlled.
They are divided into macrosaccades (sudden changes of gaze direction), accompanying movements (the smooth drift of the gaze after a moving fixation object), vergence movements (convergence and divergence of the visual axes), nystagmus (oscillatory eye movements together with the accompanying movements) and torsional movements (rolling of the eye about the visual axis). Micro eye movements are the natural background of oculomotor activity and cannot be controlled. They are divided into tremor (high-frequency eye vibrations), drift (a slow drift of the gaze interrupted by micro jumps) and microsaccades (rapid eye movements occurring when the fixation point changes). Oculography (eye tracking) is the determination of the eye position and of the meeting point of the optical axis of the eyeball with the plane of the observed object. An eye tracker is a device used to determine the orientation of the optical axis of the eyeball in space. Eye trackers are applied in the study of the visual system as well as in psychology and cognitive linguistics. Tracking methods can be divided into two groups: contact and noncontact. The first group comprises electrooculography and the photooptic and electromagnetic methods; the second group includes the photoelectric method and video recording. The electromagnetic method is based on registering the change of an equivalent voltage into which any eye movement is converted. An inductive emitter is fixed with a suction cup (contact lens) on the eyeball, and receiver coils are placed around the head. The emitter establishes an alternating electromagnetic field in the receiver coils; movement of the emitter changes the field strength, and the amplified signal is fed to the input of recording oscillographs. The photooptic method is based on recording reflected light: a suction cup carrying a miniature mirror, from which a narrow beam of light is reflected onto the input of a light-beam photographic recorder, is set on the eyeball. The photooptic method allows analysis of the microstructure of oculomotor activity. Electrooculography is based on measuring potential differences in the tissues adjacent to the eye socket; eye movements are recorded by means of electrodes placed around the eye pits. The potential marker indicates the gaze direction, and the change in potential difference indicates the rotation angle. The photoelectric method is based on transforming an infrared light beam reflected from the cornea into an electric signal: the amount of reflected light changes as the eyes move, and the photocurrent changes accordingly. Video recording embraces two interdependent procedures: the video recording of the eyes of the test person
and the programmed determination of gaze direction on each frame of the video sequence. The pupil edge or center, the blood vessels of the sclera or the corneal reflection serves as the source of information about gaze direction. Another kind of video recording method involves illuminating the eye with a point source of infrared light and high-speed shooting with an infrared video camera. This method is used in the devices produced by the Tobii Company, such as Tobii REX, Tobii EyeX and TheEyeTribe [Website tobii.com, 2020]. Let us consider some current articles concerned with the study of oculomotor activity. The work of [Wegner-Clemens et al., 2017] is dedicated to the study of gaze fixation in the course of viewing faces. Although the human face features many visual peculiarities, observers prefer to fix their eyes mainly on the eyes and mouth. This is explained by the evolution of social signs of recognition and communication in human society. Experimental findings for 41 participants looking at four faces with different starting stimuli are represented (Figure 1.2.5). The blue ellipses on the face show the position of each particular eye fixation, and the size of every ellipse is proportional to the fixation pause. The degree of eye fixation on the face, shown on the left side of Figure 1.2.5, is presented in the form of a thermal map. The degree of eye fixation on faces is a useful tool for evaluating individual differences and producing recommendations for social communication. The article of [Ito et al., 2017] provides a new method to analyze eye movements during natural viewing. The method is based on the fact that the eyes move from and to the objects of the visual scene.
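As a minimal sketch of the video-recording approach described above (not the Tobii implementation; the threshold value and overall pipeline are assumptions), the pupil center in an infrared frame can be estimated by isolating the darkest blob and taking its centroid:

import cv2

def pupil_center(gray_frame, threshold=40):
    # Under IR illumination the pupil is the darkest blob, so a
    # threshold-and-centroid pass is often enough as a first estimate.
    # The threshold is camera-dependent. Requires OpenCV 4.
    _, binary = cv2.threshold(gray_frame, threshold, 255,
                              cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    pupil = max(contours, key=cv2.contourArea)  # largest dark blob
    m = cv2.moments(pupil)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])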
Figure 1.2.5 Comparison of methods for determination of regions of interest and principal component analysis [Wegner-Clemens et al., 2017].
The method was applied experimentally to two macaques freely viewing visual stimuli in the form of scenes. The analysis revealed that the monkeys showed a behavioral shift from free viewing to focal processing of the visual stimulus. The work of [Barabanschikov et al., 2010] analyzes oculomotor activity and the perception of expressions depending on the spatial orientation of the face. It is demonstrated that rotating a face image by 180° reduces the efficiency of recognition of the intensity of emotions. Weak facial expressions in the inverted image are perceived as a calm state. This process depends on the modality of the emotions and has a complex nonlinear nature. "Fear" and "grief" are identified worst of all, and the calm facial expression best and most stably. As the intensity of the emotions on an upright face image increases, an effect of left-sided dominance is registered, and for the inverted image the effect size increases (Figure 1.2.6). The dominance in perceiving weakly expressed emotions, however, is right-sided. The work of [Basyul et al., 2017] is dedicated to the study of oculomotor activity in face recognition. Thirty-two test subjects of Tuvan nationality took part in the experiment, evaluating the similarity of target objects presented in pairs on a five-grade scale. Pairs of photos of human faces served as the stimulus material; they were assembled from two original photos (of Caucasoid and Asian type) and four intermediate photos obtained by morphing in increments of 20% (Figure 1.2.7). The oculomotor activity of the test subjects was registered by means of an EyeTribe eye tracker while the stimulus material was presented. The relevance of the research is determined by questions of public security and business communication. Let us consider the main types of micro- and macromovements of the human eye used in everyday life. Saccades. The most frequent eye movement is the saccade (French saccader - to jerk). A saccade is an abrupt, discontinuous movement of the eyes, a quick transfer of the gaze from one object to another. Saccades can be small (less than 3° of the visual field) or large (about 40°). A macrosaccade is an abrupt change in the position of the eye performed with high speed and accuracy. Usually the frequency, angular velocity and direction of the gaze are determined by the nervous system in advance. In order to avoid undesirable effects, saccadic movements are performed extremely quickly; the muscles performing them are among the fastest in the body. Usually the number of saccades is up to three per second. Saccades can be reflexive or controlled, and they are also performed with the eyes closed.
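Raw gaze samples from an eye tracker are commonly segmented into fixations and saccades with a velocity threshold; a minimal sketch of the standard I-VT scheme (the 30 deg/s default is a common choice in the eye-tracking literature, not a value from this chapter):

import numpy as np

def classify_ivt(x, y, t, velocity_threshold=30.0):
    # Velocity-threshold (I-VT) classification of gaze samples.
    # x, y: gaze coordinates in degrees of visual angle; t: time in
    # seconds, assumed strictly increasing. Samples faster than the
    # threshold (deg/s) are labeled saccades, the rest fixations.
    dt = np.diff(t)
    velocity = np.hypot(np.diff(x), np.diff(y)) / dt
    labels = np.where(velocity > velocity_threshold,
                      "saccade", "fixation")
    return labels  # one label per inter-sample interval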
Figure 1.2.6 Examples of oculograms of different test subjects who perceive the expressions of face on the direct-viewing and reversed images [Barabanschikov et al., 2010].
Figure 1.2.7 Transition row of face images (morphing proportions 100/0, 80/20, 60/40, 40/60, 20/80, 0/100) from which the goal stimulus pairs were formed [Basyul et al., 2017].
Saccades are used primarily for examination and study of the visual field. They are especially important when performing such visual tasks as reading or viewing paintings, photographs and portraits (Figures 1.2.8, 1.2.9). Eye movements when viewing an image ensure that its various parts fall into the zone of the central fossa, which makes it possible to examine the details. The points at which the gaze is fixed are not accidental: they are the most informative places of the image, the places where the most important features are located. For example, when scanning a face in a photograph, the set of fixation points falls on the areas where the eyes, nose and mouth are located. It is well known that the pattern of eye movement during reading is intermittent. While reading, the eyes perform a series of saccades alternating with pauses (fixations) and some return movements (regressions). The reading process itself takes place precisely during fixations; during saccadic movements, vision is functionally blocked.
Figure 1.2.8 Schemes of eye movement when viewing various images [Yarbus, 1967].
Figure 1.2.9 A reproduction of the painting "Didn't Wait" by the Russian artist I. Repin [Repin, 2020] (left), used to determine oculomotor activity (right) [Yarbus, 1967].
The reader seeks to fix the gaze on the most meaningful fragments of the text; as a rule the word is the object of fixation. When reading at normal font sizes, we can recognize on average 4-5 letters to the left of the fixation point and 8-11 to the right. The spaces between the words play the role of visual delimiters; they are crucial for the algorithm of eye movements, and reading speed decreases significantly without them. Try to read the text typed without spaces presented in Figure 1.2.10. At first this requires fixing the gaze on each letter, but if you continue reading, habituation occurs: the eye performs shorter saccades, and the reading comfort increases.
Figure 1.2.10 Text without spaces between words.
Eye tracking. Tracking eye movements are reflexive and occur when following a moving stimulus. Unlike saccades, these movements are smooth and slow. Usually the speed of tracking movements is determined by the speed of the stimulus; this ensures stabilization of the image on the retina. Vestibulo-ocular movements. When we change the position of the head and body in space, we continue to perceive our surroundings as stable. This is achieved through compensatory movements of the eyeballs, which allow a stable image to be maintained. These reflexive movements are the vestibulo-ocular movements. They are stimulated by the vestibular apparatus of the middle ear, where the sensory system for determining the location of the body in space is located. During physical activity, the eyes perform precise movements compensating for both body movements and head movements. Vergent eye movements. Sometimes coordinated movements of both eyes are needed; these movements are called vergent. Vergent movements move the eyes horizontally in opposite directions in such a way that the visual axes converge or diverge (convergence and divergence). This allows both eyes to focus on the same object. Such eye movements are characteristic of primates, in which the eyes are frontally positioned and the field of view has binocular overlap. Vergent movements can be observed when one is asked to direct the gaze at the tip of one's own nose. Eye micromovements. When the gaze is fixed on a stimulus, a number of reflex movements called micromovements of the eyes (tremor) can be observed. As a result of micromovements, the axis of the eye describes a closed figure in the form of an ellipse. This is the natural motor background of the activity of the oculomotor muscles, which is not consciously controlled. With the help of special devices these micromovements can be registered. In the process of fixation the eyes are in constant motion; if involuntary small movements of the eyes are completely excluded, the image of the stimulus on the retina begins to blur and disappear. Mixed movements. The visual perception of the environment usually occurs through a combination of different types of eye movements; these are mixed movements. For example, observation of a moving object requires smooth tracking movements as well as saccadic and vergent movements. Effective eye movements are achieved at a certain level of development of the oculomotor muscles. In children of 4-5 years of age, eye movements differ from those of adults, so their vision is less effective; it is difficult for them to fix their gaze on a particular object. When asked to fix their gaze on a small, bright, stationary object in a dark room, their line of sight scanned an area 100 times larger than in adults. Children do not anticipate changes in the direction of movement of an object; these abilities are formed gradually with practice. It should be noted that the concept of tracking is used to control attention in advertising design, when the eye of the observer is directed to a certain part of the visual field. In web design, F-pattern-based navigation is highlighted (Figure 1.2.11). As consecutive lines are read, the number of fixations decreases, and their duration also decreases, for any line length and any line spacing.
Figure 1.2.11 F-Pattern navigation.
Figure 1.2.12 Comparison of the certainty value of a web page with saccadic estimation: (a) web pages; (b) thermal map of eye fixations [Xia, Quan, 2020].
The movement of our gaze usually covers the right side of the visual field, narrowing as the gaze moves down the text. Our attention avoids non-informative images consisting of small homogeneous elements; they can cause unpleasant sensations and illusions. The work of [Xia, Quan, 2020] reports a study on modeling users' attention while looking through web pages. A saccadic search model of the dynamic visual behavior of people is offered in the work, using multilevel analysis of visual features and estimation of the probability of the user's eye fixations. The experimental eye-tracking results for free web page viewing showed the efficiency of the offered method (Figure 1.2.12).
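Fixation "thermal maps" like the one in Figure 1.2.12(b) are typically built by accumulating dwell time at each fixation point and then blurring; a minimal sketch under that assumption (the function name and the sigma value are illustrative, not from [Xia, Quan, 2020]):

import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_heatmap(fixations, shape, sigma=25.0):
    # fixations: iterable of (x, y, duration) tuples in pixel
    # coordinates; shape: (height, width) of the viewed page;
    # sigma: Gaussian spread in pixels, a stand-in for foveal extent.
    heat = np.zeros(shape)
    for x, y, duration in fixations:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            heat[int(y), int(x)] += duration  # weight by dwell time
    return gaussian_filter(heat, sigma=sigma)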
1.2.4 Effects of Masking and Aftereffects

If visual stimuli occur in close sequence (in time or in space), they can overlap and mask each other's perception. The cause of the masking effect is the inertia of vision due to the slowness of the neural response to stimulation, as a result of which the response to a stimulus can persist even after its disappearance. During saccades, the flow of visual information stops for a short time, but we do not notice this saccadic suppression. If we look at ourselves in the mirror to detect these movements, we will not see any movements, jumps or blurring; but looking at other people performing the same actions, we will see how their eyes move. The cause of this saccadic "blackout" (empty field) is masking effects. With long-term observation of a stimulus, not only does the inertia of vision manifest itself; adaptation, a temporary "fatigue" of the stimulated portion of the retina, also occurs. These effects can be demonstrated by the tilt aftereffect, based on the experiment of J. Gibson (Figure 1.2.13) [Shiffman, 2008], [Abbasov, 2019]. To do this, cover the vertical grating (Figure 1.2.13, right) and for 40-60 seconds fix your gaze on the point between the inclined gratings (Figure 1.2.13, left). Then open the right side and quickly shift your gaze to the point between the two vertical gratings (Figure 1.2.13, right). For some time you will find that the vertical lines appear inclined in the direction opposite to the inclination of the grating lines in Figure 1.2.13, left. The tilt aftereffect manifests itself as a short-term distortion of the perceived orientation of visual stimuli, arising from adaptation to a previous, differently oriented stimulus.
Figure 1.2.13 Tilt aftereffect.
You can also demonstrate the curvature aftereffect. It manifests itself as a distortion of the perceived shape of a stimulus due to the prolonged effect of a previous stimulus of a different shape or curvature. This phenomenon is illustrated by the experiment presented in Figure 1.2.14. Similarly, for 40-60 seconds fix your gaze on the point between the two curves (Figure 1.2.14, left), then transfer the gaze to the fixation point between the straight lines (Figure 1.2.14, right). The straight lines will seem curved in the direction opposite to the curvature of the curves.
Figure 1.2.14 Curvature aftereffect.
1.2.5 Perception of Contour and Contrast
Among current works in the field of contrast perception, the detailed survey of [Ghosh, Bhaumik, 2010], which analyzes the historical and philosophical aspects of brightness perception, deserves mention. Although these questions have been studied for 200 years, the mechanism behind some optical illusions has not been fully determined. The neuronal mechanism of perception, from the visual receptors to the cerebral cortex, is described on the basis of bottom-up processes (Figure 1.2.15). The internal test squares, all of the same brightness, form a contrast against the gradient background (Figure 1.2.15, left). Moreover, the contrast effect is retained against a background of solid-colored vertical stripes (Figure 1.2.15, right). The perception of brightness is analyzed according to two philosophical tendencies: the idealistic and the materialistic approaches. The theory of contrast is based on lateral inhibition, likened to the principal dialectical laws: the unity of contradictions, interrelation, the transition from quantity to quality, the negation of the negation. Some recent research in experimental psychophysics and mathematical modeling builds on this. To ensure the adequacy of such models, complex psychical processes must be represented quantitatively on the basis of the dialectical law of interaction between the part and the whole.
Contours play a significant role in the perception of visual information. Using contours, we recognize the shapes, edges and borders of surrounding objects. The neural mechanisms of contour perception can be studied using the visual system of the horseshoe crab. The eye of the horseshoe crab is structurally complex and consists of about 1,000 receptors (ommatidia). Each receptor has its own neuron and optic nerve fiber, which responds to the light signal independently of the adjacent receptors. They are not directly connected; stimulation of one receptor is not transmitted to the next. However, at a higher neuronal level, adjacent receptors are connected by lateral nerve fibers, and simultaneous stimulation of neighboring receptors results in the summation of their activity.
Figure 1.2.15 Brightness assimilation [Ghosh, Bhaumik, 2010].
Illumination of one receptor reduces the sensitivity of its neighboring receptors: lateral inhibition occurs. When illuminated simultaneously, each receptor responds less actively than in the case of individual stimulation. The ganglion cells of the human retina function similarly; they have complex interrelationships and are not excited in isolation. Through lateral inhibition, neural connections affect each other's activity, thereby ensuring a clear perception of edges and boundaries.
1.2.6 Mach Bands, Hermann's Grid
To demonstrate the effect of lateral inhibition, consider the stepwise gray gradient in Figure 1.2.16. The left side of each vertical rectangular strip appears a little lighter than its right side, which produces an increase in edge contrast. However, each strip has the same lightness; it is filled with a uniform gray. This can easily be verified by examining each rectangle in turn while covering the others. The effect of the apparent change in lightness at the edges is named after the 19th-century Austrian physicist Ernst Mach, who first described this phenomenon [Abbasov, 2016].
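The edge enhancement Mach described can be reproduced with a one-dimensional model of lateral inhibition: each receptor's output is its own excitation minus a fraction of its neighbors'. The sketch below is a minimal illustration; the inhibition weight and radius are assumed values, not parameters from the literature.

```python
import numpy as np

# Lateral inhibition over a one-dimensional staircase luminance profile.
# Weight and radius are illustrative assumptions.
luminance = np.repeat([0.2, 0.4, 0.6, 0.8], 50)  # four uniform gray steps

def lateral_inhibition(signal, weight=0.05, radius=3):
    out = signal.astype(float).copy()
    for shift in range(1, radius + 1):
        # Each point is inhibited by its neighbors on both sides.
        # (np.roll wraps at the array ends; ignore the borders here.)
        out -= weight * np.roll(signal, shift)
        out -= weight * np.roll(signal, -shift)
    return out

response = lateral_inhibition(luminance)
# 'response' overshoots on the bright side of each luminance step and
# undershoots on the dark side, mirroring the light and dark Mach bands.
```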
Spatial frequencies
Contrasting areas of the surrounding field of view can be characterized by spatial frequency, i.e., the number of luminance variations over a given extent of space. For experimental confirmation, consider Figure 1.2.17. The upper left grating has a relatively low spatial frequency (wide bands), and the
Figure 1.2.16 Mach bands.
Figure 1.2.17 Spatial frequency.
lower one has a higher frequency (narrow bands). The spatial frequencies (band widths) of the gratings in Figure 1.2.17, on the right, are identical and occupy an intermediate position. Cover the gratings on the right and, for at least 60 s, carefully examine the gratings on the left, fixing your gaze on the central horizontal strip between them. After the adaptation period, shift your gaze to the strip in the center between the two gratings on the right. The spatial frequencies will no longer seem identical: the spatial frequency of the upper grating will seem higher (denser) than that of the lower grating [Shiffman, 2008], [Gusev, 2007]. Consider now the effects based on the phenomenon of lateral inhibition; these include Hermann's grid and light contrast. Figure 1.2.18 shows the Hermann grid (described by the German physiologist Ludimar Hermann in 1870); it consists of a white square grid pattern on a black background. The lightness of the
Figure 1.2.18 Hermann’s grid.
white stripes is the same along their entire length; however, phantom gray spots appear at their intersections. They are due to the suppression of the neural activity of neighboring cells of the retina. If we concentrate our gaze on a single intersection, its gray spot disappears. In this case, the image is projected onto the central fovea, and gray spots appear at the other crossings, which are projected onto the peripheral areas of the retina with high sensitivity.
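A Hermann grid stimulus is simple to synthesize, which makes it convenient for experimenting with bar width and viewing distance as discussed below. A minimal sketch (cell and bar sizes are arbitrary choices):

```python
import numpy as np

def hermann_grid(cells=5, block=60, bar=12):
    """Black squares separated by white bars; returns values 0.0/1.0."""
    size = cells * block + (cells + 1) * bar
    img = np.ones((size, size))                # white bars / background
    for i in range(cells):
        for j in range(cells):
            y = bar + i * (block + bar)
            x = bar + j * (block + bar)
            img[y:y + block, x:x + block] = 0  # black square
    return img

grid = hermann_grid()
# Viewed at normal size, gray phantom spots appear at the bar crossings;
# shrinking the image (or viewing it from farther away) makes the spots
# persist even at the fixated crossing, as described in the text.
```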
Figure 1.2.19 “Complementary” grid of Hermann.
You can also observe colored spots at the crossings; for this, choose a grid and a background complementary in color. Such a color grid is shown in Figure 1.2.19: blue spots appear on the yellow crosshairs [Abbasov, 2019]. When Hermann's grid is scaled down, for example by moving the pattern farther away so that the projection of the grid on the retina shrinks, the "phantom" spots become more stable. From a long distance the stripes narrow, and the phantom spots are visible regardless of gaze fixation.
1.2.7 Light Contrast
A further example of the spatial interaction of neighboring areas of the retina is light contrast: the lightness of a small closed figure depends on the lightness of the massive background area (Figure 1.2.20) [Abbasov, 2016]. In terms of the light they reflect, all four central gray circles are identical; however, they appear to differ in lightness. A circle on a dark background (left edge) seems lighter than a physically identical circle on a light background (right edge). Thus the perceived lightness of a surface depends on the intensity of its background.
1.2.8 Object Identification
There are several theories of identification based on the distinctive features of surrounding objects. At the initial stage of perception, the incoming information is processed rapidly, which allows the basic, very simple and noticeable distinctive features of an object, the so-called perceptual primitives, to be picked out. Surfaces differ from each other in the simplest elements of texture, the textons: the specific distinguishable characteristics of the elements forming the texture. In Figure 1.2.21, one can observe a textural background made of the letters P, Б, Ь; however, finding the letter B against such a background requires concentrated attention.
Figure 1.2.20 Light contrast.
Figure 1.2.21 Texture pattern of various letters.
According to another theory, object recognition begins with the processing of information about a set of primitive distinguishing features. Any object of three-dimensional space can be decomposed into a number of geometric primitives (geons: sphere, cube, cylinder, cone, pyramid, torus, etc.). By combining primitives and intersecting their surfaces, one can create new objects or analyze existing ones. Similarly, any letter can be obtained from a set of lines and curves. According to the theory of geons (geometric ions) by Biederman [Shiffman, 2008], a set of 36 geons is enough to describe the shape of all the objects a person is able to recognize. According to experiments, an object is recognized when its geons can be perceived. Usually, the description of an object includes not only its features but also the relationships between its constituent parts. After the shape of the object is described, it is compared with the array of geon descriptions stored in memory, and the best match is found.
1.2.9 Color Vision Abnormalities
Color vision is normal for most people, but some have certain anomalies. In people with color vision abnormalities, the quantitative ratio of primary colors differs from that of normal color vision. Anomalies of color vision are usually hereditary and are associated with a lack of cones of a certain type. Genetic decoding has shown that each cone type contains a photopigment encoded by its own gene. The photopigment genes are found in the X chromosome; women inherit one X chromosome from their mother and one from their father.
To ensure normal color vision, at least one chromosome must contain the genes for normal photopigment synthesis. Males inherit the X chromosome from the mother and the Y chromosome from the father. If the only X chromosome does not contain the gene for normal photopigment synthesis, the son will have an anomaly of color vision. If a color anomaly has arisen in a woman, she has two defective X chromosomes, and all her sons are doomed to a color anomaly. Thus, in the inheritance of color vision anomalies, the genetic mechanism does not work in favor of men. Color vision anomalies exist in 8% of males and 0.5% of females. One of the first descriptions of a color vision anomaly was given in the eighteenth century by the English chemist John Dalton, who discovered by chance that he himself suffered from a color perception abnormality. During a ceremony he donned a crimson mantle instead of a black academic one. He saw the blush on the cheeks of his girlfriend as green spots; the world was painted for him in a marsh-brown range. Since then, anomalies of color vision (color blindness) have become known as Daltonism. To explain his anomaly, Dalton suggested that it was caused by pathological staining of the vitreous body, which played the role of a filter. He made a will according to which, after his death, his eyes should be dissected in order to test the theory experimentally. Subsequent studies did not confirm Dalton's theory, but the scientist's eyes are still kept in the Manchester Museum in Great Britain [Fershild, 2004]. There are three types of color vision abnormalities: abnormal trichromatism, dichromatism and monochromatism. Normal color vision is three-component, and such color vision is called normal trichromatism. People with abnormal trichromatism need different quantitative ratios of the primary colors. There are the following forms of abnormal trichromatism:
–– protanomals (protanopes) (Greek protos – first);
–– deuteranomals (deuteranopes) (Greek deuteros – second);
–– tritanomals (tritanopes) (Greek tritos – third).
Protanomals have an insufficient number of long-wavelength L cones, so they are not sensitive enough to shades of red. Dalton himself suffered from this anomaly. According to Hering's theory, their red-green opponent mechanism cannot be realized, with the result that they are unable to distinguish between reddish and greenish shades. In deuteranomals, sensitivity to green tones is reduced as a result of a lack of M cones; they also have difficulty distinguishing reddish from greenish tones. In tritanomals, there is low sensitivity to the violet tones
characteristic of short-wavelength light, owing to an insufficient number of S cones, which are less common than the other types. Tritanomals cannot distinguish between yellowish and bluish shades. Figure 1.2.22 shows some test images for detecting color vision anomalies. For people with normal color vision, various combinations of numbers and geometric shapes are visible in the pictures [Abbasov, 2019]. People suffering from dichromatism need only two primary colors to reproduce all color tones, instead of three like people with normal color vision. Monochromatism is also encountered, an extremely rare defect of color vision. To reproduce all the color tones of the spectrum, monochromats need only one primary color. People with such an anomaly can truly be called "color blind"; usually they do not see any colors at all. Anatomically, the retinas of deuteranomals, protanomals and tritanomals differ from each other in the number of cones containing blue, green and red pigments. The cause of a color vision defect is the relative lack of a specific cone pigment. Many people with color vision defects are not aware of them until a certain point. Some well-known artists also suffered from these anomalies (for example, the Russian painter M. Vrubel with his gray-pearl palette, or the French graphic artist C. Meryon). It should be noted that there is also the phenomenon of subjective colors. Under certain conditions, black and white stimuli can cause a sensation of color. Intermittent stimulation can cause neural processes that mimic the effects of colored stimuli.
Figure 1.2.22 Test images for determining color blindness.
Subjective Color Sensations
Experimental confirmation [Shiffman, 2008]: for 30 seconds, carefully examine the center of the rectangle with the diagonal pattern (Figure 1.2.23). You will begin to notice weak, unstable colored "streams" moving perpendicular to the black diagonal lines. Due to the micromovements of the eyeball, the image of the diagonals on the retina constantly moves, causing rhythmic activity of the higher-level neurons of the visual system. Such patterns are not found in nature, so the visual system is driven into an unstable state. This effect is most likely due to the fact that the size of the retinal projection of the structural parts of the image becomes comparable to the size of the cone receptors themselves. The visual system then goes into an unstable resonance mode, which causes illusions in the processing and perception of this visual information by the brain. Depending on the viewing distance, and hence on the retinal image size, this effect weakens or intensifies [Abbasov, 2016]. The work of [Doliotis et al., 2009] considers the phenomenology of color blindness: up to 8-12% of men and 0.5% of women in European countries suffer from this abnormality of color sensation. Hence the need to develop methods of modifying digital color images to improve their perception by people with defective vision has
Figure 1.2.23 Achromatic pattern that causes subjective color sensations.
Figure 1.2.24 Top: an illustration based on dark shades of red and green, and the same illustration in shades of gray. Bottom: a color illustration which in the gray version turns into a single-color image (horizontal strip) [Abbasov, 2013].
arisen. The article presents data on methods of image modification for color-blind people of the first category (protanopes, who are insensitive to shades of red). The colors that cause difficulty for protanopes are assigned to a separate group, and the images are corrected and adapted for their perception [Itten, 2001]. Image processing time is reduced through color quantization. Slightly modified, this method may also be used for other types of color impairment; it can be applied to deuteranopes, who have difficulty distinguishing shades of green. Figure 1.2.24, at the top, shows an example of an illustration based on dark shades of red and green. These shades are difficult to distinguish (for the author of this chapter as well), especially over a small area of the image; note that they also have the same tonality in the gray version. In the center, for contrast, shades of blue, yellow and white are shown. Figure 1.2.24, below, is a color illustration which in the gray version turns into a single-color image (horizontal strip). This emphasizes the important role of color vision in human life. Incidentally, the colors of the adjacent areas of the image are induced on this gray strip, although it is simply gray [Abbasov, 2013].
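Recoloring methods of the kind developed in [Doliotis et al., 2009] generally begin by simulating how an image looks to a dichromat. A common simplified protanope simulation applies a fixed mixing matrix in linear RGB; the sketch below uses approximate coefficients in the spirit of Viénot-style reductions, and both the matrix values and the test colors are assumptions for illustration, not the procedure of the cited work.

```python
import numpy as np

# Approximate protanope simulation in linear RGB. The matrix values are
# illustrative assumptions (roughly Vienot-style), not taken from
# [Doliotis et al., 2009].
PROTAN = np.array([
    [0.112,  0.888, 0.000],   # R' loses most of the L-cone (red) signal
    [0.112,  0.888, 0.000],   # G' becomes identical to R': red-green collapse
    [0.004, -0.004, 1.000],   # B' is almost unchanged
])

def simulate_protanopia(rgb):
    """rgb: float array (..., 3), linear values in [0, 1]."""
    return np.clip(rgb @ PROTAN.T, 0.0, 1.0)

# Two colors chosen near a protan confusion line map to nearly the same
# output, i.e., a protanope can barely tell them apart:
reddish = np.array([0.80, 0.20, 0.20])
greenish = np.array([0.10, 0.29, 0.20])
print(simulate_protanopia(reddish))
print(simulate_protanopia(greenish))
```

A recoloring method then shifts such confusable colors apart, for example along the blue-yellow axis, until their simulated versions become distinguishable.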
1.3 Perception of Figures and Background
The work of [Buhmann et al., 1999] is dedicated to methods of image and scene recognition in computer vision. Our visual system can reconstruct three-dimensional figures from a flat image. Computer vision requires more than recognition algorithms: lighting, shadows and scene perspective must also be taken into account. Our perception of the surrounding world allows us to resolve such ambiguities. The work proposes applying the gestalt rules of object grouping to scene recognition in computer vision. At the initial level, images are analyzed from the whole to the parts, revealing the connections and interactions of the constituent parts. The analyzed information is then refined and interpreted, and the objects are classified and recognized. The visual field surrounding us consists of objects of various shapes and colors; depending on distance, some overlap others, some are perceived as figure, some as background. From an evolutionary point of view, it is important for us to recognize the necessary object against the background of its environment [Shiffman, 2008], [Aliyeva, 2007]. Usually, out
Figure 1.3.1 The combination of “figure-background”.
of two objects in the field of view, the region with a marked border is perceived as the figure; it is smaller in size, or differs in filling, and is located closer to the observer (Figure 1.3.1) [Abbasov, 2019]. The fragments perceived as a figure carry more interesting information for us. The figure-background relationship can be perceived not only visually but also in other sensory systems. For example, you can hear a bird singing against the background of street noise, or a violin melody against the sounds of the rest of the orchestra. The perception of three-dimensional objects in photographs and paintings is in fact an illusion: our visual system, adapted to the perception of the real world, applies the signs of depth and distance to the picture plane.
1.3.1 Dual Perception of the Connection "Figure-Background"
There are situations when the perception of the "figure-background" relationship becomes ambiguous, dual (Figure 1.3.2): two objects that share common borders and are embedded in each other can each be perceived either as figure or as background. When viewing, one of the elements is usually perceived as the figure at first, but soon it begins to appear as the background, especially when the two are equal in size. The cross can be perceived as a figure with divergent rays, the concentric circles becoming the
Figure 1.3.2 Dual perception of background – figure combination.
background. But if you reorient yourself, the figure and the background change places; in the same way, the Flying Fish work of the famous Dutch graphic artist Maurits Escher becomes ambiguous [Escher, 1961].
1.3.2 Gestalt Grouping Factors
When analyzing the organization of perception processes, many phenomena can be explained with the help of Gestalt psychology, which emphasizes the important role of the relationship between the whole and its discrete parts. When viewing the patterns presented in Figures 1.3.3-1.3.5, one can observe the tendency to combine the elements of the pattern into different, mostly closed, rounded shapes [Abbasov, 2016]. Various combinations of elements arise spontaneously, compete with each other, and transform into other combinations. According to the law of clarity of structure, our perception identifies, first of all, the most distinct geometric structures. According to the law of completion to a structural whole, clear but incomplete structures are always completed into a clear geometric whole. However, the gestalt processes described may depend on the cultural conditions in which a person was formed. In some cultures, an open circle may be perceived as a "bracelet," and an open triangle as an "amulet." Consider the main gestalt factors determining how objects are grouped; a computational sketch of the proximity factor is given at the end of this subsection.
Figure 1.3.3 Chaotic pattern.
Figure 1.3.4 Gestalt perception of the figure.
Figure 1.3.5 Pattern based on clear patterns.
Proximity factor. According to the proximity factor, objects can be grouped according to their location in space (or time), depending on the distance between them. In Figure 1.3.6 the trees are grouped in pairs. Similarity factor. Grouping on the basis of the similarity factor is presented in Figure 1.3.7. It seems to us that asterisks of the same type form
Figure 1.3.6 Proximity factor grouping.
Figure 1.3.7 Similarity factor grouping.
vertical lines. Grouping based on proximity and similarity also operates in the process of sound perception: sounds following one after another in time can be perceived as a melody. The common-connection factor. Grouping on the basis of a common type of connection means the perception of a single structure formed by physically interconnected elements. As an example of such perception, Figure 1.3.8 presents a set of asterisks connected in pairs by lines of a single type. The "good continuation" factor. Elements located along a single trajectory are perceived by us as a single whole along one line. The "good continuation" factor is illustrated in Figure 1.3.9. Two intersecting curved lines
Figure 1.3.8 Grouping on the principle of a common type of connection.
Figure 1.3.9 Grouping factor “good continuation”.
are perceived as a curve going down (segment AB) and a curve going up (segment CD), although the figure can also be perceived as showing the curve sections AD and CB. The "common fate" factor. According to the "common fate" factor, elements moving in one direction are perceptually combined into one group. Such grouping occurs on the basis of identity, but the principle is applied to moving elements. For example, a flock of migratory birds flying in a "wedge" is perceived as a group (Figure 1.3.10). This factor also covers the "fascinating dance" of fish schools moving synchronously in the ocean, or the "waves" created by the hands of fans during a football match. Symmetry factor. In accordance with the symmetry factor, the more natural, balanced, symmetrical figures take priority in perception [Prokopenko et al., 2006]. In Figure 1.3.11, the black central column is predominantly seen, although profile outlines (the profile of Julius Caesar) can also be recognized against the white background. Closure factor. When grouping elements according to the closure factor, preference is given to the variant that favors the perception of a more closed
Figure 1.3.10 The "wedge" of migratory birds, the "common fate" factor.
Figure 1.3.11 Grouping by symmetry factor.
or completed figure (Figure 1.3.12, top). Figure 1.3.12, on the left, presents an example of a gestalt grouping of elements, the constellation Ursa Major; it is formed on the basis of the factors of "good continuation" and closure. Although in reality the stars of Ursa Major are not related to each other, the constellation is the result of perceiving their accidental arrangement from the Earth in this perspective. In Figure 1.3.12, on the right, one can recognize the figure of a jockey during a jump [Abbasov, 2016]. One more example of the interaction of the "good continuation" and closure factors in grouping: in Figure 1.3.13, on the left, we see a random set of fragments of indefinite form. However, if these
Figure 1.3.12 Grouping factor closure.
Figure 1.3.13 “Cats and dogs”, the factors of “good continuation” and closure.
Figure 1.3.14 Triangle and circle configuration.
fragments are overlaid with the contours of cat silhouettes (Figure 1.3.13, right), the cat figures can be seen against the background of several dogs. It should be noted that our perception relies not only on the spatial connections between figures, but also on the results of previous perceptions of similar forms. The law of Prägnanz, or the law of good form, means the preferential perception of the simplest and most stable of all possible forms, on the basis of the various grouping factors. For example, identifying a closed figure is easier than identifying an open one. We perceive the configuration presented in Figure 1.3.14, on the left, as a superimposed triangle and circle, although it can also be perceived differently, as a combination of three separate figures.
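As promised above, the proximity factor can be operationalized: elements belong to one perceptual group when a chain of sufficiently small gaps connects them, which amounts to single-linkage clustering with a distance threshold. A minimal sketch (the threshold is an arbitrary assumption):

```python
# Grouping points by the gestalt proximity factor: two elements fall in
# one group if a chain of below-threshold distances connects them
# (single-linkage clustering). The threshold value is assumed.

def group_by_proximity(points, threshold=1.5):
    """points: list of (x, y); returns a list of groups of point indices."""
    unvisited = set(range(len(points)))
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if (points[i][0] - points[j][0]) ** 2
                    + (points[i][1] - points[j][1]) ** 2 <= threshold ** 2]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups

# Pairs of "trees" as in Figure 1.3.6: close pairs with wide gaps between.
trees = [(0, 0), (1, 0), (4, 0), (5, 0), (8, 0), (9, 0)]
print(group_by_proximity(trees))  # three groups of two, e.g. [[0, 1], [2, 3], [4, 5]]
```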
1.3.3 Subjective Contours
Sometimes our visual system's striving for closure operates in an empty area of the field of view; this leads to the appearance of boundaries
or contours, which are called illusory or subjective. Examples of subjective contours are presented in Figure 1.3.15 [Abbasov, 2019]. Note that in some cases not only the outlines but entire figures are visible. According to the Gestalt explanation, the difference in lightness between form and background makes the figure appear lighter or more intense than the background. Figure 1.3.16 presents the Necker cube with subjective contours as an example of the cognitive approach [Abbasov, 2019]. When viewing the
Figure 1.3.15 Subjective contours.
Figure 1.3.16 Necker’s subjective cube.
drawing, it is perceived as a three-dimensional figure in the form of a cube, behind the vertices of which there are dark textured circles. Although the entire cube is visible, the lines appearing as its edges are illusory. In perception, our visual system also takes mutual overlap into account to determine the position of a figure. The main role in the formation of subjective contours is played by the perception of a central figure that partially obscures or blocks the surrounding elements. The more pronounced the apparent overlap by the central figure (the silhouettes of the palm and the girl), the faster the subjective contours appear (Figure 1.3.17). If the elements that give rise to the contour are themselves identified as closed, independent figures, the effect of the subjective contour is weaker. In this case, the complete shape of the overlapped elements causes an alternating perception of the central figure and the overlapped elements (Figure 1.3.18). Apparent overlap also helps us to perceive subjective figures based on three-dimensional forms. Through gestalt cognitive perception, the flat pictures presented in Figure 1.3.19 acquire volume and create the illusion of composite three-dimensional forms [Abbasov, 2019].
Figure 1.3.17 Strengthening of the subjective contour by overlapping elements.
Figure 1.3.18 The weakening of the subjective central rectangular contour.
Figure 1.3.19 Three-dimensional subjective contours.
1.3.4 The Dependence of Perception on the Orientation of the Figure
An additional recognition factor is the apparent orientation of a shape in a given context. As a cue to orientation, the observer takes into account the relative position of the top, bottom and borders of the figure. The unfamiliar shapes shown in Figure 1.3.20 seem different from the neighboring forms, although they are identical. The shapes on the right and at the top are recognized rather quickly owing to their special position; thus, with a change of location, the perception of the figure also changes.
Figure 1.3.20 Perception of forms depending on orientation.
The ambiguous forms presented in Figure 1.3.21 are perceived as shapeless spots, but they are all created from the profile of a woman's face, rotated into various orientations [Arnheim, 1974], [Shiffman, 2008], [Abbasov, 2016].
Figure 1.3.21 Perception of ambiguous forms.
Figure 1.3.22 Recognition of the profile on the background and as a contour.
The perception process is influenced by past experience, memories, expectations, suggestion and surroundings. An observer's predisposition toward a certain perception of the surrounding world is called a perceptual set. In Figure 1.3.22, four figures of irregular shape are shown on the left, but in the right-hand figure they are perceived as the profile of a face. The gestalt organization of perception here draws on the "figure-background" combination, built on the factors of "good continuation" and closure.
1.3.5 The Stroop Effect
The interplay of competing perceptual sets, demonstrated by the effect of J. Stroop, is associated with reading text in the presence of distracting factors. The effect manifests itself in that subjects cope with the task more slowly if they have to name the color in which a word is printed while the word itself denotes a different color. For example, the word "blue" is printed in red ink (Figure 1.3.23); you can try to name the colors of the words without being distracted by their meaning [Abbasov, 2019]. There is also the phenomenon of synesthesia ("color hearing"), in which the two sensory systems of sight and hearing are connected and influence each other in the final perception. A person with color hearing sees colored visual images when listening to music. Such people are called synesthetes; they
Figure 1.3.23 The task for the Stroop effect.
can read plain text without any problems, although each letter (sound) has its own color for them. When presented with a text of multicolored letters, they feel uncomfortable, begin to stumble, and their reading speed slows down: the arbitrarily assigned color of a letter does not coincide with their own perception of that letter's color. In the history of music many famous Russian composers were synesthetes, including A.N. Scriabin and N.A. Rimsky-Korsakov. The Frenchman O. Messiaen and the Lithuanian musician and artist M.K. Churlionis were also synesthetes.
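Stroop stimuli of the kind shown in Figure 1.3.23 are easy to generate programmatically, which is how the task is usually administered in experiments: congruent trials pair a color word with its own ink color, incongruent trials with a different one. A minimal sketch (the color list and trial counts are arbitrary choices):

```python
import random

# Generating congruent and incongruent Stroop trials.
# The color names and trial counts are illustrative assumptions.
COLORS = ["red", "green", "blue", "yellow"]

def make_trials(n, congruent):
    trials = []
    for _ in range(n):
        word = random.choice(COLORS)
        # The subject must name 'ink' while ignoring the word's meaning.
        ink = word if congruent else random.choice(
            [c for c in COLORS if c != word])
        trials.append({"word": word, "ink": ink, "congruent": congruent})
    return trials

trials = make_trials(10, congruent=True) + make_trials(10, congruent=False)
random.shuffle(trials)
# Comparing mean naming times between the two trial types measures the
# Stroop interference effect described above.
```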
1.4 Space Perception
1.4.1 Monocular Spatial Signs
The retina is a two-dimensional curved surface. Despite this, we can estimate the distance of objects in the three-dimensional world. The distance of objects is determined on the basis of certain monocular and binocular signs. Spatial signs that are perceived by one eye are called monocular (or pictorial). Monocular spatial signs are mostly static, i.e., they operate when the observer and the objects in the field of view are stationary [Gippenreiter, 2002], [Shiffman, 2008], [Abbasov, 2016]. The perception of still scenes, photographs and art paintings is based on static monocular signs. Monocular signs convey the depth of space and distance by visual means, and the observer has an illusion of volume when viewing pictures. Size of objects. The first monocular sign is the relative size of objects: a smaller object is perceived as more distant. Interposition. Partial blocking or overlapping of one object by another is called interposition. An object located closer may overlap the objects located behind it (Figure 1.4.1). Interposition helps to determine the relative distance of objects.
Figure 1.4.1 Flat and three-dimensional interposition of the same scene.
Aerial perspective. When viewing the surrounding landscape, we see nearby objects more clearly than distant ones. This monocular source of information is called aerial perspective; it arises due to the scattering of light on the smallest particles in the surface layer of the atmosphere. Shadow and luminosity. The surfaces of objects close to the light source have the greatest luminosity. With distance from the light source, the luminosity of surfaces decreases and their shading increases. The shadow that lies on the surface of the object itself, where it blocks the light, is called an attached shadow. If the shadow of an object falls on another surface, it is called a cast shadow. These shadows are important signs of the depth of the scene; they convey information about the shape of objects, the distances between them, and the location of the light source. The work of [Gonzalez, Niechwiej-Szwedo, 2016] describes the influence of monocular vision on the accuracy of hand movement coordination at the moment of grasping. It is known that the eyes lead the hand and remain in motion until the movement is completed. The object the eyes move to is localized in space and recognized, simplifying the control of subsequent movements. However, monocular vision does not allow the position of the object and the hands in space to be determined completely. Fifteen people took part in the experiment, performing the control action while recorded by a video tracker and a motion capture system. As a result, the time for taking a bead and threading it onto a needle increased to 2.5 s under monocular vision, against 2 s with binocular vision. The results demonstrate the defining role of binocular vision in everyday human life. The work of [Luke, Henderson, 2016] considers the influence of the meaningfulness of visual stimulus content on eye movements. Text, photographs of townscapes, landscape pictures and their analogs (stylized pseudo-stimuli) were used as visual stimuli. It was found that fixation durations and saccade amplitudes were larger for the pseudo-stimuli, as they require more time to be recognized and perceived. Linear perspective. Linear perspective plays an important role in the perception of the depth of space. It provides for a gradual decrease in the size of distant objects and of the distances between them. The most obvious example of linear perspective is railroad rails (Figure 1.4.2) [Website istockphoto, 2020]. Despite being parallel, the rails seem to converge in the distance at a point called the vanishing point.
Figure 1.4.2 Linear perspective.
1.4.2 Monocular Motion Parallax
Monocular motion parallax (from the Greek paralaxis, change) is a monocular source of information about the depth and relative position of objects in the field of view, arising from the movement of the observer or of the objects. When the observer fixes his gaze on some point of the field of view while moving, it begins to seem to him that objects lying closer than the fixation point move faster than more distant objects. During head movement, the retinal images of objects located at different distances are displaced by different amounts. It seems to the observer that closer objects move in the direction opposite to the eye movement, while more distant objects move in the same direction [Shiffman, 2008], [Abbasov, 2019].
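The geometry behind this can be made explicit. For an observer translating sideways with speed v and fixating a point at distance D, an object near the line of sight at distance d sweeps across the retina at an angular velocity of roughly v(1/d - 1/D); the sign change across the fixation distance is the direction reversal just described. A small numerical sketch (all values are assumed):

```python
# Retinal angular velocity under lateral observer translation.
# For observer speed v (m/s), fixation distance D (m) and an object at
# distance d (m) near the line of sight (small-angle approximation):
#     omega = v * (1/d - 1/D)   [rad/s]
# All numerical values below are assumed for illustration.

def parallax_rad_per_s(v, d, D):
    return v * (1.0 / d - 1.0 / D)

v, D = 1.5, 10.0   # walking speed, fixation point at 10 m
for d in (2.0, 5.0, 10.0, 40.0):
    omega = parallax_rad_per_s(v, d, D)
    print(f"object at {d:4.1f} m: {omega:+.3f} rad/s")
# The sign flips at the fixation distance: nearer objects stream one way
# across the retina, farther objects the opposite way.
```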
1.4.3 Binocular Signs
Monocular signs of space play an important role in the perception of primary visual information. However, accurate determination of the depth of space requires the activity of both eyes. Binocular signs are spatial information that can only be obtained by perceiving the surroundings with both eyes.
Binocular disparity. Animals with a frontal arrangement of the eyes (predators, primates) see a large part of the visual field with both eyes. Within the area of binocular overlap, the two eyes receive slightly different images of the same three-dimensional scene. The field of view of one eye differs somewhat from that of the other. This difference between the two retinal images is called binocular disparity (binocular parallax). Disparity is strongly pronounced for closely located objects and decreases as they recede. Beyond about four meters, the difference between the retinal images becomes insignificant, and binocular disparity weakens.
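The four-meter figure follows from simple triangulation: for interocular baseline b, the convergence angle on a point at distance Z is about b/Z, so the disparity produced by a fixed depth difference shrinks roughly as 1/Z^2. A quick numerical check (the 0.065 m baseline is an assumed typical interocular distance):

```python
import math

# Disparity produced by a 10 cm depth step at various distances.
# b = 0.065 m is an assumed typical interocular baseline.

def vergence_angle(b, Z):
    """Convergence angle (radians) on a point at distance Z."""
    return 2.0 * math.atan(b / (2.0 * Z))

b = 0.065
for Z in (0.5, 1.0, 2.0, 4.0, 8.0):
    delta = vergence_angle(b, Z) - vergence_angle(b, Z + 0.1)
    arcmin = math.degrees(delta) * 60.0
    print(f"Z = {Z:3.1f} m: disparity of a 10 cm step = {arcmin:6.2f} arcmin")
# The same 10 cm depth difference yields over an arcdegree of disparity
# at half a meter but only about an arcminute at four meters, which is
# why stereopsis weakens with distance.
```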
1.4.4 Binocular Disparity and Stereopsis
Due to the binocular disparity of the images on the retinas of the two eyes, a special perception of the depth and volume of space arises, called stereoscopic vision, or stereopsis (from the Greek stereos, solid, and opsis, vision) [Luria, 2006], [Shiffman, 2008], [Abbasov, 2016]. One of the most impressive examples of stereoscopic vision is the perception of depth when viewing slides with a stereoscope: when each eye is presented with a slightly different planar image of the same scene (such image pairs are called stereograms), an illusion of volume arises. Stereograms can also be created from a random set of black and white elements. Meaningful perception of the depth in such stereograms is possible only after the two images have been fused in a certain central visual zone. When viewed monocularly, these images lose depth and are perceived as uniformly distributed random elements. A special form of stereogram containing two combined patterns for both eyes is called an autostereogram. An autostereogram is an unusual and difficult task for the visual system, since the eyes must be focused at a distance different from the one at which the drawing itself is located. But if you look with two eyes and do not suffer from stereo-blindness, then with a certain amount of training (and patience) you can see the stereoscopic image.
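Random-dot stereograms of the kind just described can be generated in a few lines: copy a random-dot field, shift a central region horizontally in one copy, and fill the uncovered strip with fresh random dots. The sketch below follows this classic Julesz-style construction; the image sizes and the shift are arbitrary choices.

```python
import numpy as np

# Random-dot stereogram pair. The shifted central square carries no
# monocular shape cue; it becomes visible only when the two images are
# binocularly fused. Sizes and shift are illustrative choices.
rng = np.random.default_rng(0)
size, square, shift = 200, 80, 4

left = (rng.random((size, size)) > 0.5).astype(float)
right = left.copy()

y0 = x0 = (size - square) // 2
# Shift the central square a few pixels to the left in the right image...
right[y0:y0 + square, x0 - shift:x0 + square - shift] = \
    left[y0:y0 + square, x0:x0 + square]
# ...and fill the uncovered strip with new random dots.
right[y0:y0 + square, x0 + square - shift:x0 + square] = \
    (rng.random((square, shift)) > 0.5).astype(float)

# Fusing 'left' and 'right' (e.g., in a stereoscope) makes the central
# square appear to float in depth above the random background.
```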
1.5 Visual Illusions
An illusion is a false or distorted perception of a phenomenon. Visual (optical) illusions arise when visual perception does not correspond to the physical, real properties of an external stimulus. Illusions can be considered distorted perceptions of the surrounding reality, and in this they differ from
hallucinations, which are false perceptions arising in the absence of any external stimulus. According to the mechanism of their occurrence, visual illusions can be divided into the following types:
–– those arising from imperfections in the optical properties of the eye;
–– those resulting from the action of the entire visual system, including the brain;
–– dynamic illusions arising from a change in the position of the stimulus in space or in time.
The survey of [King et al., 2017] is dedicated to the analysis of distortions in the perception of visual illusions in schizophrenia. People suffering from schizophrenia may experience anomalies at different stages of visual information processing. Generally, such people are insensitive to high-level illusions, although they may be influenced by some illusions of the initial stage. Consider the "hollow mask" illusion [Website youtube.com, 2020] as an example. Upon viewing, the mask appears to turn both clockwise and counterclockwise (successive frames are shown in Figure 1.5.1). In fact, the mask turns in one direction only; however, for the majority of observers the mask changes its direction of movement at some moment. At the start of the movement, the convex front side of the mask presents no difficulty for the visual system and consciousness. When the mask begins to turn its back side toward us (the back can be recognized through the hollow eye pits), our brain turns it inside out, because the convex side is more familiar to
Figure 1.5.1 Successive frames of the "hollow mask" illusion [Website youtube.com, 2020].
us. We are not used to seeing people with concave faces. In the course of further turning, the brain inverts the image once again at the crossing line and restores the usual convex form of the mask. However, the brain of a schizophrenic cannot be deceived in this way; for it, the mask remains hollow after turning. It should be noted that the illusion likewise fails to work on some people under the influence of alcohol or narcotics. The illusion presented here uses a hollow mask of the famous actor Charlie Chaplin. Inverting reality while processing visual information is a normal property of the brain of a healthy person. The study of the principles of brain activity is a rather difficult field, and there are no definitive explanations even today. Which is more real: the perception of the real-world view by the brain of a schizophrenic (without processing), or the visual information processed by the brain of a healthy person?
1.5.1 Constancy Perception
As a result of evolution, our sensory system has adapted to an adequate perception of the objects and phenomena of the surrounding world [Arnheim, 1974], [Shiffman, 2008], [Abbasov, 2019]. In the process of perception, surrounding objects are characterized by some relatively stable physical properties. Perceptual stability under changing parameters of physical stimulation is called perceptual constancy.
Constant Perception of Lightness
The brightness of the surface of an object is characterized by the intensity of the light it reflects, or luminosity. When the light intensity changes, the perceived lightness of a sheet of white paper remains practically unchanged despite the change in luminosity: it will still be perceived as white paper, only less bright.
Constancy Perception of Size
The tendency to perceive the size of objects as unchanged regardless of their distance from the observer (and hence the size of the projection on the retina) is a consequence of size constancy. As Figure 1.5.2 shows, a person sitting at the end of the corridor is perceived to be the same size as one sitting in the foreground, although the retinal image of the former is several times smaller [Website istockphoto, 2020]. If the two images are placed next to each other, the change in size of the seated figures can be detected visually.
Figure 1.5.2 Constancy perception of size.
Constancy of Perception of the Form
The shape of an object can also be perceived stably, regardless of the angle of view. This feature of the visual system is called constancy of the perception of form. Doors familiar to us are perceived as rectangular regardless of the angle at which we look at them. However, their projection on the retina is rectangular only when they are directly in front of the observer.
1.5.2 The Development of the Process of Perception
To analyze the perceptual abilities of a person, some studies of abilities in infants can be noted. At the age of one month, babies cannot distinguish small details; their vision can make out only relatively large objects. Visual acuity is weak, facial expressions are difficult for them to distinguish, and they see mostly the external contours of a face. Infants find some forms more interesting than others; they are more inclined to look at forms resembling human faces. By three months, acuity improves; the baby can already decipher facial expressions and shows far more social reactions. He recognizes something in the face of his mother; when given the choice, he prefers to look at a photo of his mother rather than at a photo of an unfamiliar woman. The perception of depth begins to appear at about three months of age, but it is finally formed only by six months. At the age of four months, the
infant begins to reach for the nearer of two objects, determining which one is closer to it thanks to binocular disparity. Constancy of perception also begins to develop in the first months of life.
1.5.3 Perception after Sight-Restoring Surgery
From medical practice it is known that a person born blind, after an operation restoring vision in adulthood, faces many problems due to the lack of spatial impressions of the outside world. Such mature patients were unable to distinguish even simple objects or forms [Gregory, 1970]. After long training they learned to recognize the simplest of visible objects, but many could not learn this and returned to the former life of a blind person. The process was accompanied by strong emotional experiences. However, active and intellectually developed people, on the contrary, adapted fairly well after gaining sight. Most patients showed certain recurring patterns:
–– at first they see an objectively unorganized visual world, since it contains many unknown objects;
–– they distinguish figure and background;
–– the majority of patients are able to fix their gaze and follow moving objects;
–– they lack objectivity of perception and cannot determine the nature of the differences between objects;
–– they have the ability to assess the spatial distance of objects.
Vision is restored more quickly in people who lost their sight after birth than in those blind from birth. An interesting observation of a 52-year-old patient who underwent a sight-restoring operation is given in [Gregory, 1970]. At first he did not see anything except vague outlines, but after a few days he could already walk through the hospital, tell the time by the wall clock, and watch the movement of cars through the window. In the zoo he could correctly name most of the animals. In identifying objects he often drew on his tactile experience. His perception of distance was peculiar; for example, he thought that he would be able to touch the ground with his feet while at a window at a height of 12 meters. Sitting in a restaurant, he took the crescent moon for the reflection of a piece of cake in a mirror. He accurately estimated the distance and size of objects he knew well by touch. But he never learned to read with his eyes, although he immediately and without much difficulty began to distinguish capital letters and numbers. He had studied these letters at
a school for the blind, where lower-case letters were not taught, so he did not perceive them. It can therefore be noted that he used his past tactile experience very effectively, and this often limited the development of his visual perception. In a museum he was shown a machine placed under a glass cover, and he could not recognize it, although he had dreamed of seeing it. When the cover was removed and he could feel the machine, he said: "Now that I have felt it, I see it." In general, he perceived the world as dark and vague, which greatly upset him. After the onset of depression he ceased to live an active life, and he died three years later. As noted, the development of depression is quite characteristic of such cases. The image of the surrounding world based on tactile sensations was not able to accommodate the huge mass of constantly arriving new sensory information. Perhaps this led to information stress, to the destruction of the established picture of the world, and then to prolonged depression. Studies of the role of the congenital and the acquired in the process of perception have been performed on animals. A newborn animal is reared for some time in conditions of sensory isolation, in darkness or in the absence of structured visual stimulation, and then its perceptual abilities are evaluated. It was found that young animals raised in the dark show perception of depth shortly after birth. However, underdevelopment or degeneration of cortical neurons is usually observed in them, which indicates a disturbance of the analyzer system. As a result, it was concluded that various species of animals possess certain innate mechanisms of visual perception. The work of [Chen et al., 2019] considers the development of a portable image recognition system for people with impaired vision. Such people face questions of recognition and identification of visual information every day. A portable intelligent image recognition system based on combined cloud and local data processing is proposed. The system is economical, as it uses the highly efficient algorithms of a cloud server and captures objects by constantly scanning the incoming video. The device was tested under real-life conditions by people with impaired vision, helping them to recognize faces and identify the people they needed.
1.5.4 Illusion of the Moon
The illusion of the moon manifests itself in the fact that near the horizon the moon seems much larger than at its zenith, although its retinal images are identical in size in both cases (Figure 1.5.3). The angular size of the projection of the moon (and also of the sun) on the retina is about 0.5°.
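The 0.5° figure follows directly from the visual-angle formula theta = 2*arctan(s/2d) for an object of linear size s at distance d. A quick check with the moon's approximate diameter and distance (standard astronomical round figures):

```python
import math

# Visual angle of an object of linear size s at distance d:
#     theta = 2 * atan(s / (2 * d))
def visual_angle_deg(s, d):
    return math.degrees(2.0 * math.atan(s / (2.0 * d)))

# Approximate values: lunar diameter ~3,474 km, mean distance ~384,400 km.
print(visual_angle_deg(3474, 384400))   # ~0.52 degrees
# The same angle holds at the horizon and at the zenith, so the illusion
# cannot be explained by the retinal image itself.
```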
Figure 1.5.3 Illusion of the Moon.
The illusion of the moon has attracted great interest since ancient times, and many scientists have tried to explain it [Abbasov, 2016]. According to the remoteness hypothesis, an object separated from the observer by a filled space seems more distant than an object at the same distance seen against empty space. The perceived distance to the horizon is estimated as greater than the distance to the zenith. The retinal images of the moon are the same in both cases, but when the moon is on the horizon it seems more distant to the observer. According to the relative size hypothesis, the perceived size of an object depends not only on the size of its retinal image but also on the size of nearby objects: the smaller those objects, the larger its apparent size. Above the horizon, the moon is perceived against the background of the surrounding landscape; at the zenith we perceive it against a visually empty sky, and the remoteness cue "triggers the mechanism" of size constancy.
1.5.5 Illusions of Muller-Lyer, Ponzo, Poggendorf, Zolner
Figure 1.5.4 presents the well-known geometric illusion of Franz Muller-Lyer (1889). There are many theories explaining this illusion; the distinctive features of the stimulus are the arrowheads of the linear segments, which point in different directions. The upper segment seems longer than the lower one, although they are physically identical.
Figure 1.5.4 The illusion of Muller-Lyer.
Figure 1.5.5 shows the Ponzo illusion: the horizontal segments are the same length, but against the background of the converging perspective lines the "distant" segment appears noticeably longer. Figure 1.5.6 shows the illusions of Poggendorf and Zolner. Zolner's illusion is that long parallel lines intersected by short diagonal lines seem to diverge. Poggendorf's illusion is based on the apparent displacement of diagonal segments intersecting parallel lines.
Figure 1.5.5 The Ponzo illusion.
Figure 1.5.6 Illusions of Poggendorf and Zolner.
1.5.6 Horizontal-Vertical Illusion
Figure 1.5.7, on the left, shows the horizontal-vertical illusion of Wundt. It seems to the observer that the vertical line is longer than the horizontal one, although the lines are equal in length. For a horizontal line to appear equal in magnitude to a vertical one, it must be 30% longer (Figure 1.5.7, right).
1.5.7 Illusions of Contrast
The illusion of contrast consists in a distortion of the perception of stimuli under the influence of an opposite, contrasting or contextual stimulus. Such illusions include those of Ebbinghaus, Jastrow and Baldwin. In Figure 1.5.8, on the left, the two inner circles are identical, but the area of the left central circle seems larger because it is surrounded by smaller circles, while the area of the right circle seems smaller, as it is surrounded by larger circles. Jastrow's illusion, presented in Figure 1.5.8 on the right, also illustrates the effect of contrast on the perception of magnitude: the lower curved figure appears longer than the upper one, although they are identical. It should be noted that a signal is rarely perceived in isolation; the connections between the signal and its context can therefore influence the process of perception. Figure 1.5.9 shows Baldwin's illusion: two pairs of squares of different sizes are connected by lines of the same length. The central line
Figure 1.5.7 The horizontal-vertical illusion of Wundt.
Figure 1.5.8 The illusions of Ebbinghaus and Jastrow.
Figure 1.5.9 Influence of context on apparent length, Baldwin’s illusion.
between the lower, larger squares seems a bit shorter than the one between the upper squares. The squares bounding these lines affect the perception of their length and distort it. Thus we cannot neglect the context in which the main signal is perceived. Figure 1.5.10, on the left, shows illusions based on the contrast of tilt: the vertical lines appear inclined in the direction opposite to the tilt of the surrounding background lines. The effect of contrast is also illustrated by the illusions of Wundt and Hering (Figure 1.5.10, right). Both pairs of horizontal lines are straight and parallel to each other, but in Wundt's illusion they seem "bent" inward, and in Hering's illusion, outward. A similar distortion of perception resulting from the contrast between two adjacent fragments is also characteristic of the Fraser illusion (Figure 1.5.11) [Website istockphoto, 2020]. The contrast between adjacent white and black rectangles distorts the perception of physically parallel lines in the Münsterberg illusion (Figure 1.5.12). The illusion of non-parallelism is strong enough that it can be dispelled only with the help of a ruler. Many visual illusions are the result of a combination of several illusory effects. For example, the Zander parallelogram presented in Figure 1.5.13 also contains components of the Muller-Lyer illusion [Abbasov, 2019].
Figure 1.5.10 Illusions based on contrast tilt.
Figure 1.5.11 Fraser’s Illusion.
Figure 1.5.12 Illusion of Münsterberg.
Figure 1.5.13 Zander parallelogram, AB and AC are equal.
1.6 Conclusion
The first section presents a survey of some current lines of research in the field of visual information perception. The peculiarities of the eye anatomy of vertebrates, the structural elements of the visual system, and the sensory and perceptual organization of visual information by the brain are analyzed. The second section is dedicated to the visual system and the types of eye movements. The structure of the eyeball, the types of eye movements, and current methods and means of recording eye movements are described; questions relating to the perception of contour and light contrast and the types of color vision disorders are also discussed. The third section considers the process of perception of figure and ground. Some current studies are presented, the concepts of figure and ground are defined, and the factors of gestalt grouping, subjective contours and the recognition of images in space are examined. The fourth section is dedicated to the perception of the external environment and the peculiarities and signs of monocular and binocular vision. The fifth section analyzes perceptual organization and constancy factors, as well as the kinds of visual illusions and the reasons for their occurrence.
References
Abbasov I.B. Fundamentals of graphic design on a computer in Photoshop CS6 (textbook, third edition). M.: DMK Press, 2013. 238p.
Abbasov I.B. Visual perception: a tutorial. Moscow: DMK Press, 2016. 136p. ISBN: 978-5-97060-407-6
Abbasov I.B. Psychology of Visual Perception. Amazon Digital Services LLC, 2019. 141p. ASIN: B07MPXKLQ7 [Electronic resource]
Aliyeva N.Z. Visual illusions. Rostov n/D.: Phoenix, 2007. 333p.
Arnheim R. Art and visual perception. M.: Progress, 1974. 384p.
Barabanschikov V.A., Zhegallo A.V., Ivanova L.A. Recognition of expression of inverted face image //Experimental Psychology, 2010, V.3, №3, P.66–83
Basyul I.A., Demidov A.A., Diveev D.A. Regularities of oculomotor activity of Russians and Tuvans in the assessment of perceptual confidence by facial expressions //Experimental Psychology, 2017, V.10, №4, P.148–162. doi:10.17759/exppsy.2017100410
Buhmann J.M., Malik J., Perona P. Image recognition: Visual grouping, recognition, and learning //Proceedings of the National Academy of Sciences of the United States of America, 1999, V.96, №25, P.14203–14204. doi:10.1073/pnas.96.25.14203
Butenko V.V. Analysis of methods and systems for recording oculomotor activity //Engineering: problems and prospects: proceedings of the IV International Scientific Conference. SPb.: Own publishing house, 2016. 134p. P.1-5
Chen Xia, Rong Quan. Predicting Saccadic Eye Movements in Free Viewing of Webpages //IEEE Access, 2020, V.8, P.15598-15610. doi:10.1109/ACCESS.2020.2966628
Eremina D.A., Shelepin Yu.E. The dynamics of psychophysiological indicators of visual perception of patients during the rehabilitation after coronary artery bypass grafting (evidence from recognition of fragmented images) //Vestnik SUSU. Series "Psychology", 2015, V.8, №1, P.113–120
Escher M.C. The graphic work of M. C. Escher. New York, 1961. https://www.moma.org/artists/1757 Date of access 27.04.2020.
Fershild M.D. Color perception models. Ed. Rochester Inst. of Technology, USA, 2004. 439p.
Gonzalez D.A., Niechwiej-Szwedo E. The effects of monocular viewing on hand-eye coordination during sequential grasping and placing movements //Vision Research, 2016, V.128, P.30-38. doi:10.1016/j.visres.2016.08.006
Gray B. Creative paintings. www.brucegray.com/images/prism.jpg Date of access 26.04.2020.
Gregory R. Eye and brain. Psychology of visual perception. M.: Progress, 1970. 272p.
Gusev A.N. General psychology: in 7 volumes: A textbook for students of higher institutions /ed. B.S. Bratus. V.2. M.: Ed. Center "Academy", 2007. 416p.
Doliotis P., Tsekouras G., Anagnostopoulos C.-N., Athitsos V. Intelligent Modification of Colors in Digitized Paintings for Enhancing the Visual Perception of Color-blind Viewers //IFIP International Federation for Information Processing, 2009, V.296, P.293–301
Itten I. Art of color. M.: D. Aronov, 2001. 138p.
Junji Ito, Yukako Yamane, Mika Suzuki, Pedro Maldonado, Ichiro Fujita, Hiroshi Tamura, Sonja Grün. Switch from ambient to focal processing mode explains the dynamics of free viewing eye movements //Scientific Reports, 2017, №7, Article 1082, 14p. doi:10.1038/s41598-017-01076-w
Kassan A. Human Anatomy. Illustrated Atlas. Kharkov: Ed. LLC "Family Leisure Club", 2011. 192p.
King D.J., Hodgekins J., Chouinard P.A., Chouinard V-A., Sperandio I. A review of abnormalities in the perception of visual illusions in schizophrenia
62 Recognition and Perception of Images //Psychon Bull Rev, 2017, №24, P.734–751, doi:10.3758/s13423-0161168-5 Ghosh K., Bhaumik K. Complexity in Human Perception of Brightness: A Historical Review on the Evolution of the Philosophy of Visual Perception //OnLine Journal of Biological Sciences. 2010. №10. P.17-35. doi:10.3844/ ojbsci.2010.17.35 Luke S.G., Henderson J.M. The Influence of Content Meaningfulness on Eye Movements across Tasks: Evidence from Scene Viewing and Reading //Frontiers in Psychology, 2016, V.7, Article 257, 10p. doi:10.3389/ fpsyg.2016.00257 Luria A.R. Lectures on general psychology. - SPb.: Peter, 2006. - 325p. Maglione A.G., Brizi A., Vecchiato G., Rossi D., Trettel A., Modica E., Babiloni F. A Neuroelectrical Brain Imaging Study on the Perception of Figurative Paintings against Only their Color or Shape Contents //Frontiers in Human Neuroscience, 2017, V. 11. Article 378, 14p, doi:10.3389/fnhum.2017.00378 Official site http://www.istockphoto.com Date of access 07.04.2020. Pepperell R. Problems and Paradoxes of Painting and Perception //Art & Perception, №7, 2019, P: 109–122, doi:10.1163/22134913-20191142 Prokopenko V.T., Trofimov V.A., Sharok L.P. Psychology of visual perception. Tutorial. SPb: SPbGUITMO, 2006. 73 p. Psychology of sensations and perceptions, Ed. Gippenreiter Yu.B. M.: CheRo, 2002. 610p. Repin I.E. Did not wait [Electronic resource]. State Tretyakov Gallery - www.tretyakovgallery.ru Date of access 28.04.2020. Shiffman H.R. Sensation and Perception. John Wiley and Sons Ltd; 6th Edition. 2008. 608p. Shiwei Chen, Dayue Yao, Huiliang Cao and Chong Shen. A Novel Approach to Wearable Image Recognition Systems to Aid Visually Impaired People // Journals Applied Sciences, 2019, V.9, Issue 16, Article 3350, 20p. doi:10.3390/ app9163350 The effect of the hollow mask: https://www.youtube.com/watch?time_ continue=98&v=sKa0eaKsdA0&feature=emb_logo Date of access 25.04. 2020. Tikhomirov G.V., Konstantinova I.O., Cirkova M.M., Bulanov N.A., Grigoryeva V.N. Visual Object Agnosia in Brain Lesions (Review) //Sovremennye tehnologii v medicine, 2019, V.11. P.46-52. doi:10.17691/stm2019.11.1.05 Tobii REX, Tobii EyeX, Tobii TheEyeTribe – Gaze tracking technology [Electronic resource] URL: http://www.tobii.com/ Wegner-Clemens K., Rennig J., Magnotti J.F., Beauchamp M.S. Using principal component analysis to characterize eye movement fixation patterns during face viewing //Journal of Vision November 2019, V.19, №2. P.1-15, doi:10.1167/19.13.2. Yarbus A.L. Eye Movements and Vision. 1967. New York: Springer. 152p.
2 Image Recognition Based on Compositional Schemes Victoria I. Barvenko* and Natalia V. Krasnovskaya Southern Federal University, Academy of Engineering and Technology, Department of Engineering Graphics and Computer Design, Taganrog, Russia
Abstract
This chapter studies works of fine art with the aim of recognizing and processing images. The search for compositional schemes rests on the idea that the drawing is the graphic basis of any image, while the image itself is the result of the reconstruction of an object in a person's mind and can be embodied in iconic models. Artistic images in visual art can be reduced to formal attributes, given that the structure of an image can be correlated with its source. The artistic image forms a new probable reality, synthesizing visibility, concreteness and abstraction. Through constructive thinking and imagination, the artist creates various models and constructions in the material of art. Such images can be classified, and recognition systems can determine whether they belong to a particular style or author. Keywords: Conceptual model, compositional scheme, image, constructive thinking, artistic perception, associative coupling, type and symbol, recognition
2.1 Artistic Image
An artist who creates a work of art evidently has the ability to reflect emotionally, perceiving the surrounding world and transforming it into artistic images and forms. Composition underlies every type of creative work. Knowing professionally how a work acts on the viewer, the artist consciously uses compositional schemes.
*Corresponding author: [email protected]
These schemes allow the artist to create the necessary expressive tension in the picture plane or space, a new artistic reality that shapes creative experience. Compositional schemes are universal tools that took shape over thousands of years. A convincing example of such schemes is the so-called "ancient canons": drawings and charts that can be found in history museums, archaeological museums, open-air museums and libraries, and that are well known and accessible not only to specialists but also to any curious tourist (Figure 2.1). "An image is the result of reconstruction of an object in the human mind; a concept that is an integral part of philosophical, psychological, sociological and aesthetic discourses" [Academic Dictionary, 2020]. An artistic image can be materialized in practical actions, in language, or in symbolic models. In visual art, for example, images can be reduced to formal features; they form a new reality, synthesizing visibility and abstraction. The artist creates various models and constructions in the material of art by means of constructive thinking and imagination [Academic Dictionary, 2020]. Drawing is the graphic basis of any image. Every artist is familiar with the process of working on a composition, when impressions are recorded in generalized, concise images. The process of filling the compositional plane is
Figure 2.1 Ancient canons: (a) the Acropolis; (b) the Propylaea of the Acropolis of Athens; (c, d) compositional scheme and proportions [Milonov, 1936].
based on the principle of "from the general to the particular", with generalization at the final stage of the work, in reverse order [Favorsky, 1966]. Looking at this process a little more closely, one can note a number of actions the artist performs when thinking through the future work. Starting intuitively with the choice of the main axes (vertical, horizontal or diagonal) and with a brief sketch resembling a series of random lines, rhythmic spots and strokes, the artist then "gropes" toward the composition by searching for the movement of plastic masses of tone and color. Analytical thinking then brings rationalization, a conscious action that results in a well-thought-out, fully built compositional decision. This is almost always visible to the observer. In fact, there are not so many compositional schemes in the artist's arsenal, especially if the artist does not work imitatively. Compositional schemes were developed over a long period, passing through stages of development that researchers associate with the "artistry" of perception already among ancient peoples, who worked with materials for a long time, transforming them and giving them the desired shape. Pictorial convention should correspond associatively with real forms or with other forms expressed in the graphic convention of their depiction. Ancient people were able to create graphic forms spontaneously: a line, a spot, a relief. It is important that, in the reverse direction, they could "perceive these signs, find similarities between them and the real world" [Miller, 2014]. Processing the material presented a number of difficulties, for obvious reasons. A human sought in nature a form partially similar to the desired one and then refined it, making it resemble animals, birds or other images. Examples of the pictorial activity of ancient people discovered in caves give an understanding of the analogies they used: the outlines of drawn animals, their almost always vertical position, and the images of red spots on the bodies of animals. These spots can confidently be associated with wounds inflicted on the animal during the "jump" that we recognize in the characteristic outline of the animal's back. This indicates "significant human development in the direction of evaluating the parallelism of conditional forms with the forms of real objects in his surroundings" [Miller, 2014]. In the same row one can place the first abstract drawings of children, when they are told how to draw a person: "dot, dot, comma, out comes a crooked face; hands, legs, oval, and that's the little human." An artistic image, being both a means and a form, belongs to the general category of artistic creation and a priori possesses expressiveness. In antiquity and the Middle Ages it was primarily consistent with
the Canon; in Renaissance aesthetics it was endowed with the creative energy of the Creator's personality. The seventeenth century, under the conditions of utilitarianism, formed the artistic image on the principle of internal purpose rather than external use (according to Kant) [Academic Dictionary, 2020], [dic.academic Website, 2020]. The nineteenth century described artistic creativity as the work of thought, emphasizing spirituality in art and differentiating it from the scientific, conceptual type of thinking. The category of the artistic image was formed in Hegel's Aesthetics precisely through opposition: the image "... reveals to our eyes not an abstract essence, but its concrete reality..." [Hegel, 1971]. Hegel's "thinking in images" implied the clear demonstration of a general idea, which called into question the universality of the category of the artistic image. The positivists discarded the idea of the artistic image, while the formalists dissolved the concept of image in the concept of form (as well as construction). Aesthetic utilitarianism in the twentieth century became "the reverse side of formalist and constructivist theories" [Academic Dictionary, 2020]. The inner form of an artistic image, clearly fixed in its fundamental moments, sets the direction of our imagination and sets its boundaries, but does not give complete certainty. On the one hand this schematic quality is a shortcoming; at the same time it is the source of the image's independent life and play, and an advantage, since the ambiguity of interpretation compensates for it. Any internal form bears the stamp of the author's personal subjectivity, and its "truth" is relative. From the artistic side, however, the image represents "the arena of the ultimate action of aesthetically harmonizing, completing and enlightening 'laws of beauty'" (Figure 2.2), which determine its belonging to the "true" world of eternal life values and the ideal possibilities of human existence [Philosophical Dictionary, 2020].
Figure 2.2 The proportions of the “Golden section” are the ultimate expression of the “laws of beauty”.
The structural diversity of the artistic image can, in highly generalized form, be reduced "to two initial principles – the principle of representative selection and the principle of associative coupling. At the ideological and semantic level, these two structural principles correspond to two types of artistic generalization – type and symbol" [Losev, 2020]. The art critic N. N. Volkov writes in his research: "By the composition of a picture we mean the construction of the plot on a plane within the boundaries of the 'frame'." The purpose and formative principle of the composition of a picture, however, is not the construction itself but the meaning (Figure 2.3); the construction performs the function of conveying meaning. A composition can be the creation of a new reality as well as an interpretation of an existing one. Together, the plot and its presentation on the plane of the picture generate the meaning and the image as a whole [Volkov, 1978]. The ability to create visual images by means of composition has been described many times in the literature. The main methods of construction, however, are few: the organization of space, the identification of the main element, and the coloristic
Figure 2.3 Benozzo Gozzoli “Procession of the Magi”. A fragment of the painting of the chapel of Magi in the Palazzo Medici Riccardi in Florence (the entire painting occupies three walls of the chapel), 1459-1462.
solution as the basis of the figurative principle and of expressiveness, which is the main acting force of the composition. The organization of space involves singling out certain factors:
–– plan diversity (foreground, middle ground, and far-distance view);
–– intersection of plans: rhythmic division by horizontals and verticals;
–– depiction of depth on a plane, and internal dynamism. The effect of movement is usually introduced into the picture plane using diagonal lines and free space in front of the moving object; in painting it is achieved by coloristic density and changes in the "temperature" of color;
–– rhythmic division of the plane: plastic, tonal, color. The rhythm can be set by lines, color spots, and the play of light and shadow; symmetry and asymmetry;
–– the places of the main forms, their relations, format and size are worked out in the brief sketch, which can already be considered a compositional scheme at the preliminary stage.
Identification of the main element in the composition is achieved by several techniques based on singling out the center of psychological attention: detection of color tension; contrast and tone selection; subordination of the entire state to the main one; detailing the elements of the center or generalizing the secondary ones. Subordination of the secondary to the main is one of the most important conditions of composition, sharpening the character of the main motif. Color is the most important component in solving compositional problems: it is the basis of the figurative principle when artists create states, and it is the emotional content of the idea. The expressiveness of the composition is inevitably associated with the search for new solutions:
–– the degree of convention and the search for a plastic language: it is not recognizability but the expressiveness and sharpness of the composition that are crucial;
–– contrast as a universal tool and the most important principle of composition: contrasts of values, volumes, textures, planes, tonal and color contrasts, the contrast of new and old, statics and dynamics, etc. [Chekantsev, 2015].
2.2 Classification of Features
The selection of informative features is an important part of the pattern recognition problem. It is closely related to the task of preliminary classification [Chichvarin, 2020], [Favorsky, 1966]:
–– the task of automatic classification: dividing a set of objects, situations and phenomena, according to their descriptions, into a system of disjoint classes (taxonomy, cluster analysis, self-learning);
–– the task of selecting an informative set of features for recognition;
–– the task of reducing the source data to a form convenient for recognition.
Success in solving a pattern recognition problem depends largely on how well the features are selected. The initial set of characteristics is often very large, whereas an acceptable decision rule should be based on a small number of features that are most important for distinguishing one image from another. The internal compositional tension inherent in works of art of a high level is devoid of decoration; it is visualized within certain boundaries, i.e., the viewer really does catch and recognize the boundary at which the shape of spots changes, or at which color tone, brightness or contrast change. The "inner life of the picture" can be conditionally reflected by schematic symbols that fit into ascetic images, for example simple geometric shapes: elementary expressions of form, or primitives. The structures of elements that are united in a single compositional field, forming and organizing its expressiveness, are arranged in "chains" that are concrete, connected and ordered. "Structural expressiveness is only a concretization of the above-mentioned regularity of order", the so-called symbol [Losev, 1995]. The outer side cannot be just anything: "It should also be something more or less significant. For the time being, let us consider only one thing: every symbol of a thing, person, or event is their expression in the case of the noticeable significance of these objects of the symbol; but not every expression is necessarily a symbol. Or, rather, the elementary expression of every object is also a kind of symbol, but a symbol in a rudimentary, undeveloped form. Or simply put, only a sign, that is, an initial and primitive indication of a particular object, without any specific development..." [Losev, 1995], [Gerchuk, 2016].
This property of a symbol, enclosed in some general form by a combination of simple geometric shapes, indicates the hidden, latent, or implicit. Its content can be used to recognize an author's images through the compositional schemes typical of that author's creative manner (Figure 2.4). The symbol is not artistic imagery in itself; it is "declared as ... a function that can ... be displayed ... for the purpose of its logical construction" [Losev, 1995]. "Artistic imagery ... is never complete without a symbol, although the symbol here is quite specific. To remove symbolism from an artistic image is to deprive it of the very object of which it is the image. The symbol in any artistic imagery may also be the subject of its construction and also its generating model. But the whole point is that a pure artistic image, taken in isolation from everything else, constructs itself and is a model for itself (Figure 2.5) [dic.academic website, 2020]. There are both the general (idea) and the individual (sensuous realities), and this general also generates from itself a finite or infinite semantic series of singularities" [Losev, 1995].
Figure 2.4 Dance movements recorded in a scheme (early twentieth century).
Figure 2.5 Surreal modern painting by Ettore Aldo Del Vigo, with compositional analysis.
2.3 Compositional Analysis of an Art Work
The formal and logical generalization of the singularities of an artistic image makes it possible to merge symbols charged with meaning into a single indivisible whole, and to break them into separate elements, creating new structures that can be reinterpreted for analysis using other (geometric) forms [Losev, 1995]. The search for suitable images based on a sample is a solvable task in cases where objects are characterized and grouped by color, shape, position, their distinctive features and combinations of these.
Simple geometric shapes can be used to define objects by shape once the classification primitives are set: round, elliptical, rectangular, and rectilinear objects. This holds for the analysis of images of various kinds and character: architectural and iconographic canons (schemes); the composition of landscape paintings; plot (genre) paintings (biblical, mythological, allegorical); and also compositions of modernist and postmodern works made in mixed techniques with a share of borrowing and "quotation" (collage, assemblage, where images of famous artifacts or works by other authors are used in fragments or in whole). As an example, below is the author's work in the collage technique, "Alexander and Bucephalus" (Figure 2.6): a collective image of the great commander and his favorite horse, the great army, coins minted with their portraits in different periods, artifacts dedicated to the battle with the Persians and the military campaign in India, and warrior heroes of computer games (Figure 2.7). The collected works of sculptors, artists and designers, made in different eras, are united by the greatness and glory of Alexander the Great and his faithful horse Bucephalus. In this composition, in addition to the semantic center loaded with details, a significant number of elements are used that are similar in shape
Figure 2.6 V. I. Barvenko “Alexander and Bucephalus”, hardboard, mixed technique, 80x80cm, 2013, and a detail - coins minted during the life of Alexander the Great with a scene of taming Bucephalus and the profile of the king in a military helmet.
Figure 2.7 Fragments of the work by V. I. Barvenko “Alexander and Bucephalus”, collected from a variety of artifacts dedicated to the victories of the great tsar, commander, and his army.
and color, arranged compositionally in a circle, in their main mass, on a square picture plane. Taking the study of the author's work as a whole and of its elements as an example, the problem of image search by sample arises, which is a sub-task of the more general problem of image recognition [Chichvarin, 2020]. One direct way to approach the round primitives of such a composition is sketched below.
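As an illustration only, and not a procedure taken from the chapter: round primitives such as the coins in the collage can be located with the Hough circle transform from the OpenCV library. The file name and all parameter values below are assumptions chosen for demonstration.

# A minimal sketch: finding round objects (e.g., coins) in a digitized collage.
import cv2
import numpy as np

img = cv2.imread("alexander_and_bucephalus.jpg")   # hypothetical file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)                     # suppress texture noise

circles = cv2.HoughCircles(
    gray, cv2.HOUGH_GRADIENT,
    dp=1.2,        # inverse ratio of accumulator resolution
    minDist=40,    # minimum distance between detected centers
    param1=100,    # upper Canny edge threshold
    param2=60,     # accumulator threshold: higher means fewer, stronger circles
    minRadius=15, maxRadius=120)                   # assumed plausible coin sizes

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (int(x), int(y)), int(r), (0, 255, 0), 2)  # mark candidates
cv2.imwrite("circles_found.jpg", img)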
2.4 Classification by Shape, Position, Color
When "similar" objects are searched for in a set in an unsystematic, undirected way, the enumeration can go on indefinitely without reaching completion with any given probability (Figure 2.7, left). In practical cases, objects are characterized by identification parameters (attributes) such as shape, color, position, mobility, distinctive features and their combinations, and are classified according to these factors. The most common classification features are presented below [Chichvarin, 2020].
Classification According to the Form
In a composition, on the picture plane, the task may be to classify objects by shape using a compositional scheme. Experts intuitively identify objects that can be defined formally using classification primitives: objects that are round, elliptical, rectangular, or in the form of a straight line. There are two ways to search for similar objects (a hedged template-matching sketch follows this list):
–– by pattern (for example, "the coin is round"), a method that is not always convenient. Keep in mind that the template image is not dynamically scalable: if the object in the frame is slightly smaller or larger than the template image, it will most likely not be detected. A solution to this problem can be to search for objects based on an analytical dependency that describes their shape [Chichvarin, 2020];
–– by determining the likely location. Most often, the difference between the brightness values of the template image and the analyzed frame is used (Figure 2.8).
Classification by position applies to searches in paintings, architectural images, etc.
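The brightness-difference search described above corresponds closely to classical template matching. The sketch below is a hedged illustration, not the chapter's own tooling: it uses OpenCV's normalized cross-correlation with a crude loop over scales to soften the fixed-size limitation just noted. The file names and the acceptance threshold are assumptions.

# Search "by pattern": template matching over several template scales.
import cv2
import numpy as np

frame = cv2.imread("painting.jpg", cv2.IMREAD_GRAYSCALE)          # hypothetical
template = cv2.imread("coin_template.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical

best = None  # (score, top_left, (width, height))
for scale in np.linspace(0.5, 1.5, 11):      # template is not scale-invariant,
    t = cv2.resize(template, None, fx=scale, fy=scale)  # so try several sizes
    if t.shape[0] > frame.shape[0] or t.shape[1] > frame.shape[1]:
        continue                             # template must fit inside the frame
    result = cv2.matchTemplate(frame, t, cv2.TM_CCOEFF_NORMED)
    _, score, _, loc = cv2.minMaxLoc(result)  # best correlation at this scale
    if best is None or score > best[0]:
        best = (score, loc, t.shape[::-1])

if best and best[0] > 0.7:                   # assumed acceptance threshold
    score, (x, y), (w, h) = best
    print(f"match at ({x}, {y}), size {w}x{h}, score {score:.2f}")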
Search Methods by Position
When searching for faces or other body fragments in the frame area, many additional assumptions are made about the relative location of objects. If easy-to-find labels are put on an object, or if some details originally belonging to the object are used, these are much easier to classify than the object as a whole; once such labels or parts are found, the object containing them can be classified. That is, if there is a stable method of detecting
Figure 2.8 Fragments of the work by V. I. Barvenko “Alexander and Bucephalus”, objects of different sizes and brightness that are similar in shape.
Figure 2.9 Analysis and determination of probable forms.
in the frame, for example, a person's eye or nose, then the position of these details allows an assumption about where everything else is located.
Classification by Color
Any object has not only a shape but also a color, and the color can be classified regardless of whether the object is permanently colored or the color appears only under certain conditions (Figure 2.9). Moreover, because there are many bases for representing color components (RGB, YUV, YCrCb, HSV, etc.), it is not uncommon for a given object to be classified almost infallibly in one particular basis; a short sketch of such a color test is given below.
Figure 2.10 Fragments of the work by V. I. Barvenko "Alexander and Bucephalus": images (horses, coins, sculptures) forming a classification group that unites (distinguishes) a certain set of objects by some attribute.
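As a hedged illustration of what choosing a basis can mean in practice, the sketch below converts an image fragment to HSV and measures how much of it falls within a golden-ochre hue interval. The file name and the hue bounds are assumptions for demonstration only.

# Classification by color: a simple HSV interval test.
import cv2
import numpy as np

img = cv2.imread("fragment.jpg")                  # hypothetical file name
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)        # change of basis: BGR -> HSV

lower = np.array([15, 60, 60])     # lower hue/sat/value bound (OpenCV hue: 0-179)
upper = np.array([35, 255, 255])   # upper bound of the assumed "golden" interval
mask = cv2.inRange(hsv, lower, upper)             # 255 where a pixel qualifies

coverage = mask.mean() / 255                      # share of "golden" pixels
print(f"golden-hued share of the fragment: {coverage:.1%}")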
However, information about which basis to use and how best to organize the search for an object can often be obtained only experimentally [Chichvarin, 2020]. Images have characteristic properties that people associate with one object only: we will not confuse a raccoon with a fox, or a lemon with a pomegranate. Such characteristic, independent properties can be classified [Condorovici et al., 2015]. The method of assigning an element to an image is called the decision rule. Another important concept is the metric, a method for determining the distance between elements of a universal set: the smaller the distance, the more similar are the objects (symbols, sounds, etc.) being recognized. A minimal sketch combining a metric with a decision rule follows the list below. Some examples of recognition tasks related to visual images:
• letter recognition;
• face and other biometric data recognition;
• image recognition.
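The following sketch makes the pairing of a metric with a decision rule concrete as a nearest-neighbor classifier over hand-made feature vectors. The feature layout and all the example values are invented for illustration and are not drawn from the chapter.

# Decision rule + metric: nearest-neighbor classification.
import numpy as np

def euclidean(a, b):
    # Metric: the distance between two feature vectors.
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def nearest_neighbor(query, labeled_examples):
    # Decision rule: assign the label of the closest known example.
    return min(labeled_examples, key=lambda ex: euclidean(query, ex[0]))[1]

# Hypothetical features: [mean R, mean G, mean B, roundness], all in [0, 1].
examples = [
    ([0.80, 0.70, 0.30, 0.95], "coin"),       # warm color, nearly round
    ([0.40, 0.30, 0.20, 0.35], "horse"),      # darker, elongated silhouette
    ([0.90, 0.90, 0.90, 0.50], "sculpture"),  # pale marble, irregular outline
]
print(nearest_neighbor([0.75, 0.65, 0.35, 0.90], examples))   # -> "coin"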
2.5 Classification According to the Content of the Scenes
Architectural Schemes
Architectural-figurative solutions can be determined by the composition (proportions) of the external or internal appearance of architectural structures (or of their plan) in a painting or mural: an outstanding, memorable appearance, bold design solutions, a distinctive color or texture treatment, etc. (Figure 2.11). Architectural creativity is defined by functions, compositional types, and genres. It covers buildings with an organized internal space, structures that form open spaces, ensembles, technical design systems, and the artistic framework of architectural structures (Figure 2.12). Buildings are classified into groups divided according to their purpose into civil (residential and public), industrial (manufacturing) and special-purpose.
Analysis of Landscape Scenes
The composition of landscape paintings, viewed from the standpoint of landscape art, makes it possible, from a picture, image or photograph of the present
Figure 2.11 Michelangelo Buonarroti. Architectural design of the tomb of Pope Julius II della Rovere, Rome.
time, to establish the appearance of a park or to determine which park is depicted in the work [landscape website, 2020]. The concepts of "landscape painting", "park painting", and "landscape" can be considered synonymous in the modern terminology of landscape art. A landscape picture, a landscape, is a part of the park space that is visually separated from the general park space, conditionally enclosed in a "frame" that restricts the field of vision, and possessing a certain compositional structure. Unlike a picture in landscape painting (Figure 2.13), a park landscape picture occupies three-dimensional space and can be perceived from different points of view (Figures 2.15, 2.16). It occupies a certain space within the park landscape, limited by definite boundaries and visibility conditions. The park picture often includes elements of other landscape areas in its composition and is thus a means of organizing the unity of the composition of the landscape (the park). The variety of landscapes in the visual arts can be reduced to 17 compositional scheme-models (Figure 2.14), from which images of landscape scenes are built.
Figure 2.12 Analysis of architectural public structures. The Colosseum and the Fontana di Trevi in Rome. Modern prints from old images of famous architectural monuments, Uffizi bookshop, Florence (personal collection of the author V. I. Barvenko).
Figure 2.13 Analysis of the landscape scenes of a landscape painting depicting Istanbul (Constantinople) (personal collection of the author V. I. Barvenko).
The main means of constructing a landscape on a picture plane are rhythm, symmetry, asymmetry, contrast, nuance, scale, proportionality, and space. Space plays the major role in the composition: it is the main factor in the formation of the composition's structure (Figures 2.15, 2.16) [Wang Z. et al., 2019].
Figure 2.14 The 17 compositional scheme-models of landscapes [landscape website, 2020].
Figure 2.15 Piazza del Popolo, Rome, Italy. Reprint from an antique French original (personal collection of the author V. I. Barvenko).
Each park object is a single composition consisting of its own territorial units, landscape areas, whose boundaries are distinguished not only by the similarity of spatial characteristics and physiognomic features but also by compositional design and artistic image. Along with the terms "landscape" and "landscape picture", the terms "view" and "frame" are often used. The theory of composition in painting, developed in the middle of the twentieth century, made it possible to establish general patterns in the composition of landscapes, both depicted and natural, and to justify the unity of the principles and means of composition in paintings.
Figure 2.16 Analysis of landscape scenes from a reprint of an old French engraving (personal collection of the author V. I. Barvenko).
Composition in paintings is built depending on the factors that determine it. According to N. Volkov's definition, these are: 1) the composite node, i.e., the center where the main subject of the image is placed (the node will not necessarily lie at the center of the picture); 2) the division of the field of view, ensuring the separation of the important elements of the picture from one another; 3) the integrity of the field, provided by the connection of the secondary elements of the picture with the main composite node. These factors play the same role both in the composition of landscape paintings and in the perception of landscape in space [landscape website, 2020].
2.6 Compositional Analysis in Iconography
One of the most important symbols in iconography, which has survived to the present day without losing its sacred meaning, is the Crucifixion [livejournal website, 2020]. Ancient images give an understanding of how the iconographic scheme developed (Figure 2.17). The chronologically first known representation of the Crucifixion is the magic gem from Giza (British Museum), dating from the second to third century [livejournal website, 2020]. A drawing of the Crucifixion from the London Oriental Manuscript, circa 600, and an illustration of the Crucifixion from the Gospel of Rabbula, circa 586, show a different scheme for depicting the biblical story (Figure 2.18). The manuscript is a Coptic incantation book; its image of the Crucifixion illustrates "the Prayer that Jesus Christ uttered on the cross". Here, for the first time, the crucified Jesus and the robbers are depicted in
Figure 2.17 Gem from Giza, II-III centuries.
different poses: His arms are outstretched, while the hands of the robbers are behind their backs. It has recently been suggested that the miniatures of the Gospel were created earlier and then simply bound into it. Jesus is dressed in a long colobium; the hands and feet are nailed to the cross, and the hands of Jesus, as well as those of the two robbers, are spread out at right angles [livejournal website, 2020]. In Europe, the sensuous depiction of religious subjects, full-color painting, was widespread. An example is the work "Holy Family" by El Greco (Figure 2.19), an artist who absorbed the school of Venetian Renaissance painting. By the tenth to eleventh century, ancient Russian artists were already following the Byzantine iconographic Canon. An icon is not a picture; it is an object of prayer, a sacred object (Figure 2.20). The location of objects in the
Figure 2.18 (a) a drawing from the London Oriental Manuscript, circa 600, (b) the Crucifixion from the Gospel of Rabbula, circa 586.
Figure 2.19 El Greco's "Holy Family". Compositional analysis.
icon is extremely important. The same board can display events from different times. Icon compositions are characterized by an almost geometric clarity that neither frescoes nor miniatures possess. The compositional structure is associated with a certain position of the figure on the plane of the icon, a certain gesture, and the Christian attributes in the hands of the image:
Figure 2.20 “Do Not Weep for Me, Mother” (Mary at the tomb of the Christ. Symbolic Sunday) and “Saint Spyridon, Bishop of Trimythous” (revered as a wonder-worker, previously in Russia his image was more widespread than the image of St. Nicholas).
the Cross (Martyr) or the Gospel (Bishop). The image of a round halo (in the early Christian period secular characters were often depicted with square halos) strongly distinguished the Christian image from the secular portrait (Figure 2.20). When analyzing icons, special attention is paid to the coloristic solution of the entire work, the manner of painting, the nature of the lines, and the features of the drawing as expressed in the proportions of figures and facial features. These did not depend on the Canon so much as they conveyed the stylistic features of the time when the icon was created and reflected the author's personality. Icon painters were not copyists: the perspective in ancient Russian painting assumed several points of view, emphasizing and accentuating the main element. The central character of the composition is marked by its central location and by the fact that all the minor figures refer to it. The size of a character depends on its position in the hierarchy of Christian saints, and each color of the icon bears a symbolic meaning [lihachev website, 2020].
2.7 Associative Mechanism of Analysis
In the history of culture and art there are a number of outstanding artists, musicians and scientists who had the ability to
create intersensory associations, the so-called synesthetes. The interpretation of the most common audio-visual synesthesias has been studied since the time of Aristotle and the Pythagorean school, where the colors of the spectrum were equated with the seven musical tones. The design of the color harpsichord was invented in the seventeenth century by the monk L. Castel; A. N. Scriabin realized this idea in the form of color music in the early twentieth century. In experiments on the surface of a vibrating plate, the German physicist and acoustician Ernst Chladni demonstrated the relationship between sound and visible form and showed, with the Chladni figures, how music can be visualized [soundtimes website, 2020]. The composer N. A. Rimsky-Korsakov was called "the greatest landscape painter in music" [Teplov, 2003]. The writer V. Nabokov was able, within the framework of synesthesia, to translate impressions from words into color and sound [Migunov, 2020]. The German psychologist G. Fechner, in his "Vorschule der Ästhetik" (1876), identified six principles; one of them deserves particular attention here, the principle of "aesthetic associations", which establishes the significance of the associative factor in aesthetic perception. According to Fechner, the associative factor creates a subjective meaning and specific expression that embody a certain aesthetic content in the object of perception [Loseva, 2018]. Worth mentioning are Wassily Kandinsky, the founder of abstract painting, and M. Čiurlionis, an outstanding Lithuanian cultural figure of the Renaissance type. As part of his philosophical and aesthetic program, described in "Concerning the Spiritual in Art" (1911) and "Point and Line to Plane" (1926), Kandinsky sought the basic primary elements that underlie any art. He continued the tradition of identifying and recording such elements in the form of a "vocabulary of aesthetic forms", a tradition laid down in nineteenth-century France by the artist P. Signac and the scientist Charles Henry [Kandinsky, 1992]. The stage composition "Yellow Sound" (1914, never shown to the public), in which the artist realized the idea of counterpoint between visual art and music in an abstract setting, was intended to show the full force of the tension between the consonance and the contradiction of visible and audible images. Color and line were freed from the static picture plane by moving all the elements in the space of the stage (Figure 2.21) in the form of complex structures, compositional solutions, rhythms, unusual forms and energy content [tsaritsyno website, 2020].
Figure 2.21 Wassily Kandinsky, Yellow-Red-Blue, 1925. Oil on canvas, 127 x 200 cm [Kandinsky, 2005].
While working at the Institute of Art Culture in 1920, Wassily Kandinsky distributed among artists a questionnaire with the question: "Do you think it is possible to express any of your feelings graphically, i.e., in a straight or curved line... how do you imagine, for example, a triangle - does it not seem to you that it moves, does it not seem to you more witty than a square; does the feeling of a triangle resemble the feeling of a lemon, what is more like the singing of a canary - a triangle or a circle, what geometric shape is similar to philistinism, talent, good weather, etc." [Migunov, 2020]. Mikalojus Čiurlionis was a gifted writer, composer and artist, and within one short period he created unique paintings and musical works bearing the same names. His synesthetic abilities are recognized as a systemic property of "nonverbal artistic thinking, defined by the presence of intermodal associations in it", which are formed by "the energy of non-objective images in music and related arts" (Figure 2.22) [Loseva, 2018]. As a composer, Čiurlionis wrote his pictorial sonatas, fugues and symphonies on the basis of the constructive and formal features of musical structure. He used the rhythm of the linear-spatial formulas by which the melodies of many preludes develop, rendering it painterly and plastically on the picture plane. Čiurlionis' works are unique examples, in the history of all European culture, of the synthesis of music and painting [Loseva, 2018].
Mikalojus Čiurlionis, Allegro (Sonata of the Stars), 1903
Mikalojus Čiurlionis, Andante (Sonata of the Stars), 1903
Mikalojus Čiurlionis, Funeral Symphony (VI), 1903
Figure 2.22 Associative works of M. Čiurlionis 1903 [ciurlionis website, 2020].
2.8 Conclusions
This chapter defined the concept of the artistic image and traced the peculiarities of its appearance and creation on the basis of graphic drawing. A classification of the features used in compositional analysis, based on shape, location and color, was carried out. Using the author's own work as an example, various ways of fragmenting an image were considered, with a single style identified for each part of the image. The features of the compositional schemes of architectural scenes and landscapes were examined, and some canons of iconography were given. The associative mechanism of analysis, based on the synthesis of several areas of art, was studied. The classification of these compositional schemes can be applied to the recognition of works of art and the determination of authorship and style.
References
Academic. Dictionaries and encyclopedias. Image. URL: https://dic.academic.ru/dic.nsf/enk_philosophy/4605/Образ (accessed on 10.05.2020).
Ancient images of the Crucifixion. https://nad-suetoi.livejournal.com/56759.html?thread=1399735 (accessed on 02.05.2020)
Artistic image. Academic/philosophical dictionaries. URL: https://dic.academic.ru/dic.nsf/enc_philosophy/4956 (accessed on 10.05.2020)
Chekantsev P. A. Reflections on composition in the visual arts //The Teacher of the XXI Century. No. 4. 2015. P. 146-152.
Chichvarin N. V. Image recognition //Material from the Bauman National Library [Electronic resource]. URL: https://ru.bmstu.wiki (accessed on 16.05.2020)
Čiurlionis M. K. The website http://ciurlionis.eu/ru/ (accessed on 11.05.2020)
Condorovici R. G., Florea C., Vertan C. Automatically classifying paintings with perceptual inspired descriptors //J. Vis. Commun. Image R. 26. 2015. P. 222-230. http://dx.doi.org/10.1016/j.jvcir.2014.11.016
Favorsky V. A. About the drawing. About the composition. Frunze, 1966. 77 p.
Gerchuk Yu. Ya. Fundamentals of artistic grammar: the language and meaning of fine art. Teaching guide. Moscow: RIP-Holding, 2016. 293 p.
Hegel G. W. F. Aesthetics. Trans. B. G. Stolpner, ed. M. Livshits. In 4 volumes. Moscow: Iskusstvo, 1971. Vol. 3. 384 p.
I see sound, I hear color. The website https://soundtimes.ru/muzykalnaya-shkatulka/o-muzyke/ya-vizhu-zvuk-ya-slyshu-tsvet (accessed on 07.05.2020)
Kandinsky W. W. About the spiritual in art. Moscow, 1992. 208 p.
Kandinsky W. W. Point and Line to Plane. Saint Petersburg: Azbuka-klassika, 2005. 232 p.
Landscape architecture and green construction. The website Totalarch http://landscape.totalarch.com/ (accessed on 14.05.2020)
Losev A. F. The problem of a symbol, and the realistic art. 2nd ed., rev. Moscow: Iskusstvo, 1995. 320 p.
Losev A. F. The symbol problem and realistic art [Electronic resource]. URL: http://www.odinblago.ru/filosofiya/losev/losev_problema_sim/8 (accessed on 12.05.2020)
Loseva S. N. Synaestheticism of musical painting by M. Čiurlionis //Bulletin of Kemerovo State University of Culture and Arts. 44. 2018. P. 94-98. ISSN 2078-1768. https://cyberleninka.ru/article/n/sinestetichnost-muzykalnoy-zhivopisi-m-chyurlyonisa (accessed on 08.05.2020)
Loseva S. N. Synesthesia as an interdisciplinary phenomenon //Bulletin of Kemerovo State University of Culture and Arts. 44. 2018. P. 91-94. ISSN 2078-1768. https://cyberleninka.ru/article/n/sinesteziya-kak-mezhdistsiplinarnyy-fenomen (accessed on 10.05.2020)
Migunov A. C. From synesthesia to synthesis of arts. URL: http://www.kandinsky-art.ru/library/mnogogranniy-mir-kandinskogo13.html (accessed on 04.05.2020)
Miller A. A. History of art of all times and peoples. Book 1. Primitive art. Directmedia, 2014. 62 p.
Milonov Yu. K. Technical bases of architectural forms of Ancient Greece. Vol. 1, Bk 2. Moscow: Publishing house of the All-Union Academy of Architecture, 1936. 476 p.
Originality of the composition of old Russian icons [Electronic resource]. URL: https://www.lihachev.ru/pic/site/files/fulltext/hudojestv_nasledie/022.pdf (accessed on 10.05.2020)
Philosophical encyclopedia [Electronic resource]. URL: http://www.ciclopedia.ru/100/213/2687031.html (accessed on 15.05.2020)
Teplov B. M. Psychology of musical abilities. Moscow: Nauka, 2003. 379 p.
Volkov N. N. Composition in painting. Moscow: Iskusstvo, 1978. 383 p.
Wang Z. et al. SAS: Painting Detection and Recognition via Smart Art System with Mobile Devices //IEEE Access, 2019, V.7, P. 135563-135572. doi:10.1109/Access.2019.2941239
Yellow sound. The website http://tsaritsyno-museum.ru/events/exhibitions/p/zhyoltyj-zvuk/ (accessed on 10.05.2020)
3 Sensory and Project Images in the Design Practice Anna A. Kuleshova Head of the Department of Design, Southern Federal University, Academy of Architecture and Arts, Rostov-on-Don, Russia
Abstract
The image is the basis of any work of art. This study considers the nature of the manifestation of artistic images, the mechanism of sensory perception, and the creation of image-bearing solutions in the designer's project activity. A hierarchy of sensory images is presented and the stages of project-image formation are distinguished. Historical examples of the symbolization of archetypal prototypes are analyzed, and the images arising in the designer's consciousness in the course of projecting are systematized. The relationship between the elements of the project-image structure is introduced, and the process of creating new forms, environmental project images, costume style solutions and conceptual ideas for presenting visual data, is illustrated. The image concentrates the designer's consciousness and forms the inner world of the personality. The language of images appears in copies, associations, prints, stamps and symbols that convey visual information without a word. It is highly symbolic, polysemic and contradictory; it rests on archetypal symbols and is often connected with ethnic culture and with traditions developed in the different genres of art. Keywords: Project image, visual language, image symbolics, conception, symbol, communications, archetype, semantics
3.1 Sensory Image Nature
Design occupies an intermediate position between high art, the social consciousness of people, technology and aesthetics.
Email: [email protected]
The overall domination of the visual image today is manifested not only in the traditional kinds of fine art, theater, cinematography, fashion, etc. Visual perception is actively engaged in the new art practices of design as well as in television, video and media forms. The achievements and instruments of the visual arts are in high demand in the spheres of communication, advertisement and design. In the course of designing, the projected form of a product always reflects definite contexts and content. Images on television and on stage are becoming the symbol of the epoch of computer and digital technologies, and unconditional influence on the viewer is their main means of expression [Kuleshova et al., 2019]. The objective of this research is to study the nature of images and the mechanisms of their perception, interaction and reconstruction in the course of the designer's project activity. One of its main tasks is a comparative historical analysis of the establishment of artistic and project images as elements of the aesthetic environment, in the context of the development of modern technologies and the extension of the living space of a hypermodern system. The methods of iconology as well as structural and semiological analysis are used to study the artistic practice of creating images. Current methods of image identification in the academic fields are based on the regularities of image building, conversion and movement. The various schemes, formulae and equations are cognitive images formed as a result of abstract logical reasoning; the notions relating to abstract images determine the spatial, temporal and dynamic structures of objects. Design produces many symbols, which become the basis of an image language expressing the sense of form. Designers provide a personified perception of reality in which images are produced as ideas and reproduced in material form as new kinds of domestic items. Scientific hypotheses and theories are complicated cognitive images. Besides the presentation of complex processes through realized images, there are problems of understanding interrelations that arise in the course of designing sensory and project images. Affective (emotional) regulation, which has four levels, is the basis of all mental activity. Instead of dealing with information in a completely logical and conscious verbal manner, the human brain uses several levels of perception in parallel, which supplement one another in the decision-making process. Emotions are endured simultaneously by body and mind; oftentimes they are induced by means of subsymbolic stimuli such as color, form, gesture or music [Melcer et al., 2015]. A specific visual language arose in contemporary design on the basis of visual art. One of the peculiarities of the current stage of cultural
development consists in the integration and assimilation of fashion with such kinds of art as architecture, painting, music, theater and cinematography, in the form of new ways of demonstrating the products of the author's creative thinking: installations, pop art, environments, performance, etc. [Nagorskikh, 2008]. The informative content of the image is becoming an important subject of research, because today the relevant objective of the utilitarian product is not the object itself but the image with a specified purpose. Nowadays design is oriented toward stimulating customers to buy a great number of products with animal magnetism rather than products with purely functional objectives, and this is becoming the basis of design. Consumers are motivated to buy the goods that they like. To achieve this goal, designers bring many different methods into the process of designing to inspire design thinking; the image (mood) board is one of the methods typically used by designers. The sensory, image-bearing aspects of design are also studied in architecture as representing the novelty of advanced technologies. However, considering the example of ornament, the designer Robert Levit argues that the principle of modern architecture features the symbolic aspect to a greater degree than the sensory and functional one [Liu Peng-Jyun et al., 2018]. Contemporary space suggests new methods of projecting that are oriented toward the feelings. The main design principles are the representation, visualization and modeling of spaces as well as of their boundaries and boundary conditions. Examples from design research and pedagogy continue the debate on sensual culture and the exclusively visual practices of perceiving the world in which humans live and for which the environment is designed; the reality of space has tactile, thermal and olfactory properties [Passe, 2009]. Aesthetics may be considered as the variation between two emotional stresses: judgments and subjective sensory reactions to material goods, either real or imaginary. Aesthetic emotional stresses are universal, personified and sensory qualities of humans in the world of infinite being. Knowledge is first acquired as a continuous flow of sensual impressions [Warren, 2008]. The question of the use of visual mental images, and of possible interactions between mental and external sensory representations, leads to the development of new methods and strategies in product design [Marian, Blackwell, 1997]. The image in art is a vivid, impressive emotion, a dramatically sincere and at the same time tension-filled reproduction of the most
diverse events of human life. One can point to the scale of compositional means: rhythm, sound, atmosphere, paints, intonation, graphics, light, lines, plastique, proportionality, range, mise-en-scenes, mimics, montage, camera angle, etc., which suggests the tangible embodiment of the perceived objects and their special nature. There are different types of scales (nominal, ordinal, metric), the application of which allows one to build verbal models of social stratification and processes [Tkach, Khomenko, 2012]. The image sets the energetics of the artistic work. The success and popularity of design objects also depend on the properties of the project-image solution. The creative process of producing new items of the surrounding world is carried out in terms of such notions as public face and standing, which are the forms in which beauty and harmony appear. Project ideas are at their core images, which inspire at the beginning of the artistic design and, being the heart of any work of art, line up the whole natural history by the end of the project. Any image is an inner world captured by the main stream of consciousness. It possesses at least four pairs of opposite forms:
1) emotional – rational;
2) fantasy – object – fact;
3) picture – not picture;
4) whole – part.
Such main types of perception as the perception of depth, of the direction and speed of movement, and of time and space may be improved through training of the sensory organs. Influence on the sensory organs results in the reflection of objects of the surrounding world together with their properties and parts. Representations, knowledge and sensations are the heart of perception; they form its structure. Thinking, together with the other psychological processes of the personality, is the essence of the perception mechanism. Perception concerns what is available in the visible surrounding environment: with speech, when it is possible to perceive the observed matter by eye and name the perceived image; with the feelings, as a determinate relation to the thing that is perceived; or voluntarily, as a form of deliberate organization of the perception process. All sensations are included in the structure of a more complex cognitive process, i.e., perception. Perception is a reflection of entities and phenomena as wholes, the net effect of their action on the sensory organs. Sensations and perception are inextricably connected with one another; all properties of sensations and perception are identical [Abbasov, 2019].
Perception always unites a complex of feelings with past experience. The depth and completeness of perception depend on human experience; moreover, experience ensures faster perception of external information. The fullness and depth of sensations, together with life experience, form the acuity of perception of information about the surrounding world. The regularities of the external influences causing psychic reflection determine the special properties of perception:
1. activity, i.e., the detection of perceived objects;
2. meaningfulness, i.e., the reproduction of a holistic view of an entity on the basis of memory and personal experience by means of a single known pattern: a smell, a sound;
3. generality, i.e., the unified perception of primary and secondary patterns;
4. unity, i.e., the perception of the main object in the focus of interest against a background, when the entity is accentuated by means of dynamics, contrasts and originality of form;
5. selectiveness, i.e., the selection of certain objects of interest according to human factors, preferences and needs, to see only what is sought.
Image perception is manifested on two levels: the sensory-emotional level, and the abstract level, which is built on logical operations and corresponds to analytical perception. Computer technologies introduce considerable changes into human consciousness and needs, and give incredible opportunities for mass influence on perception. New ideas, forms, senses, images, technologies and means at the boundaries of art, engineering and science are sought in the current fashion shows of the famous fashion houses [Kuleshova, Ivanov, 2015]. Such processes as associations, fantasies, elements of guesswork, intuition and the unconscious, as well as knowledge and the search for the unity of form and content, are considered when analyzing the regularities of a designer's mental processes during creative work. Thinking in the creative process is formed as image thinking and then becomes imaginative-associative, project, heuristic and creative thinking, which to this date is considered the highest in the hierarchy of types of consciousness. The representation mechanism is the basis of image mental activity. It performs the following functions: creation, formation, support, transmission, manipulation, modification and reflection of reality. This forms a
coherent mental picture from separate elements. The scheme of imaginative thinking is shown in Figure 3.1. The types of imaginative thinking embrace obvious, visible, spatial and associative thinking; they are shown in Figure 3.2. Imaginative thinking is opposed to imageless thinking. The functions of sensations are:
1) the perception of complex cognitive processes;
2) the conversion of the energy of external action into an act of creation;
3) the provision of the sensual base of activity and of the sensory "material" from which psychic images are formed.
Imagination is a psychical process of reflecting reality and creating new images for the future by processing initial representations. Imagination extends the boundaries of the perceptible, lifts the veil of the future and reconstructs history. Imagination
Figure 3.1 Scheme of imaginative thinking. (Diagram elements: the imaginative thinking, the insight method, the reflection of reality, the imagination mechanism.)
Figure 3.2 Types of imaginative thinking. (Diagram elements: obvious, visible, spatial, associative.)
The imagination modifies the personal views of the man. It is possible to point out the following types of imagination:
• unconscious imagination, which forms new images unintentionally, with no apparent causes, as in dreams;
• conscious imagination, which actively and deliberately creates new images through an effort of will;
• waking dream, i.e., an ephemeral image contextured from desires and hopes for the future, which is often unachievable and ideal;
• creative imagination, i.e., the creation of new images that express personal views and qualities.
The stages of creative imagination are:
• insight, the birth of the idea;
• consideration, the "thinking over" of the idea.
Character images are created on the basis of interfusion and embodiment carried out by means of such resources as word, rhythm, sound intonation, picture, color, light and shadow, linear relations, plastique, proportionality, range, mise-en-scene, mimics, montage, close-up, camera angle, etc. The artistic image is an inseparable, interpenetrating unity of the objective and subjective, logical and sensual, rational and emotional, mediated and unmediated, abstract and specific, general and individual, necessary and casual, internal and external, whole and part, entity and phenomenon, content and form [Strelchuk, 2011]. Understatement is one of the aspects of image ambiguity. The percipient receives a reference pulse for thinking: he is given an emotional state and a program for processing the received information, yet he retains free will and scope for the imagination [Pronina, 2015]. The function of images is to create a world of values, which vary with the aesthetics of different historical epochs over the course of time. The artistic image acts as sign and symbol and may form a particular language that carries information on values. Image stylistics allows any piece of art to be made actual, individual and memorable; it can age the work or rush it off to the future. The image, as a means, inspires the piece of art. The images of perception and reproduction form a system. They may be sensual or conscious by nature: the first pair of images reflects reality in the course of perception, and the second pair reproduces a new reality while modifying real life.
The sensory images of the first pair of perceptions are built on sensations, which unconsciously and subjectively copy the objects around. Some of them perceive reality, while others, the sensory (psychic) images, reflect the inner world of the personality and perceive phenomena through the sensory organs. The second pair consists of the images of objective perception of the surrounding space, as well as project artistic images. Those sensory images which are associated, through language, with the relevant notions about things are conscious. Of special interest is the understanding of the image as a structure of elements and a mechanism of their combination in the course of creating forms that function in a novel way (project images, style solutions, conceptual ideas). The image concentrates consciousness on the main thing and forms the inner world of the personality. Sensations, perception and representations form a hierarchy of images, which embraces four levels: 1) sensations; 2) image perception; 3) representations; 4) the reproduction of sensations. Sensations are the simplest form of image-based sense knowledge.
3.2 Language and Symbolics of Images
The pictorial structural diagram (Figure 3.3) represents the nature of image development from the sensory image to the project one. The language of images, which first appeared in copies, associations, prints, stamps and signs, provides visual information wordlessly. Thinking images exist in the form of notions, which are materialized into sign models interacting verbally with each other. Sense knowledge is acquired in three forms: sensation, perception and reproduction. The simplest form of sensory perception is represented by sensation. The nature of the image consists in understanding it as the mechanism of realization, conversion and functioning of the content and form of the design and project object. Color, form and composition, as elements of the artistic work, are sign and symbolic means, but they are not carriers of personal meaning.
Figure 3.3 The hierarchy of sensory images. (Diagram elements: sensory cognition; sensation (reflection of an individual property of an object); perception (synthesis of individual sensations, a holistic image of the object); representation (visual image of the object without direct contact with the object); synthesis of previous sensory images; memory.)
The acquisition of the language of art in the course of perception is a transformation of the sensory image into logical thought, thought-form, idea or conception. The primary human emotions form various notation systems in different cultures (e.g., "love, comicality, disgust, astonishment, compassion, horror, heroism, fear, peace of mind and amiability" in the ancient Indian culture) [Borev, 1975]. Seven basic emotional types of images are defined in European history: tragic, comic, sublime, heroic, attractive, touching and dramatic. The historical analysis of visual image systems formed on the basis of compassion gives an insight into their influence on the affective tones of project images. It is interesting to consider the experience of the makeup of the Beijing Opera, which represents character images of moral virtues and sins. The symbols of Beijing Opera characters and the explanation of their visual images are presented in Table 3.1. Graphical solutions of present-day identity signs based on the artistic images of Beijing Opera character makeup are shown in Figure 3.4. According to Hegel, the artistic image is a result of "purification" from everything causal and everything that obscures the essence.
Table 3.1 Character images of Beijing Opera characters' makeup (each row in the original is accompanied by a character portrait).

Color | Symbolized qualities | Characters of the novel Three Kingdoms
Red | Loyalty and courage | Hero Guvanyu
Black | Virtue, singleness, honesty | The impartial and incorruptible judge Baozheng
White | Guile, ingratitude, wickedness (sometimes the humor or kindness of a stupid person) | Caocao
Blue | Hardness and durability | Xiahou Dun
Green | Perseverance and nobility | Emperor Tai Dzu
Yellow | Irascibility | Dzi Lao
Gray | Dashing (for aged characters) | Dzhao Gao
Purple | Wisdom and courage | Yui Di
Gold | Divine | Zhu Lai
Figure 3.4 Design image based on the symbols.
The result of this "idealization" characterizes the special method by which art reflects and transforms reality [Hegel, 1968]. As a category of design art, the image is a way to express individuality and the perception of separate phenomena of the surrounding reality in the course of creating a new functional form from the perspective of the author's individuality.
According to philosophy and art history, the artistic image originates skillfully built messages, i.e., informational messages which reflect the author's reaction to the real outside world and the depth of the author's personal understanding of universal laws, projected into new senses and symbols. The system analysis of sensory images in design determines the basis of the project image, which may be represented by a conventional scale running from such positive emotional empathy as love, attraction, tenderness, merriment and respect down to negative tones such as rudeness, grief, annoyance, fear and fury. The emotional qualities intercommunicate, and at times a single stress comprises a whole flow of emotions and feelings; therefore, the image is always complex. The emotional perception of the visual image may pass information without a word; that is why it is the creative part of the project image in design. The hieratic symbols, which have been preserved for over a thousand years practically without changes, may be singled out as the most persistent in the series of form images of material culture. The symbols are passed down through generations and reproduced in a state of nature even when the forms of clothes, customs and traditions have changed. The symbols are produced by the human psyche and form a sign system by means of the absolute unconscious. The experience of generations summarized the images of hieratic people's views and detected universal artistic archetypal images. Archaic signs, graphemes, ideographs and runes, as well as geometrical, zoo- and anthropomorphic symbols, formed the sign system of mythological and religious views. The decryption of the sense of the ancient images lies at the core of their religious worldview. The main archetypical images are reflected in the ornamental art of the nations. They feature explicit connections and denote the unity of the Human and the Universe through such archetypical images as:
–– the cosmic egg, a symbol of the starry sky in which the eggshell surrounds the Earth; inside the egg there is a source of life and of the Universe;
–– the triune world structure (sky, ground and underworld) and the image of the Tree of Life as the symbol of past, present and progressing future;
–– the Sun as a symbol of the source of warmth, light and life;
–– solar signs (the circle is a symbol of a female sky divinity);
–– the cross as a symbol of the "burning" Sun;
–– the geometrical marks of the Earth, such as the triangle, square and diamond, which possessed a magical meaning (the diamond is a symbol of a male sky divinity);
–– the Earth image as an image of the Mother, the hearth, labour, rites and traditions.
The archetypical images are associated with the Human, with the world of animals and birds, with mythological creatures, etc. The serpent is a permanent image of the depths of the ground and of the underworld (hell) [Gorodetskaya, 2004]. The zoomorphic ornament includes both real images (bird, horse, deer, bull, etc.) and mythological ones, which came from the legends and myths of the nations (Sirins, Centauri, unicorns, dragons, mermaids, griffons, etc.). The animals were idolized and turned into a cult. The archetypical images were preserved in the ornament motives of the decorative and applied arts, in the décor and interior decoration of the house, and elsewhere. The Russian northern hut (izba), traditional garments and the painted spinning wheel reflect the world perception of the ancient man. The primary images are turned into symbols in cultural forms, becoming harmonious in form and global in content. Due to their graphic form, the archetypes take shape in consciousness as archetypical images (e.g., tamed fire, chaos, creation, the marriage of Yang and Yin, the alternation of generations, "the Golden Age", etc.). The set of images forms the system of basic image symbols of the environment, which includes such elements as the images of article, face, character and costume. The costume in the system of basic images is a carrier of coded information on the age, sex, ethnicity, profession, status, historic period of origination and geographical residence of the owner. The basic images of the costume are represented by a heavy, integral design with a certain established set of major costume characteristics (silhouette, proportions, divisions, rhythm, color). The set of such characteristics is recognized; it conjures up the specific basic image. The environment image is created by a language which possesses its own peculiarities (clarity of form, conditionality of figures and ornamentality). The wavy horizontal line in Slavonic symbolics denotes water; the wavy vertical line, rain. Seeds thrown into the broken ground were drawn in the form of points; the seed points may be seen in the squares and diamonds of the ornament. The Earth was represented in the form of diamonds, quadrangles and squares. Fertility was associated with the image of the mother earth. The woman figure signified divinity, the bearer of humankind and the "mother of a heavy crop".
The woman figure, as well as the tree, birds, animals and solar signs, is associated with the fertility symbols. The Tree of Life is a symbol of development and endless renewal on the Earth. The image of a horse is one of the most ancient and beloved images of popular art, a personification of the sun moving across the sky. The image of a bird is associated with the coming of spring and renewal. Examples of real images turning into decorative symbols by means of imagination abound in all branches of art. The most illustrative mechanism of the reconstruction of images by means of imagination is represented in the traditional industrial arts and the decorative and applied arts. The imagination is the primary element of the designer's creativity for expressing the image of the projected result. As an instrument for solving problem situations of an uncertain nature, the imagination is a means that unites views into a single image.
3.3 Methods of Image Production in Ideas
Agglutination is an elementary method of creating new images through the "bonding" of several parts that are independent at first view and have seemingly unjoinable properties. Examples include such fairy-tale images and characters as mermaids, the hut on chicken legs, the Frog Princess, Pegasus, the Minotaur, the dragon Gorynych, etc. Technical inventions such as the accordion (a combination of the piano and the button box) and the quadcopter (a video camera on a flying machine) are also created by means of bonding. Moreover, the hyperbole, which turns an item into a new one by means of paradoxical over- and underestimation of its properties and of the number of its parts, is also a form of transformation (e.g., the multiarmed goddesses of Hindu mythology, dragons with seven heads, etc.). The next method, accentuation, creates the imagined image through the emphatic underlining of features (e.g., friendly jests and unkind caricatures). The schematization method is based on the views from which the image is constructed: discrepancies are smoothed out, traits of similarity come to the fore, and the image is simplified, as in ornament based on vegetation forms. Examples of the historical simplification of real primary images down to the traditional symbols of bird, horse, tree, woman and water, which are widely known motives of the embroideries of the decorative and applied arts, are shown in Figure 3.5. Examples of the use of the agglutination method to create the new form of a modern quadcopter and the fantastic image of the winged mythological horse Pegasus are shown in Figure 3.6.
Figure 3.5 Real images are converted to decorative symbols.
Figure 3.6 The agglutination method in creation and imagination.
The synthesis of views in the imagination carried out by means of typification identifies the essential views repeated in similar facts and embodies them in a particular image. The provenance of the artistic image in the physical world also sets the mechanism for creating new designer forms of the costume through the expression of artistic images. Nowadays the requests and needs of people are constantly changing; that is why design produces new images, which in their turn are changed under the influence of fashion trends and social events. The fundamental principle of the artistic image is represented by the archetypes in art, and by function and invention in design. In the course of the artistic projecting of clothes design pieces, the conceptual search for solutions often uses the basic images as general-purpose integral systems or "mixes" several images to give rise to new ideas. The mechanisms of affecting and stimulating particular perception and behavior by means of projected visual images are actively used in current advertisement and communications. The emotional projections of live-streamed images subordinate the development of other perception aspects. The system of image building embraces agglutination, hyperbole, accentuation, schematization and typification. The identification of the patterns of artistic image development in design covers its principles, tendencies and rhythm of interchange. The understanding of the project image in design is built on the analysis of style and of imaginative semantic approaches to the creation of objects in the setting of temporal variations. The artistic image forms the impression of beauty. The image includes both the total and the individual, the mass and the original. The image is a more global notion than the sign; in contrast to the sign, the image represents and transmits the physical forms of life.
The key to the transformation of reality is fiction, which develops a new image of creation by means of feeling, intuition, imagination, logic, fantasy and thought exceeding the limits of copying. The visual and verbal reproduction of imagery simultaneously creates a new reality in a great number of artistic image manifestations for the corresponding time and space. The process of image creation features general characteristics, forms of reflection, and emotionally imaginative content. Despite the particular perception of the image in different branches of art, the nature of the image and its mechanism of creation feature the same forms of reflection and reproduction and the same emotional and thinking content. The reflection of architectural objects manifests the image content through the signs peculiar to a certain culture and the associative representations set by the object structure. The artistic image of utilitarian products conveys meaning in figurative form. The ambiguity and understatement of the images of functional entity forms evolve in time, bringing new images and senses. There are two known methods in art which form new images:
1) metonymy, i.e., the artistically associative transmission of information implying the transfer of meaning from the whole to its parts, while making such parts "animated" or "humanized" (e.g., seven-league boots, as boots cannot walk by themselves);
2) metaphor, i.e., the artistic device of creating a vivid impression from entities between which a relation and similarity that do not exist naturally are set, and to which extrinsic possibilities are assigned. For example, it is impossible for water literally to "whet" the stone, yet it can be represented metaphorically. Moreover, it is possible to represent designer ideas for household devices with two or more images becoming one. Current devices may feature the technical image of a certain function and a metaphorical image in which a fly, a jumping animal, a color-shifting plant or another property of the natural form can be guessed.
The artistic image is the result and ideal form of the reflection of entities and phenomena of the physical world. It exceeds the limits of human consciousness while turning into the project solution of the objective-spatial environment. Modeling the external shape of objects of the human life environment with the help of metaphor creates emotionally comfortable conditions. The imaginative associative relations build series of images
and animate the space. The thing is animated through the system of artistic images. Cultural and social pieces shall comply with the typology of artistic image application (functional, social, ideological, culturological, constructive, regional). For example, ideological characteristics and regional specifics are peculiar to the mythological image. Mythology is based on the manifestation of the powers of nature, on their representation in the form of sensory images and special creatures (people, animals, gods). Basic recognizable elements, which ensure the originality, likeness or repetition of the structure, are laid into every image.
3.4 Personality Image Projecting
The field of the designer's creative work comprises tasks on the creation of a stage image or theatrical character, as well as an individual personality image. Up to the present, the types of individual personality images have been formed with the help of examples of stylistic models developed for real time and space. The "trying on of images of the known style icons," as well as the fitting of human physical characteristics and of the manner of moving, speaking and dressing, became the meaning of many visual personalized images as design objects of the 20th and 21st centuries. In the course of digitalization, augmented reality with visual images created for a variety of reasons and motives begins to be increasingly in demand. For example, when teenagers decide to commit to a certain subculture group, they begin to accept themselves in new ways and need to change their own image. Designers often call upon the images of their youth (e.g., Gosha Rubchinsky praised the aesthetics of the 1990s). Such a designer is unaffected by accusations of applying a primitive street style, because personal history and the emotions of youth are always behind the simple silhouettes [Kokuashvili, Bakina, 2017]. Another example is represented by stage images. It is possible to point out several recognizable basic elements in the image structure of Beijing Opera actor characters: moustaches, beard, eyes, specifically applied marks of makeup, etc. A Beijing Opera character with the elements of traditional makeup necessary to create the stage image (character) is shown in Figure 3.7. The makeup of the Beijing Opera is a source of inspiration and of borrowing certain senses, ideas and meanings in the course of creating costume form images and graphical objects.
Figure 3.7 The image elements of the makeup for Beijing Opera characters.
The communication method suggesting the transition to a special language of communication by means of face-painting symbols is an active means of advertising technology and an opportunity to extend the assortment of unique graphical products. The immortal images of mythological themes in art were often used for protection from evil; moreover, they inspired the creation of new pieces of art (e.g., they attracted the Symbolist painters of the late 19th century). The anthropomorphic image of items features a structure and parameters consistent with the man and his capabilities. The ideological image determined the social status of the item and its characteristics. The zoomorphic and floromorphic forms of items express the images of the region and of its plant and animal world. Functional and constructive images belong to items which reflect social events and the characteristics of the economic order. The culturological image includes references to the character types of the past. The content of the artistic image has sense characteristics (semantics, as well as qualitative and quantitative characteristics with distinguished elements, properties and relations). Two types of semantic relation determine the sensory and thinking images. The understanding of senses, or the cultural communication between the object and the consumer, is carried out by means of images.
This is the sense of the designer's approach to projecting. The images differ according to the area of project activity. The translation of visual images into the language of word signs and social values within the semantic meaning of clothes, advertisement, photography and cinema is united under the notion "image", which was most commonly used by R. Bart to denote such socialized communicative objects as the visual image functioning in the process of communication [Bart, 2003]. The images of environment, costume and communicative forms feature peculiarities both at the stage of perception and at all further stages of imagination, representations and thought-forms. Nowadays many consumers have an opportunity to follow the tendencies in the area of design, and they look for their own style and way to express themselves. The specific character of the image also differs in its visual forms at the stage of conversion. However, the structure of the artistic image is built according to unified principles and corresponds to the creativity source. It is possible to distinguish the following in the artistic image:
–– event images;
–– circumstance images;
–– conflict images;
–– detail images.
3.5 Project Image
Converse relations between the images perceived by the designer and the image displayed by the reproduced object are noticeable in the structure of the project image. The mechanism of human psychic transformation processes external influences and uses their substitutes in the form of models and things. The information carrying the functional meaning corresponds to the structure of the project models and stands out, for the designer, in the form of an image. The notion of "ideal image" relates to the functional property of the designer's activity, in spite of the idealistic idea about the nature of the image as spiritual substance. Ideality as the property of existing as an idea, together with unity and meaningfulness, is among the main characteristics of the project image. The study of the material conditions and mechanisms of the manifestation of feelings allows an understanding of the forms and stages of the formation of project images in the field of fashion and costume. Considering fashion as an idea about the popularity of clothing styles, it is possible to point out conceptual thinking and predictive ideas in addition to mathematical models and scientific approaches.
Many ideas are associated with fashion as a conception and as an image of preference and understanding of the popularity of a thing or of a set of differential characteristics (style). Since its origination, the costume has performed not only utilitarian and aesthetic functions. The main function of the costume as an information carrier consists in the sign semantic complex identifying individuals within the social hierarchy. For example, Riccardo Tisci appreciated the couture tradition of the Givenchy Fashion House. He successfully gave new life to the images of the Fashion House while simultaneously emphasizing the personal qualities of the designer, rethinking the designer's individuality and trying to underline the new character of the house. The original designer individuality of the Givenchy Fashion House is based on timeless and contemporary romanticism. Style is manifested as a system of recognizable and enduring images commonly used in the environment, formed under the influence of associative imaginative thinking. The images, as primary elements of style, fill the models with definite aesthetic ideas and feelings on the basis of the image associative relations of the designer's creative thinking. It is precisely associative imaginative thinking that determines the human capability to create the new and to generate new ideas and concepts. The place of the image in the structure of project activities is shown in Figure 3.8. The concept is a fusion of the image and the method, multiplied by novelty. The image originates in the project sketches, and the method is a conclusion from logically relevant texts and principles.
Figure 3.8 The image in the project activities structure. (Diagram elements: idea; art; science; image; method; concepts.)
The image is a result of reflection both in the process of perception and in creativity. The image of thoughts regarding knowledge is determined in different ways: worldview, world outlook, views, world perception, view of life and outlook on life. In artistic activities the image is defined as the universal category of creative work, the form of world understanding realized by means of ideas and objects possessing aesthetic value. In sensual presentation the image is represented by feelings, perception and views. It is always associated with such concrete terms as "statement", "message", "knowledge" and "information". Within abstract thinking the image consists of the objective (real) and the subjective (ideal). The project image in design features functional and emotional content. The functional content is set by scientific knowledge and advanced technologies. The emotional content of the project image is found in the synthesis of the designer's personal traits as well as spiritual, psychological and social factors. The project image in the environment system is formed in several ways. The image associative method models the artistic image-making element. The method of compositional formation of the product creates the image by means of artistic expressions (metaphoricity, associativity, paradoxality, etc.). The meaning-making method forms two parameters of the project image, idea and form, resulting from the harmonic organization of the artistic expressions [Tereshchenko, 2016]. The project image contains three main parts: objective reality, the subjective world of the designer, and the life experience of the consumer to whom the message is addressed. The image in design is responsible for the sign function of the project product. The project image features objective, semantic and emotionally expressive meaning. The imaginative content of projected design objects is predetermined by the human commitment to an ideal world. Contemporary designers use the following methods to create imaginative models in the process of projecting:
–– the borrowing image;
–– the analogy image;
–– the association image;
–– the citation image;
–– the stylization image [Tereshchenko, 2016].
The advertising image today is not only a form of presenting information on the product and a device for promoting products and ideas, but also an opportunity to broadcast the myths, norms, values and images of contemporary culture.
Advertisement has a great influence on public consciousness, as it creates the image among crowds of people. The structure of the artistic image of a cultural and social design object is determined by the sense of its substantial part. The images system, in which signs and symbols stand out as environment-forming factors creating the aesthetic environment, becomes the basis of the plan, the conception of sketch searches, and the final form of products. The images embedded in signs, items and archetypical codes, which determine the semantic meaning of the artistic image in design, define the law of development of the expression forms of the artistic image. The different emotionally ideological forms of the image's existence are critical in the process of projecting, when the various levels of consciousness (feelings, intuition, imagination, logic, fantasy, thoughts and reality) are active. The historical context of cooperation relating to the development of style and certain types of artistic image gives a boost to the arising of the construction image in utilitarian products and to the creation of a comfortable environment with a functional image. The imaginative content of design products forms the environment which determines the human being and reflects the emotions peculiar to a specific time. The emotional character of design products is given by the industrial, living and climatic conditions as well as by the perception of surrounding people. The environment image is constantly changing in the setting of the development of technologies, equipment, materials, etc. The search for project solutions for the contemporary image of the object environment develops under the influence of a wide range of conceptions and visual metaphors available in design. Information support today promotes the rapid propagation of ideas and technologies and the origination of new images [Chepurova, 2004]. The human as personality-individuality, together with the artistic imaginative form of world exploration, is an indication of the current state of design project culture. The perception of the object through its senses allows the projection of the cultural and social object as a sign and semantic model. A number of style directions were formed on the basis of understanding the imaginative content of historical household goods and impressions, as well as the perception of different components of the environmental system. The structure, as a form of existence of the artistic image, determines the stable relations of the parts with the integral form. The sign-oriented communicative meaningfulness of utilitarian goods possesses specific semantics. An example of image building on the basis of the semantic content of a creative source is shown in Figure 3.9: the specific structure of an electric tower is expressed in the content of a brand-new costume form.
Figure 3.9 An electric tower is a creative source for the image of a brand-new costume.
Nowadays the visual image also dominates in the traditional fine arts, theater, cinematography, fashion, etc. Visual perception is strongly sought after in the new art practices of design, video and media forms. The field of communications, advertisement and design is supported by the latest results relating to the visual arts. Interest in entertainment is one of the main human needs. The fashion industry reproduces the functions of the visual arts, being part of world culture. Fashion products are represented by the model images demonstrated during the presentation of fashion collections. In the course of projecting different design objects (e.g., the costume), the priorities shift, and the term "projecting of a specified-purpose product" is replaced with the term "projecting of a specified-purpose image" that carries informative content. The process of formation of the project image may be represented by means of three stages: perception, formation and reflection. The artistic image characterizes the method of reflecting reality and its transformations. The image, as a category of designer creativity, is a method of expressing the individual perception of separate phenomena of the surrounding reality in the course of creating a new functional form. The pictorial structural diagram (Figure 3.10) represents the specifics of image development, from the sensory image to the project one.
Figure 3.10 Stages of creating a project image. (Diagram elements: stage 1, the stage of perception: the psychological and sensual image, sensation, thought image, desire-dream; stage 2, the formation stage: artistic image, impression-perception, brainchild, imagination; stage 3, the stage of reflection: design image, idea, form, worldview, concept.)
The influence of associative imaginative capabilities on the creative thinking of the designer is thereby determined. Associative imaginative thinking defines the human capability of creating the new as well as generating new ideas. The characteristics of the project image embrace ideality (the capacity to exist as an idea), unity and meaningfulness [Chepurova, 2004]. It is possible to point out four groups of images according to the type of aesthetic ideas:
–– event images;
–– circumstance images;
–– conflict images;
–– detail images.
Design projecting includes three methods of the imaginative approach:
1) artistic modeling, the embodiment of the ideal properties of the product by means of imagination;
2) composition forming, the building of relations between the elements of form to obtain a harmonious whole;
3) meaning-making, providing a thing with content from the perspective of social and cultural positions.
The project image may be characterized by such criteria as ideality (understood as the possibility of existing as an idea), unity and meaningfulness. The image consists of the knowledge and experience of one human as well as the experience of previous generations. The visual picture images and foretypes exist as copies, prints and stamps of some visual information. Moreover, there are more complex, "non-picture" images. They may be divided into two groups: 1) objective images, which exist individually and independently of the human mind (images of the planets, the sun, etc.); 2) psychic images, which reflect and reproduce entities and phenomena by means of the sensory organs and reconstruct new objects from their combinations. Visual thinking is based on the operations of synthesis-analysis, concretization-abstraction and association-dissociation of the artistic images, as well as on the sign-oriented symbolic system of such means as color, form and composition. The elements of the artistic work do not contain personal meanings by themselves, but show their image-bearing expression. The language of art turns the sensory image into logical thought in the course of the selectiveness and purposeful nature of perception [Surina, 2006]. The designer produces the idea, which is individually or collectively turned into the artistic image, in the course of life activities and under the influence of diverse information and events. The idea image is transformed into the form and corresponds to the project. The result of rethinking the Egyptian goffer images into the artistic image of a modern costume is shown in Figure 3.11. To provide the visual message to the audience, the designer hunts for the instruments which carry his main idea and author's intention and create the artistic image of the model with the help of their signs and symbols [Kuleshova et al., 2018]. The turning of the artistic images of Egyptian pharaohs into the modern project image of the costume is shown in Figure 3.11. In the course of design projecting, the principles of Japanese aesthetics play a special role in the project culture of the costume.
Figure 3.11 Stages of image creation for costume design.
The traditional principle of wabi is expressed by the following artistic devices and kinds of beauty: the charm of the simple cut and costume decoration; the purity and modesty of the image and costume presentation; ascetic elegance and conscious primitivism; the poorness, jejunity and obscure quality of the fabrics, surface texture and color choice of the costume. The beauty of imperfection and the asymmetry of silhouettes and forms are represented in the costume collections of the Japanese designer Kawakubo Rei (1977) [Vasilisko, 2016]. The avant-garde projecting conception makes it possible to find promising fabrics for clothes, ways of dealing with them and innovative technologies of formation, as well as to organize the expressive architectonics of the modern costume [Kuleshova, Ivanov, 2015]. According to Jean Baudrillard's conception, fashion is considered as a fairy play of codes and as a universal form in which the different signs are mutually exchanged. Nowadays the propagation of functionally demanded images is actively observed in society. The image may be considered as a ready instrument for the efficient sending of information and messages. Art semiotics is a theory of sign systems. The relations between the important elements used for projecting the artistic image allow the formation of available project instruments for creating the final image of a popular fashion collection. The artistic image, as a system object of projecting, must be constructed according to the set project tasks. The image and aesthetic value serve as the properties which define the designer object with due regard to the work of art. Imagery is commonly achieved through the mixing of oppositions. Here such basic aesthetic categories as the Beautiful and the Ugly come into sight. These categories characterize phenomena from the perspective of their conformance or nonconformance to the aesthetic ideals established in society. The images of the beautiful are expressed in harmony, and the ugly manifests anti-aesthetics. The combination of these opposed conceptions and imagery vivifies the object imagery [Chepurova, 2004]. The photos of world-famous images of the 20th century, representing celebrated personalities (the sex symbol of the epoch of the 1960s Marilyn Monroe, the glorious comedian of the silent movies Charlie Chaplin, and the great mystifier and surrealist painter Salvador Dali), are shown in Figure 3.12.
The artistic stage image is a criterion of art: it realizes and reveals all its peculiarities. The availability of the image distinguishes art from crafts, and the artistic image features the following distinctive characteristics:
–– emotionality (the image must bring out a variety of emotions and feelings);
–– distinctiveness (the image always reflects the peculiarities of the inner world of its creator at the moment of projecting);
–– an open attitude to facts (the image may be free of documental accuracy due to the changeable public sentiment toward the same facts with the passage of time; the painter may depart from certain facts to express a deeper sense).
The structure of development of the stage image includes the two-dimensional construction used for the stage layout and the detection of the artistic (fashion and stage) image. The emotional impact on viewers results from the influence of the stage play plot, the character dialog and the artistic stage image of the characters. At the origin of the theater, the acting on stage and the scenic set of the spectacle were functional. The acting and the different components of the stage business alone cannot create the complex imaginative solution for the stage production as defined.
Figure 3.12 The world-famous images of Charlie Chaplin, Marilyn Monroe and Salvador Dali.
For all their variety and similarity, artistic images differ from each other by the ratio of the sensual base and the rational:
–– the sensual specificity is better reflected in portrait images;
–– the cognitive base is active in the symbolic image;
–– all elements are in harmony in the realistic image.
Computer technologies introduce significant changes into the consciousness and needs of people and open incredible opportunities for mass influence on perception. New ideas, forms, senses, images, technologies and means available at the boundaries of art, engineering and science are sought during the current fashion shows of the famous fashion houses. The artistic image of the costume in design is a harmonic system of the man, the costume and the environment. Such a project image is created in three ways:
1) the manifestation of the idea in the form, texture and color corresponding to a certain theme;
2) the building of a single image of the man and the costume;
3) the creation of the visual environment for fashion shows, which completes the image of the models [Kuleshova et al., 2018].
The system of images begins wherever images build a hierarchy in the course of their interaction. The hierarchy is a multistage chain of command and domination that exists in any sphere of human existence. It may be hard to understand, complicated and controversial. The image system of the artistic work and actions may be lined up in the following order:
–– the total image of the general work;
–– the image of the main character of the spectacle, the film, or the models on the runways;
–– the images of supporting characters in the course of their interaction with the main characters;
–– incidental characters;
–– the system of images of the artistic space and time (the chronotope);
–– detail images of all aspects of the artistic activity;
–– micro-images.
This scheme may be modified and corrected according to the type of the created piece of art. The image of the artistic work relates to its genre.
The genre creates the impression image which arises in the viewer after observing a certain artistic work. This impression may be insufficiently determined, positive or negative. Thus, the artistic image is a system of thoughts and feelings featuring a processed information form. The image of the dramatic arts is an idea in the form of an artistic representation, an embodiment of the semantic processing of learnt information and of the author's aesthetic experience. As a result, it develops a corporate style and character. The stage images are self-sufficient and perceived as really existing objects; at the same time, the subjects turn into role models. The variety of types of artistic images is caused by the internal laws of development and by the material used in the oral, musical, plastic, architectural and other types of art. In the course of projecting, the images differ from each other by the proportion of the sensual and rational aspects:
–– the sensual specificity prevails in the portrait image;
–– the cognitive base dominates in the symbolic image;
–– all elements combine harmonically in the realistic image.
The differences of the artistic images of the synthetic art forms are seen in the nature of their embodiment. Many persuasions, patterns, motives and senses lie within the area of the unconscious. The world around the human is contextured from signs, and the sign system forms persuasions and attitudes. Thousands of messages affect humans and capture the attention of consumers each and every day. Any brand, phenomenon or object may very easily be lost and forgotten in such noise. Therefore it is important today to strive for your own identity, the identity of your corporation and company. The cultural patterns determine the imagery relating to the complex of conscious stereotypes. At the moment, the interaction of the theater, cinema and fashion determines the visualization of the outer constituent of the personality of the modern human. The perception of the surrounding world and the social and creative life of society is reflected in the sign system of the artistic image of the theater, cinematography and fashion, and its process of formation progresses in the same way. The structure of the formation of the artistic image in the synthetic art forms is shown in Table 3.2. The fundamental principle of the image in design is represented by function and invention. The general building structure of the artistic image of a utilitarian object in the environment may be presented as a kind of typological "pyramid" of impressions based on the numerous singular percipiences of the most diverse components of the environmental system [Chepurova, 2004].
Table 3.2 Structure of formation of the artistic image in the synthetic art forms.

Type of dramatic art | Theater | Cinematography | Fashion
Artistic image carrier | Typecast | Character | Model of fashion costume
Historical and fashion space | Stage | Screen | Runway
Analog elements of the structure of different types of space-temporal artistic work | Mise en scene | Shot breakdown | Collection sketch
Analog products in the system of arts, where the image is the basis and its creation is the main artistic task | Spectacle | Film | Fashion show
The ways images exist in the current world are still developing. Visual forms of communication and virtual image existence are becoming pressing today. Buyer images are increasingly used in virtual try-on, and augmented reality embodies the desired images in real examples of purchases. The possibility of an individuality turning into different images, of "image fitting", becomes a near-term perspective being realized in many world cities. It is becoming a reality that images deeply hidden in human consciousness and in the inner psychic and emotional state are increasingly exposed by various technologies and computer programs. Augmented reality and digitalization provide the opportunity to pick out the visual images of the human, to place them in the computer and screen space, and to reach the emotional perception of and response to the real world. The human may simply "try on the images" reflected in the memory mirrors. On the way toward the separation of the image from consciousness, the designer faces a number of problems. However, even today the world of visual forms can communicate both with computer programs and with people.
Augmented reality becomes the instrument that supplements the entities and phenomena of the real world with digital virtual images. This provides an opportunity to visually overlay digital elements directly on the visible, usual and ordinary world. The selected clothes and accessories are united into an "image" for further trying on a formed body model built according to the image of the real prototype. The image in the virtual environment is the impression which the user creates in the minds of other people. It is not the sum of the user's real personal qualities, but a name card created with the help of such means as personification (nicknames, account pictures, user pics, changes of color and font, etc.) and depersonification (style of communication, language). The proper approach to the formation of one's own image is a ticket to successful communication.
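As a purely illustrative aside (not part of the original text), the minimal Python sketch below shows the basic mechanism behind such overlays: a virtual element with a transparency (alpha) channel is composited pixel by pixel onto a frame of the real scene. The file names and placement coordinates are invented for the example, and the garment is assumed to fit within the frame.

```python
import numpy as np
import cv2  # OpenCV; assumed to be installed

# Hypothetical inputs: a camera frame of the real scene, and a virtual
# garment image whose fourth channel encodes per-pixel transparency.
frame = cv2.imread("scene.jpg")                             # real-world frame (BGR)
garment = cv2.imread("garment.png", cv2.IMREAD_UNCHANGED)   # BGRA overlay

x, y = 100, 50                       # invented placement coordinates
h, w = garment.shape[:2]
roi = frame[y:y + h, x:x + w].astype(float)  # region of the frame to augment

alpha = garment[:, :, 3:4].astype(float) / 255.0  # 0 = transparent, 1 = opaque
virtual = garment[:, :, :3].astype(float)

# Alpha compositing: blend the virtual element over the real scene.
blended = alpha * virtual + (1.0 - alpha) * roi
frame[y:y + h, x:x + w] = blended.astype(np.uint8)

cv2.imwrite("augmented.jpg", frame)
```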
3.6 Conclusion
Any human image is created with due regard to such aspects as appearance, mimics, intonation, ways of communication, etc. The appearance, or visual "edits", is everything that tells about the user before he starts to communicate. The mimic image, or graphical "edits", is represented by smiles, animations and changes of font and color; mimics and the look reflect the feelings of the human soul [Yegorova, 2012]. The mental image comprises the worldview, principles, ethical attitudes and social stereotypes of the user. The mental image may include two components: the communicative (the desire and ability to communicate, knowledge of etiquette rules and possession of etiquette skills) and the moral (what is told about oneself and what is really done). The background image is information about the user received from certain sources; the sources may be of two types, far and close [Yegorova, 2012]. The experience of studying the processes by which images are composed into systems of signs and codes, together with the results of the experimental projecting of new models and forms, discovers a number of opportunities for their conversion at the stage of perception. The methods of image reflection and reproduction make it possible to draw the following conclusions:
–– the language of contemporary society's images is a multiform system of historically formed senses, codes and symbols; it is a basis of the designer's creative work, ensuring its development strategy within the framework of community demands;
–– the concept of "image" in design broadens the borders of the relevance of its functions; it is increasingly manifested in various types and ideas of new forms of utilitarian-purpose items used in the inventive area, in the artistic fields of art, in the communicative area and in the field of scientific knowledge;
–– the understanding of the structure of, and the methods of impacting, the mechanism of creating project images forms a systemic insight into the opportunities and perspectives of the further existence and development of images in the virtual environment of augmented reality, in the context of the formation of the aesthetic values and ideals of the new communicative environment.
References
Abbasov I.B. Psychology of Visual Perception. Amazon Digital Services LLC, 2019. 141 p. ASIN: B07MPXKLQ7 [Electronic resource].
Bart R. Sistema mody. Stati po semiotike kultury [Fashion system. Articles on the semiotics of culture] / Translated from French, prolusion and content by S.N. Zenkin. M.: Sabashnikov Publishing House, 2003. 512 p.
Borev Yu.B. Estetika [Aesthetics]. M.: Politizdat, 1975. 399 p.
Chepurova O.B. Khudozhestvennyi obraz v dizain proyektirovanii obiektov kulturno-bytovoi sredy [Artistic image in the design projecting of objects of the cultural and social environment]: abstract of dissertation of art history PhD: 17.00.06 / All-Russian Research Institute of Technical Aesthetics. Moscow, 2004. 26 p.
Gorodetskaya S.V. Arkhetipicheskiye obrazy v ornamentalnom iskusstve narodov mira [The archetypical images in the ornamental art of the nations] // Modern Natural Sciences Advances. 2004. No. 9. P. 31-32. URL: http://natural-sciences.ru/ru/article/view?id=13363
Ivanov D.N., Kuleshova A.A. Eksperiment formoobrazovaniya sovremennogo kostiuma [The experiment of modern costume forming] // International Students Scientific Bulletin. 2015. No. 4. P. 4-8.
Kokuashvili N.B., Bakina M.A. Prichiny populiarnosti postsovetskoi estetiki v sovremennoi mode [Reasons of popularity of the post-Soviet aesthetics in the current fashion] // Fashion and design: historical experience – new technologies: Proceedings of the XXth International Scientific Conference / ed. N.M. Kalashnikova. 2017. P. 345-348.
Kuleshova A.A., Khachaturova Ye.A., Mitrokhina T.A. Aktualniye tendentsii prezentatsii kollektsiy dizainerov kostiuma [The current tendencies of presentation of designers' costume collections] // Terra Humana. Scientific-Theoretical Journal. 2018. No. 3(48). P. 68-74.
Liu Peng-Jyun, Chuang Ming-Chuen. Summarizing the image adjectives for the construction of the picture database for lifestyle // The Sixth International Multi-Conference on Engineering and Technology Innovation 2017. 2018. V. 169. 8 p. https://doi.org/10.1051/matecconf/201816901025
Marian P., Blackwell A. A glimpse of expert programmers' mental imagery // Proceedings of the 7th Workshop on Empirical Studies of Programmers (ESP '97). 1997. P. 109-123. https://doi.org/10.1145/266399.266409
Melcer E., Isbister K. CSEI: The Constructive Sensual Evaluation Instrument. 2015. https://doi.org/10.13140/RG.2.1.2917.9045
Nagorskikh T.N. Evoliutsiya fenomena mody v kontekste kultury postmoderna [Evolution of the fashion phenomenon in the context of post-modern culture] // Materials of SPbSUC. 2008. V. 177. P. 93-97.
Passe U. Designing Sensual Spaces: Integration of Spatial Flows Beyond the Visual // Design Principles and Practices: An International Journal. 2009. V. 3. P. 31-46.
Pronina N.K. Osobennosti khudozhestvennogo obraza kak osnovy kompozitsionnoi tselosnosti izobrazheniya [Peculiarities of building of the artistic image as the basis of compositional unity of the picture] // Bulletin of Omsk State Pedagogical University. Humanities Research. 2015. No. 4(8). P. 84-88.
Strelchuk Ye.N. Rol khudozhestvennoi literatury v formirovanii i razvitii russkoi rechevoi kultury inostrannykh studentov [The role of literature in the forming and improving of the Russian speech culture of international students] // Review of Samara Research Center of the Russian Academy of Sciences. 2011. V. 13. No. 2-5. P. 1135-1139.
Surina L.B. Filosofskiye podhody k ponimaniyu suti tvorcheskogo protsessa [The philosophical approaches to creative process insight] // Human. Sport. Medicine. 2006. No. 16(71). URL: https://cyberleninka.ru/article/n/filosofskie-podhody-k-ponimaniyu-suti-tvorcheskogo-protsessa (access date: 25.05.2020).
Tereshchenko G.F. Evoliutsiya khudozhestvennogo obraza v dizaine interiera [Evolution of the artistic image in the interior design] // Cultural Life of the Russian South. 2016. No. 3. URL: https://cyberleninka.ru/article/n/evolyutsiya-hudozhestvennogo-obraza-v-dizayne-interiera (access date: 25.05.2020).
Tkach A.V., Khomenko Yu.A. Shkalirovaniye v sotsiologii kak metodologicheskaya osnova stratifikatsii [Scaling in sociology as the methodological basis of stratification] // Current Issues of Science and Education. 2012. No. 4. URL: http://www.science-education.ru/104-6601 (access date: 06.05.2020).
Vasilisko D.I. Printsipy yaponskoi estetiki v proyektnoi kulture kostiuma [Principles of Japan aesthetics in the project culture of costume] // Fashion and design: historical experience – new technologies: Proceedings of the XIXth International Scientific Conference / ed. N.M. Kalashnikova. SPb.: FSFEIHPE SPbSUITD, 2016. P. 311-316.
Warren S. Empirical Challenges in Organizational Aesthetics Research: Towards a Sensual Methodology // Organization Studies. V. 29. No. 4. P. 559-580. https://doi.org/10.1177/0170840607083104
Yegorova V.I. Obraz v virtualnoi srede [Image in the virtual environment] // Fundamental Researches. 2012. No. 9-4. P. 956-960.
4 Associative Perception of Conceptual Models of Exhibition Spaces
Olga P. Medvedeva
Southern Federal University, Academy of Architecture and Art, Rostov-on-Don, Russia
Abstract
This chapter is devoted to the associative perception of conceptual models and comprises two parts. The first part of the chapter considers associative modeling of environmental spaces, and the second part describes modeling of objects in exhibition spaces with the psychology of perception taken into account. The process of designing models of Russian pavilions for international exhibitions is described; the conceptual, compositional and figurative features of the objects are revealed. A range of exploratory figurative models based on associative impressions is presented. In the process of forming spatial environmental objects, the methods of associative, figurative and conceptual modeling, scenario planning, and compositional formation of exhibition spaces and objects have been applied. Variants of associative three-dimensional models of exhibition space objects are proposed.
Keywords: Association, associative-figurative model, perception, concept, prototype, shaping, exhibition, international exhibition of scientific and technical contributions
4.1 Associative Modeling of the Exhibition Space Environment
4.1.1 Introduction
Since the middle of the 19th century, owing to the dynamic growth of science, technology and industry, a substantial arsenal of crucial inventions
and discoveries has been accumulated. The leading representatives of many countries faced the need to demonstrate their contributions in the fields of scientific, technical and artistic design. This created a demand for platforms and sites where these contributions could be displayed. The opportunity to present new products at industrial exhibitions of scientific and technical contributions, at both the national and international level, became extremely popular. “The role of exhibitions in the modern world cannot be underestimated today. Each exhibition serves as a powerful catalyst for economic growth in the chosen host city. That is why there is a serious competition for the right to hold this historic event every five years” [Medvedeva, Boyko, 2017], [Boyko, 2018]. The EXPO world universal exhibitions occupy a special place, accumulating diverse contributions from the whole world in one place and striving to exchange useful experience at a global level. These conditions give every participating country the chance to present its contributions, discoveries and prospects in full.
Relevance of the research. The EXPO world exhibition is the most authoritative global platform for fair and open competition among the exhibiting countries. A significant feature is that exhibition visitors form a universal audience, covering almost all professions and all age and social groups of the population. Each participating country therefore has an urgent need for high-quality positioning of its contributions. This creates the need to build such a contemporary and striking figure of the national pavilion that it impresses visitors and is remembered. The relevance of the research is also supported by the fact that, apart from indicating the development level of society in the participating country, the national pavilions of international exhibitions are the clearest example of positioning artistic preferences along with the compositional, associative-figurative and conceptual solutions of environmental objects.
Problem statement. The history of the formation of world exhibitions as a cultural phenomenon is reflected in the works of domestic and foreign experts devoted to the theory and methodology of architectural and environmental design; to compositional and coloristic problems of the environment; to semiotic research; and to the psychology of perception of the conceptual figure of national pavilions. To confirm the hypothetical assumption that the perception of objects of international exhibitions is most effective when the methods of conceptual and scenario design and associative modeling are used, the following tasks were set: based on the analysis of domestic and foreign experience
of organizing the environment and objects of international exhibitions, to offer options for conceptual models of organizing the environmental space and for models of national pavilions of Russia at international exhibitions.
4.1.2 Conceptual and Terminological Apparatus of Conceptual Modeling and Shaping
A modern exhibition is not only a public event, a representation of contributions in various areas of public life, but also the very venue of this event. The concept of a world exhibition is defined in the Great Soviet Encyclopedia as an international exhibition organized to show the diverse activities of peoples in the fields of economics, science, technology, culture and art [Prokhorov, 1971], [Bollini, Borsotti, 2016]. A characteristic feature of the modern international exhibition is the dialogue of cultures, which contributes to the generation of new meanings. When designing the environment and exhibition objects, the task is to take into account and meet the needs of hundreds of thousands of visitors.
The term concept is used “to denote the leading idea, constructive principle” in any kind of activity, as “a certain way of understanding, interpreting any object, phenomenon, process, the main point of view on the subject or phenomenon” [Ilichev et al., 1983]. The more specialized term design concept is understood as the main idea of the future object, the formulation of its semantic content, the ideological and thematic basis of the design proposal, expressing the designer’s artistic and design judgment about phenomena larger than the object itself [Minervin et al., 2004], [Bianconi et al., 2020]. Thus, design at the conceptual level is positioned as forming a project at the level of the meaning or content of the system concept.
The process of shaping in the activity of an architect or designer occurs in accordance with general cultural values or other requirements related to the aesthetic expressiveness of the future object, its function, design and the materials used [Shimko, 2004]. In the process of artistic shaping, the functional characteristics and figurative solution of the object, as well as the aesthetic concept of the designer, are fixed. Shaping as a synthesis of complex interacting aspects of lifestyle is most fully reflected in the creative concepts that formulate the goals and objectives of design. The concept of the exhibition as a cultural phenomenon reveals the essential features and objective laws of the development and functioning in society of such a form as EXPO.
4.1.3 Compositional and Planning Basis for Creating the Environment of Exhibition Spaces
In design, compositional shaping means bringing harmony between objects and people through the relationship of the spatial, compositional, visual, anthropometric and other structures that define the environment. The effectiveness of compositional and planning solutions of environmental spaces can be assessed using the following basic criteria:
–– thematicity is revealed through the theme and motto of the exhibition;
–– integrity is revealed by identifying the main elements of the space of international exhibitions: borders, gaps, entrances;
–– compactness is revealed by identifying the main elements: structure – the spatial organization of the exhibition environment – and accessibility – the location of the exhibition space relative to the city and of transport communication relative to the exhibition;
–– dominance is revealed through the accents and visual markers in the exhibition environmental space.
To create a more complete figure of the environmental space, it is necessary to take into account not only the spatial and planning but also the semantic relationships between parts of the designed complexes.
4.1.4 Scenario Approach in the Figurative Solution of Environmental Spaces
The overall solution of environmental spaces, their subject saturation, the order of perception of objects, and the alternation of “active accent places” and “passive paths” are thought out by architects and designers at the stage of conceptual design. The process of building a scenario is based on combining the nature of the event and the emotional potential of the environment “as a result of the development of a person’s life environment. Accordingly, human activity and behavior are accepted as a determining factor that binds individual elements of the environment into the integrity” [Shimko, 2007]. The scenario approach in design involves the purposeful programming of visual impressions, both from the object and from the environment as a whole, the organization of the perception system, and the experience of action. The action algorithm embedded in the scenario is consistently unfolded in the environment and is perceived by the user as an event. The event perception
system provides management of the user’s emotional state, operates with his or her attention, and is aimed at building a behavioral scenario of a person in the environment. A scenario is a consistent description of hypothetically possible alternative courses of future events, which reflects different perspectives on the past, present and future and can also serve as a basis for planning actions [Van Notten, 2006]. A person strives to perceive the situation, to respond to spatial decisions, and to improvise in the process of interacting with the environment, in which figurative codes and universal bases of composition are laid down, taking into account the features of dynamic perception.
4.1.5 Conceptual Approach to Creating Exhibition Spaces
The conceptual approach is the main vector in the environmental design of the exhibition and in the scenario of organizing exhibition activities. At the stage of conceptual design, a defining idea is formed, research is conducted, and the parameters of the created solutions for objects and spaces are coordinated with their possible organization. The concept as an independent intellectual product is often in conflict with the traditional model of generally accepted norms and rules of thinking, which constrains both its emergence and the implementation of the author’s idea. The core of the conceptual approach to forming the environment and the object itself is the author’s main point of view, his understanding and interpretation of the phenomenon or process. The leading idea defines the design principles of shaping, both in planning and in volumetric solutions. In our opinion, the concept is a methodological basis for scenario design.
4.1.6 Perception of the Figurative Solution of the Environment
Scenario, environmental and conceptual approaches form an increased expressiveness of the environment, enriching and updating the usual processes of human life. A complete model of the future object is formed by describing contradictions, identifying the problem of the project situation, and selecting themes and solutions. This model is implemented directly at the stage of project shaping, searching for such properties of the form that are most essential for the perception of specific information and the environment. Perception of the figurative solution of the environment, like any kind of conscious human activity, is always associated with the development of
emotional states. A person’s emotional states are caused by experience, by his or her attitude to the external world, and by responses to signals from the external environment; they are an adjustment a person makes in response, determined by the informational structure of the stimulus. The conceptual scenario proposals for creating the environment of exhibition spaces were developed with the expectation of an emotional response from visitors (Figure 4.1.1). Two draft versions of the scenario design scheme for the environmental space are presented. The first concept is based on the idea of the hospitality of our country. Exhibition halls are placed in volumes that are perceived as an open circle (an embrace) with a pronounced communication center. Both draft versions are characterized by the fact that the perception of the complete figure occurs from the upper floors of the central volume. The scale of the environmental space is felt by visitors located on the main square formed by the volumes of the pavilions (Figures 4.1.1 and 4.1.2). A second variant of the scenario scheme is based on the idea of organizing space as a spiral of development, a return to the original, but at a higher level. This concept forms a multi-level space with a system of transition galleries. The proposal can be implemented both on flat terrain and on heterogeneous terrain using the technique of creating artificial embankments. The perception of the volumes of this concept, with its constantly changing
Figure 4.1.1 Option 1. Welcome hugs.
Figure 4.1.2 Option 2. Multi-level spiral.
height and volume, gives the visitor an impression of the dynamism of the environment [Liakou, Kosmas, 2019]. A further variant of the scenario scheme is proposed as an organization of objects with a unified communication space. This concept is close to the idea of holding exhibitions in one place, at one time, and within a single environment (Figure 4.1.3). The design of the main volumes is performed using the three-dimensional graphical modeling systems AutoCAD and 3ds Max [Carvajal et al., 2020], while the modeling of the communicative unifying central space is carried out using modern parametric modeling technologies [Schumacher, 2009], [Leung et al., 2018] (Figure 4.1.4).
Figure 4.1.3 Option 3. Unification.
Figure 4.1.4 Option 4. Continents.
This scenario scheme is based on the concept of uniting disparate continents. The environment is formed on five man-made islands located on an artificial lake. Water acts as a unifying factor while at the same time giving a sense of some isolation. The frame elements are connected by water transport and by transition bridges made of light transformable elements (Figure 4.1.5). The next version of the scenario scheme is based on the modular principle of space organization. This method contributes to the formation of the environment and to the development and growth of the exhibition city with subsequent synergy. The module can be chosen in different configurations; a variant of forming the scheme using a triangle and a hexagon has been proposed (Figure 4.1.6).
Figure 4.1.5 Option 5. Combinatorics.
Figure 4.1.6 Option 6. Length.
A variant of the scenario scheme is based on the assumption that the exhibition will take place on an elongated section of the city or along a river. The organization relies on a grid, a main axis of traffic direction, park zones, and the division of the territory allocated for pavilions, taking into account the required area. When creating a model of the environment, the method of combinatorial shaping can be applied (Figure 4.1.6).
4.2 Associative Modeling of Environmental Objects in Exhibition Spaces
4.2.1 Conceptual and Figurative Basis for the Formation of Environmental Objects
The design of national pavilions for international exhibitions differs from ordinary design in its pronounced conceptual basis, the value orientation of the authors, and an awareness of socio-cultural responsibility for the created work. In the concept, it is the artistic intent that determines the variety of compositional tools and techniques used to create the artistic integrity of the object. The effectiveness of the conceptual and figurative bases for forming environmental objects can be evaluated through the following key criteria of conceptual-figurative modeling:
–– tectonics;
–– context;
–– semiotics.
Tectonics takes into account the interaction of material and structure and the space-planning solution; it is the artistic and meaningful expression of the degree of the stressed state of the material form. Context is revealed by identifying such basic elements as time, politics, the international situation, and style. Semiotics is based on the recognition of signs and sign systems in the figurative formation of environmental objects.
4.2.2 Associative and Imaginative Modeling of Environmental Objects
“Associative-figurative modeling is an experimental and creative field for searching for a figure of the future reality. This is a mental prototype, an impression of an idea expressed by special language means of art (volume, color, word, etc.). Associative and figurative modeling offers a holistic perception of an artistic figure with its iconic features, which are inherent in the abstract-sensuous form of objects and phenomena. In contrast to traditional modeling, which is based on the principle of similarity and correspondence, there is a variety of associative directions (in terms of similarity, difference, spatial and temporal proximity, etc.), which is dominated by individuality and surprise of results” [Medvedeva, 2003].
An association is a psychological connection between ideas about various objects and phenomena, developed through life experience. Each form expresses a certain character, evoking some association. A composition in which the appearance of one element, under certain conditions, evokes the figure of another element associated with it is called an associative composition. Associations create the preconditions for abstraction. Based on existing figures, new figures (prototype ideas) are created, which in turn form new associations through modeling. There is a cyclical relationship between the basic types of composition: abstract – formal – associative.
4.2.3 Cognitive Bases of Perception of Associative-Figurative Models of Objects in Environmental Spaces
Modern associative-figurative models of national pavilions should be created taking into account the knowledge accumulated by cognitive psychology, which studies the cognitive processes associated with thinking, feeling, memory, perception and attention. To quickly process information from sensory signals about the surrounding world and understand the meaning of each of them, our brain uses stereotypes and interprets visual signals [Abbasov, 2019]. The “verbal” mode of brain work is ten times slower than the “imaginative” one; figurative memory is more stable because it is fixed by all the senses, and figures help to see the world holistically, colorfully and sonorously [Shirshov, 2001]. When creating the appearance of an environmental object, it is necessary to remember how important such mental processes as sensation, perception, emotion and thinking are for reading and understanding the figure and concept of objects. Thus, the figure of an environmental object can be assessed by comparing the perceived model with the internal figurative-conceptual model that has developed in a person, or through the emergence of the person’s own conceptual model. Since visual figures are remembered faster and retained longer in human memory (owing to the associativity of memory and the ability to distinguish objects by form), it is necessary to rely on the semiotics of semantic social communication. In associative modeling, communicative sign-figures and metaphors are used as means of artistic and figurative expression in design. Metaphor allows transferring the features of various socio-cultural and natural phenomena of human life to the form of the object, achieving an unexpected visual effect of an associative figure that arouses the exhibition visitor’s interest in a particular object and carries the author’s uniqueness [Zherdev, 2002].
The form is a sign, a metaphor, a characteristic of the time and of the human relations that defined the function, and the environment is a system of signs. One of the ways to identify the environment is to form architectural and design prototypes. The prototype, as the most common combination of features, together with many similar forms of the same pattern, is stored in a person’s memory. Pattern recognition occurs when the perceived pattern (model) matches the ideal mental prototype.
4.2.4 Perception of the Figurative Solution of an Environmental Object
To understand the essence of artistic formation in design and architecture, it is necessary to take into account the basic laws of perception psychology. For a long time, it was believed that a person perceives an object by assembling a whole from details, so that the emphasis was placed on detailing while designing. The integrity of the model became primary with the emergence of a new direction in psychology – Gestalt psychology (the science of perception) – in the early twentieth century. The modern understanding of perception is based on the fact that details are perceived only within the whole and manifest themselves as parts of a system. Perception as a form of sensory reflection of an object includes the formation of a sensory figure. A certain effect in the perception of an object is achieved through the use of recognizable elements, volumetric details, colors, and locations in space. The perception of a model of an environmental object should be considered an intellectual process associated with an active search for the features necessary for forming a figure and understanding it. The result of the perception process is a constructed figure as a subjective vision of the real world, perceived with the help of the senses. The product of a person’s understanding of the situation, taking into account the tasks facing him, is a conceptual model.
4.2.5 Options of Conceptual and Figurative Modeling of Objects in Environmental Spaces
Rudolf Arnheim, one of the greatest psychologists of the twentieth century, systematized the ideas of Gestalt psychology and applied them to the analysis of works of visual art, confirming that attention to the laws of perception is one of the most important components of design. Arnheim’s “principles of the organization of the art form and its perception in the process of visual cognition of the world”, the main categories in design and architecture,
were fully taken into account when creating the proposed figurative models of Russian national pavilions (Figures 4.2.1, 4.2.2, 4.2.3, 4.2.4). The three-dimensional modeling of environmental objects took into account such categories of perception psychology as weight and balance; simplicity and complexity; figure and background; grouping; movement; and others [Arnheim, 2007]. The proposed models should be perceived with the movement of visitors around and inside the objects in mind, so that the dynamism and change of the static forms can be felt. To obtain the expected reaction from the exhibition visitor, the viewer’s mindset is adjusted through architectural and design techniques, fixing attention with such artistic means as accent volume, color, metaphor, prototype, etc. In the compositional and conceptual-figurative basis for designing national pavilions for international exhibitions, the artistic aspect of shaping consciously or subconsciously relies on the geometric archetypes of stability, dominance, and striving forward and upward, thus positioning the political course, economic level and socio-cultural vector of the country’s development (Figure 4.2.1).
An option of the conceptual model of an environmental object for international exhibitions is proposed. The concept solution of the pavilion uses the archetypes of the square, the staircase and the spiral. The main volume of the pavilion is hidden from the eyes of visitors underground; they can see only the light and ventilation wells protruding at different levels, which serve as
Figure 4.2.1 The Concept of “Rebirth”.
Figure 4.2.2 The Concept of “Variability”.
Figure 4.2.3 The Concept of “Riddle”.
solar panels that accumulate light and heat. The roof is covered with grass decking and is designed as a usable surface. The pavilion is reached by a staircase that descends as if cutting through the water surface of the pool surrounding the entire area of the pavilion. The glass ceiling of the lower level, covered with a thin layer of pool water, transmits diffused light. Glass floorings on the sides of the stairs serve as a kind of bridge for passage to the center of the composition.
Figure 4.2.4 The Concept of “Infinity of Development”.
The use of different materials and textures in the structural elements of the pavilion composition contributes to creating an imaginative solution of the contrast between the technogenic and natural environment (Figure 4.2.2). The following version of the conceptual model of the environmental object for international exhibitions is proposed, in which the archetypes of a square, a spiral and a triangle are used in the design of the pavilion. The diverse and multidimensional representation of Russia held by residents of other countries is reflected in the appearance of the Russian pavilion through the transformation of the perceived form over time. This effect is achieved by the slow movement of the pavilion facade. Kinetic movement can be seen not only on the changeable facade, but also in the constructive organization of the pavilion itself. Its transformation is conceived as the ambiguity and variability of the figure of our country, which is constantly developing.
The base of the pavilion, its vertical core, is static and takes the form of a parallelepiped. The kinetic shell covers the base and moves around it. The movement and stops of the elevator are coordinated with the changes of the facade and the floor stops. The geometric archetypes underlying this shaping concept carry their own meanings: the square in plan is associated with the stability of the country in the modern world, while the spiral formed by the movement of the “shell” signifies development and the negation of stagnation. The triangle archetype is clearly traced in the form of individual elements of the kinetic volume of the pavilion.
A further variant of the conceptual model of an environmental object for international exhibitions is proposed, whose conceptual basis rests on the ideas of Soviet constructivism, one example of which is Malevich’s “Black Square”. Its three-dimensional interpretation in the form of a cube as a primary form refers to the semiotics of ancient pagan symbols, where the square is understood as the earth, stability and solidity. The figurative interpretation of the versatility, sociability, openness and diversity of the people of our country is reflected in the transparent faceted shell of the main volume, which is made of a light-absorbing black material symbolizing the unknown and the mysterious (Figure 4.2.3). For the integrity of the emotional and associative figure in this concept, it is important to resolve the internal space of the central main hall of the pavilion, conceived as a concise space where the emotional response of visitors is formed through audio and video sequences filled with figures of a different order.
4.3 Conclusion
Today, international exhibitions are accompanied by the mandatory highlighting of the most relevant issues, which act as vectors of intellectual development in the world community. These conditions force participants to develop new materials and technologies, while architects and designers search for new forms, techniques and methods of shaping objects and exhibition spaces based on them. The proposed variants of conceptual models for organizing the environment and the national pavilions of Russia at international exhibitions were developed on the basis of the methods of associative, figurative and conceptual modeling, scenario planning and compositional formation, taking into account the psychology of perception. Conceptual modeling of associative-figurative proposals is aimed at achieving a positive emotional effect and a comfortable state for the visitor in the communicative exhibition environment.
References
Abbasov I.B. Psychology of Visual Perception. Amazon Digital Services LLC, 2019. 141 p. ASIN: B07MPXKLQ7 [Electronic resource].
Arnheim R. Art and Visual Perception. Moscow: Architecture-S, 2007. 392 p.
Bianconi F., Filippucci M., Buffi A., Vitali L. Morphological and visual optimization in stadium design: a digital reinterpretation of Luigi Moretti’s stadiums // Architectural Science Review. 2020. V. 63, No. 2. P. 194–209. https://doi.org/10.1080/00038628.2019.1686341
Bollini L., Borsotti M. Strategies of Commutation in Exhibition Design // International Journal of Architectonic, Spatial, and Environmental Design. 2016. doi:10.18848/2325-1662/CGP/v10i01/13-21
Boyko A.V. Conceptual formation of environmental objects of international exhibitions: Graduation Thesis / A.V. Boyko; AAA SFU. Rostov-on-Don, 2018. 74 p.: 8 suppl.
Carvajal D.A.L., Morita M.M., Bilmes G.M. Virtual museums. Captured reality and 3D modeling // Journal of Cultural Heritage. 2020. In press. https://doi.org/10.1016/j.culher.2020.04.013
Ilichev L.V., Fedoseev P.N., Kovalev S.M., Panov V.G. Philosophical Encyclopedia. Moscow: Soviet Encyclopedia, 1983. 840 p.
Leung T.M., Kukina I.V., Lipovka A.Y. A parametric design framework for spatial structure of open space design in early design stage // Proceedings of the 25th ISUF International Conference: Urban Form and Social Context: from Traditions to Newest Demands (Krasnoyarsk, 5-9 July 2018). Krasnoyarsk: Sib. Feder. University, 2018. 214 p. P. 99.
Liakou M., Kosmas O. Analyzing museum exhibition spaces via visitor movement and exploration: The case of Whitworth Art Gallery of Manchester // 8th International Conference on Mathematical Modeling in Physical Science. Journal of Physics: Conference Series. 2019. V. 1391. 012171. doi:10.1088/1742-6596/1391/1/012171
Medvedeva O.P. Pedagogical conditions for creative self-development of a student’s personality in an institution of further education (based on the activity of a design studio). Diss. of a Cand. of Ped. Sciences. Rostov-on-Don, 2003. 168 p. P. 111.
Medvedeva O.P., Boyko A.V. World fairs: a social and cultural aspect // Modern Technologies in the World Scientific Space: Collection of Articles of the International Scientific and Practical Conference (May 25, 2017, Perm). In 6 parts. Part 6. Ufa: Aeterna, 2017. 241 p. P. 74-77.
Minervin G.B., Shimko V.T., Efimov A.V. et al. Design. Illustrated Dictionary Manual / Under the general editorship of G.B. Minervin and V.T. Shimko. Moscow: Architecture-S, 2004. 288 p.
Prokhorov A.M. Great Soviet Encyclopedia. Volume 5 / Chief editor A.M. Prokhorov. 3rd ed. Moscow: Soviet Encyclopedia, 1971. 640 p.
Schumacher P. Parametricism: A New Global Style for Architecture and Urban Design // Architectural Design. 2009. V. 79, No. 4. P. 14–23.
Shimko V.T. Architectural and Design Design: Textbook. Moscow: Architecture-S, 2007. 159 p.
Shimko V.T. Fundamentals of Design and Environmental Design. Moscow: Architecture-S, 2004. 160 p.
Shirshov V.D. Introduction to pedagogical semiotics // Pedagogy. 2001. No. 6. P. 33-39.
Van Notten Ph. Scenario development: a typology of approaches // Think Scenario. Rethink Education. OECD, 2006. P. 69–84.
Zherdev E.V. Metaphor in design: Theory and practice. Diss. of Doctor of Art History. Moscow, 2002. 253 p.
5 Disentanglement For Discriminative Visual Recognition
Xiaofeng Liu
Harvard University, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
Abstract
Recent successes of deep learning-based recognition rely on maintaining the content related to the main-task label. However, how to explicitly dispel noisy signals for better generalization in a controllable manner remains an open issue. For instance, various factors such as identity-specific attributes, pose, illumination and expression affect the appearance of face images. Disentangling the identity-specific factors is potentially beneficial for facial expression recognition (FER). This chapter systematically summarizes the detrimental factors as task-relevant/irrelevant semantic variations and unspecified latent variation. These problems are cast as either a deep metric learning problem or an adversarial minimax game in the latent space. For the former choice, a generalized adaptive (N+M)-tuplet clusters loss function, together with an identity-aware hard-negative mining and online positive mining scheme, can be used for identity-invariant FER. Better FER performance can be achieved by combining the deep metric loss and the softmax loss in a unified framework with two fully connected layer branches via joint optimization. For the latter solution, it is possible to equip an end-to-end conditional adversarial network with the ability to decompose an input sample into three complementary parts. The discriminative representation inherits the desired invariance property guided by prior knowledge of the task, and is marginally independent of the task-relevant/irrelevant semantic and latent variations. The framework achieves top performance on a series of tasks, including lighting-, makeup- and disguise-tolerant face recognition and facial attributes recognition. This chapter systematically
*Email: [email protected] Iftikhar B. Abbasov (ed.) Recognition and Perception of Images: Fundamentals and Applications, (143–188) © 2021 Scrivener Publishing LLC
summarizes popular and practical solutions for disentanglement to achieve more discriminative visual recognition.
Keywords: Visual recognition, disentanglement, deep metric learning, adversarial training, face recognition, facial attributes recognition
5.1 Introduction
Extracting a discriminative representation for the task at hand is an important research goal of recognition [Liu, et al., 2019], [Liu, et al., 2019], [Liu, et al., 2018]. The typical deep learning solution utilizes the cross-entropy loss to enforce that the extracted feature representation carries sufficient information about the label [Liu, et al., 2019]. However, this setting does not require the extracted representation to be purely focused on the label, and it usually incorporates unnecessary information that is not related to the label [Liu, et al., 2019], for example, identity information in a facial expression recognition feature. A perturbation of the identity will then unavoidably change the expression feature [Liu, et al., 2017]. These identity-specific factors degrade FER performance on new identities unseen in the training data [Liu, et al., 2019]. Since spontaneous expressions involve only subtle facial muscle movements, the extracted expression-related information from different classes can be dominated by sharp-contrast identity-specific geometric or appearance features which are not useful for FER. As shown in Figure 5.1, examples x1 and x3 are happy faces whereas x2 and x4 are not. f(xi) are the image representations using the extracted features. For FER, it is desirable that two face images with the same expression label be close to each other in the feature space, while face images with different expressions lie farther apart, i.e., the distance D2 between examples x1 and x3 should be smaller than D1 and D3, as in Figure 5.1 (b). However, the learned expression representations may contain irrelevant identity information, as illustrated in Figure 5.1 (a). Due to large inter-identity variations, D2 usually has a large value while D1 and D3 are relatively small. Similarly, expression-related factors will also affect the recognition of face identity. In fact, a pure feature representation is the guarantee of a robust recognition system.
This chapter targets the problem of explicitly eliminating the detrimental variations following the prior knowledge of the task, to achieve better generalization. This is challenging since the training set contains images annotated with multiple semantic variations of interest, but there is no example of the transformation (e.g., gender) as in unsupervised image translation
Figure 5.1 Illustration of representations in feature space learned by (a) existing methods, and (b) the proposed method. Here x1 & x2 ∈ Subject 1 and x3 & x4 ∈ Subject 2; D1, D2 and D3 denote distances between the embeddings f(xi) of happy, sad and surprised faces.
[Dong, et al., 2017], [Li, et al., 2015], and the latent variation is totally unspecified [Liu, et al., 2019]. Following the terminology used in previous multi-class datasets (including a main-task label and several side-labels) [Jha, et al., 2018], [Makhzani, et al., 2015], [Mathieu, et al., 2016], three complementary parts can be defined as in Figure 5.2. The factors relating to the side-labels are named the semantic variations (s), which can be either task-relevant or task-irrelevant depending on whether they are marginally independent of the main recognition task. The latent variation (l) summarizes the remaining properties unspecified by the main and semantic labels. How a DNN can systematically learn a discriminative representation (d) that is informative for the main recognition task, while marginally independent of multiple s and the unspecified l in a controllable way, remains challenging.
Several efforts have been made to make the main-task representation invariant to a single task-irrelevant (independent) semantic factor, such as pose-, expression- or illumination-invariant face recognition via neural preprocessing [Huang, et al., 2017], [Tian, et al., 2018] or metric learning [Liu, et al., 2017].
Figure 5.2 Illustration of the expected separation of the observation x, which is associated with the discriminative representation d (red), the latent variation l (green) and the semantic variations ŝ (blue). The framework explicitly enforces them to be marginally independent of each other; d and the task-dependent s are related to the main recognition task label y.
To further improve the discriminating power of the expression feature representations and address the large intra-subject variation in FER, a potential solution is to incorporate a deep metric learning scheme within a convolutional neural network (CNN) framework [Liu, et al., 2018], [Liu, et al., 2018], [Liu, et al., 2019], [Liu, et al., 2017]. The fundamental philosophy behind the widely used triplet loss function [Ding, et al., 2015] is to require one positive example to be closer to the anchor example than one negative example by a fixed gap τ. Thus, during one iteration, the triplet loss ignores the negative examples from the rest of the classes. Moreover, either of the two examples from the same class in a triplet can be chosen as the anchor point. However, there exist special cases in which the triplet loss function with an inappropriate anchor may judge falsely, as illustrated in Figure 5.4; the performance is therefore quite sensitive to the anchor selection in the triplet input. The authors adapted the idea of the (N+1)-tuplet loss [Sohn, 2016] and the coupled clusters loss (CCL) [Liu, et al., 2016] to design an (N+M)-tuplet clusters loss function which incorporates a negative set with N examples and a positive set with M examples in a mini-batch. A reference distance T is introduced to force the negative examples to move away from the center of the positive examples while the positive examples simultaneously map into a small cluster around their center c+. The circles of radius T + τ/2 and T − τ/2 centered at c+ form the boundaries of the negative set and positive set, respectively, as shown in Figure 5.4 (d).
Figure 5.3 Framework of our facial expression recognition model used for training. The deep convolutional network aims to map the original expression images into a feature space in which images of the same expression tend to form a cluster while other images tend to lie far away.
By doing this, the loss can handle complex distributions of intra- and inter-class variations and avoids the anchor selection problem of conventional deep metric learning methods. Furthermore, the reference distance T and the margin τ can be learned adaptively via back-propagation in the CNN instead of being manually set hyper-parameters. [Liu, et al., 2017] propose a simple and efficient mini-batch construction scheme that uses images of different expressions with the same identity as the negative set to avoid expensive hard-negative example searching, while mining the positive set online. The (N+M)-tuplet clusters loss then guarantees that all the discriminating negative samples are used efficiently per update to achieve identity-invariant FER. Besides, joint optimization of the softmax loss and the (N+M)-tuplet clusters loss is used to exploit both the expression label and identity label information. Considering the different characteristics of each loss function and its task, two branches of fully connected (FC) layers are developed, together with a connecting layer to balance them. The features extracted by the expression classification branch can be fed to the subsequent metric learning processing. This enables each branch to focus better on its own task without embedding much information of the other. As shown in Figure 5.3, the inputs are two facial expression image sets: one positive set (images of the same expression from different subjects) and one negative set (images of other expressions with the same identity as the query example). The deep features and distance metrics are learned simultaneously in one network. [Liu, et al., 2019], [Liu, et al., 2017] propose a generalized (N+M)-tuplet clusters loss function with an adaptively learned reference threshold, which can be seamlessly factorized into a linear fully connected layer for end-to-end learning. With the identity-aware negative mining and online positive mining scheme, the distance metrics can be learned with fewer input passes and distance calculations, without sacrificing performance for identity-invariant FER.
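The chapter describes this loss only in prose; the following PyTorch-style sketch shows one way its geometry could be realized. The class name, the hinge (ReLU) surrogate, the Euclidean distance, and the initial values of T and τ are all assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TupletClustersLoss(nn.Module):
    """Sketch of an (N+M)-tuplet clusters loss: positives are pulled
    inside radius T - tau/2 around their own center c+, while negatives
    are pushed beyond radius T + tau/2.  T and tau are learnable here,
    mirroring the adaptively learned reference distance and margin
    described in the text."""
    def __init__(self, init_T: float = 1.0, init_tau: float = 0.4):
        super().__init__()
        self.T = nn.Parameter(torch.tensor(init_T))      # reference distance
        self.tau = nn.Parameter(torch.tensor(init_tau))  # margin

    def forward(self, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
        # pos: (M, dim) embeddings of one expression from different subjects
        # neg: (N, dim) embeddings of other expressions of the same identity
        center = pos.mean(dim=0, keepdim=True)           # positive center c+
        d_pos = torch.cdist(pos, center).squeeze(1)      # (M,) distances to c+
        d_neg = torch.cdist(neg, center).squeeze(1)      # (N,) distances to c+
        inner = self.T - 0.5 * self.tau                  # positive boundary
        outer = self.T + 0.5 * self.tau                  # negative boundary
        pull = F.relu(d_pos - inner).mean()              # pull positives inward
        push = F.relu(outer - d_neg).mean()              # push negatives outward
        return pull + push
```

In use, `pos` and `neg` would be the metric-branch embeddings of a mini-batch assembled by the identity-aware scheme described above.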
The softmax loss and the (N+M)-tuplet clusters loss are optimized jointly in a unified two-branch FC layer metric learning CNN framework according to their characteristics and tasks. In experiments, [Liu, et al., 2017] demonstrate that the proposed method achieves promising results, outperforming several state-of-the-art approaches not only on posed facial expression datasets (e.g., CK+, MMI) but also on a spontaneous facial expression dataset (namely, SFEW).
However, the metric learning-based solution bears the drawback that the cost used to regularize the representation is pairwise [Liu], which does not scale well since the number of values the attribute can take could be large [Liu, et al., 2019]. Since the invariance we care about can vary greatly across tasks, these approaches require a new architecture to be designed each time a new invariance is required. Moreover, a basic assumption in their theoretical analysis is that the attribute is irrelevant to the prediction, which limits their capability to analyze task-relevant (dependent) semantic labels. Such labels are usually used to achieve attribute-enhanced recognition via feature aggregation in multi-task learning [Hu, et al., 2017], [Kingma and Ba, 2014], [Li, et al., 2018], [Peng, et al., 2017] (e.g., gender, age and ethnicity can shrink the search space for face identification). However, invariance w.r.t. those attributes is also desired in some specific tasks. For example, a makeup face recognition system should be invariant to age, hair color, etc. Similarly, gender and ethnicity are sensitive factors in fairness/bias-free classification when predicting the credit and health condition of a person. These semantic labels and the main task label are related due to the inherent bias within the data. A possible solution is setting this attribute as a random variable of a probabilistic model and reasoning about the invariance explicitly [Fu, et al., 2013], [Liu, et al., 2015], [Xiao, et al., 2017]. Since the divergence between a pair of distributions is used as the criterion to induce the invariance, the number of pairs to be processed grows quadratically with the number of attribute values (matching distributions across K values requires K(K−1)/2 pairwise penalties), which can be computationally expensive for multiple variations in practice. Another challenge is how to achieve better generalization by dispelling latent variations that have no label. For instance, we may expect the face recognition system not only to be invariant to expression following the side label, but also to be applicable to different races, which have no side label. Note that this problem also shares some similarity with feature disentanglement in the image generation area [Guo, et al., 2013], [Makhzani, et al., 2015], although here the goal is to improve classification performance rather than to synthesize high-quality images.
Motivated by the aforementioned difficulties, [Liu, et al., 2019] propose a system which can dispel a group of undesired task-irrelevant/relevant and latent variations in an unsupervised manner: it needs neither paired semantic transformation examples [Dong, et al., 2017], [Li, et al., 2015] nor latent labels. Specifically, [Liu, et al., 2019] resort to an end-to-end conditional adversarial training framework. The approach relies on an encoder-decoder architecture where, given an input image x with its main-task label y and a to-be-dispelled semantic variation label s, the encoders map x to a discriminative representation d and a latent variation l, and the decoder is trained to reconstruct x given (d, s, l). The framework configures a semantic discriminator conditioned on s only, and two classifiers with inverse objectives conditioned on d and l, respectively, to constrain the latent space for manipulating multiple variations with better scalability. It is able to explicitly learn a task-specific discriminative representation with the desired invariance property by systematically incorporating prior domain knowledge of the task. The multiple semantic variations to be dispelled can be either task-dependent or task-independent, and the unspecified latent variation can also be eliminated in an unsupervised manner. The semantic discriminator and the two inverse classifiers constrain the latent space and result in a simpler training pipeline and better scalability. The theoretical equilibrium conditions in different dependency scenarios have been analyzed. Extensive experiments on Extended YaleB, three makeup sets, CelebA, LFWA and the DFW disguised face recognition benchmark verify its effectiveness and generality.
5.2 Problem Statement. Deep Metric Learning Based Disentanglement for FER
FER focuses on the classification of the seven basic facial expressions which are considered to be common among humans [Tian, et al., 2005]. Much progress has been made on extracting sets of features to represent facial images [Jain, et al., 2011]. Geometric representations utilize the shape of, or the relationships between, facial landmarks. However, they are sensitive to facial landmark misalignments [Shen, et al., 2015]. On the other hand, appearance features, such as Gabor filters, Scale Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), Local Phase Quantization (LPQ), Histogram of Oriented Gradients (HOG), and combinations of these features via multiple kernel learning, are usually used for representing facial textures [Baltrušaitis, et al., 2015], [Jiang, et al., 2011],
[Yüce, et al., 2015], [Zhang, et al., 2014]. Some methods, such as active appearance models (AAM) [Tzimiropoulos and Pantic, 2013], combine the geometric and appearance representations to provide better spatial information. Due to the limitations of handcrafted filters [Liu, et al., 2017], [Liu, et al., 2018], extracting purely expression-related features is difficult. The development of deep learning, especially the success of CNNs [Liu], has made high-accuracy image classification possible in recent years [Che, et al., 2019], [Liu, et al., 2019], [Liu, et al., 2018], [Liu, et al., 2018]. It has also been shown that carefully designed neural network architectures perform well in FER [Mollahosseini, et al., 2016]. Despite its popularity, the current softmax loss-based network does not explicitly encourage intra-class compactness and inter-class separation [Liu, et al., 2020], [Liu, et al., 2019]. The emerging deep metric learning methods have been investigated for person recognition and vehicle re-identification problems with large intra-class variations, which suggests that deep metric learning may offer more pertinent representations for FER [Liu, et al., 2019]. Compared to traditional distance metric learning, deep metric learning learns a nonlinear embedding of the data using deep neural networks. The initial work was to train a Siamese network with a contrastive loss function [Chopra, et al., 2005]. The pairwise examples are fed into two symmetric sub-networks to predict whether they are from the same class. Without interactions between positive pairs and negative pairs, the Siamese network may fail to learn effective metrics in the presence of large intra- and inter-class variations. One improvement is the triplet loss approach [Ding, et al., 2015], which achieved promising performance in both re-identification and face recognition problems. The inputs are triplets, each consisting of a query, a positive example and a negative example. Specifically, it forces the difference between the distance from the anchor point to the positive example and the distance from the anchor point to the negative example to be larger than a fixed margin τ. Recently, some variations with faster and more stable convergence have been developed. The model most similar to the proposed method is the (N+1)-tuplet loss [Sohn, 2016]. Let x+ and x− denote a positive and a negative example of a query example x, meaning that x+ is of the same class as x, while x− is not. Considering an (N+1)-tuplet which includes x, x+ and N−1 negative examples
$\{x_j^-\}_{j=1}^{N-1}$, the loss is:

$$L\left(\left\{x,\, x^{+},\, x_j^{-}\right\}_{j=1}^{N-1};\, f\right) = \log\left(1 + \sum_{j=1}^{N-1} \exp\left(D(f, f^{+}) + \tau - D(f, f_j^{-})\right)\right) \quad (5.1)$$
where f(·) is an embedding kernel defined by the CNN, which takes x and generates an embedding vector f(x), written as f for simplicity, with f inheriting all superscripts and subscripts. D(·,·) is defined as the Mahalanobis or Euclidean distance depending on the implementation. The philosophy here also shares commonality with the coupled clusters loss [Liu, et al., 2016], in which the positive example center c+ is set as the anchor. By comparing each example with this center instead of with every other example, the number of evaluations per mini-batch is largely reduced. Despite their wide use, the above-mentioned frameworks still suffer from expensive example mining to provide nontrivial pairs or triplets, and from poor local optima [Liu, et al., 2019]. In practice, generating all possible pairs or triplets would result in quadratic and cubic complexity, respectively, and most of these pairs or triplets are of little value in the training phase. Also, the traditional online or offline mini-batch sample selection is a large additional burden. Moreover, as shown in Figure 5.4 (a), (b) and (c), all of them are sensitive to the anchor point selection when the intra- and inter-class variations are large. In the illustrated cases the triplet loss, (N+1)-tuplet loss and CCL are all 0, since the distances between the anchor and the positive examples are indeed smaller than the distances between the anchor and the negative examples by a margin τ. This means the loss function will neglect these cases during back-propagation, and many more input passes with properly selected anchors are needed to correct them. The fixed threshold in the contrastive loss was also proven to
Figure 5.4 Failed cases of (a) the triplet loss, (b) the (N+1)-tuplet loss, and (c) the coupled clusters loss. The proposed (N+M)-tuplet clusters loss is illustrated in (d).
be sub-optimal, as it fails to adapt to the local structure of the data. [Li, et al., 2013] proposed to address this issue by learning a linear SVM in a new feature space. Some works [Goodfellow, et al., 2013], [Wang, et al., 2014] used shrinkage-expansion adaptive constraints for pairwise input, optimized by alternating between SVM training and projection onto the cone of positive semidefinite (PSD) matrices, but this mechanism cannot be implemented directly in deep learning. A recent study presented objective comparisons between the softmax loss and deep metric learning losses and showed that they can be complementary to each other [Horiguchi, et al., 2016]. Therefore, an intuitive approach for improvement is to combine the classification and similarity constraints into a joint CNN learning framework. For example, [Sun, et al., 2014], [Yi, et al., 2014] combined the contrastive and softmax losses to achieve better performance, while [Zhang, et al., 2016] proposed to combine the triplet and softmax losses via joint optimization. These models improve on a traditional CNN with softmax loss because the similarity constraints augment the information for training the network. The more difficult learning objective can also effectively avoid overfitting. However, all these strategies apply the similarity as well as the classification constraints directly on the last FC layer, so that harder tasks cannot be assigned to deeper layers (i.e., more weights) and the interactions between constraints are implicit and uncontrollable. Normally, the softmax loss converges much faster than the deep metric learning loss in multi-task networks. This situation has motivated the construction of a unified CNN framework that learns these two loss functions simultaneously in a more reasonable way.
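As with the loss above, the two-branch joint optimization is described only in prose; the sketch below shows one plausible arrangement, reusing the TupletClustersLoss from the earlier sketch. The layer sizes, the seven-class output, and the balancing weight `lam` are illustrative assumptions, not values reported in the chapter.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Sketch of a two-branch FC head on shared CNN features: one branch
    yields expression logits for the softmax (cross-entropy) loss, the
    other an embedding for the metric loss; a scalar weight balances the
    two objectives during joint optimization."""
    def __init__(self, feat_dim=512, embed_dim=128, num_classes=7, lam=0.5):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, num_classes)    # softmax branch
        self.metric_branch = nn.Linear(feat_dim, embed_dim)   # metric branch
        self.lam = lam
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, labels, metric_loss_fn, pos_mask, neg_mask):
        # feats: (B, feat_dim) shared CNN features; the boolean masks select
        # the positive and negative sets within the mini-batch.
        loss_cls = self.ce(self.cls_branch(feats), labels)
        emb = self.metric_branch(feats)
        loss_metric = metric_loss_fn(emb[pos_mask], emb[neg_mask])
        return loss_cls + self.lam * loss_metric
```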
5.3 Adversarial Training Based Disentanglement
The task of the Feature-level Frankenstein (FLF) framework can be formalized as follows. Given a training set $\{x^1, s^1, y^1\}, \ldots, \{x^M, s^M, y^M\}$ of M samples {image, semantic variations, class}, we are interested in the task of disentangling the feature representation of x into three complementary parts, i.e., the discriminative representation d, the semantic variation s and the latent variation l. These three codes are expected to be marginally independent of each other, as illustrated schematically in Figure 5.2. In the case of faces, typical semantic variations include gender, expression, etc. All the remaining variability unspecified by y and s falls into the latent part l. Note that there are two possible dependency scenarios of s and y, as discussed in Section 5.1. This does not affect the definition of l, and the information related to y should incorporate d and some of the task-dependent s.
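To make the formalization concrete, a minimal sketch of the three-way split follows. It assumes the codes are computed from backbone features and that the decoder reconstructs those features rather than raw pixels; all module names and dimensions are illustrative assumptions, not the published FLF architecture.

```python
import torch
import torch.nn as nn

class DisentangleSketch(nn.Module):
    """Sketch of the three-way split: two encoders map the features of x
    to a discriminative code d and a latent code l; the decoder
    reconstructs from (d, s, l), where s is a one-hot semantic label."""
    def __init__(self, feat_dim=512, d_dim=128, l_dim=64, s_dim=10):
        super().__init__()
        self.enc_d = nn.Linear(feat_dim, d_dim)   # discriminative encoder
        self.enc_l = nn.Linear(feat_dim, l_dim)   # latent encoder
        self.dec = nn.Sequential(
            nn.Linear(d_dim + s_dim + l_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),        # feature reconstruction
        )

    def forward(self, feat, s_onehot):
        d = self.enc_d(feat)
        l = self.enc_l(feat)
        recon = self.dec(torch.cat([d, s_onehot, l], dim=1))
        return d, l, recon   # recon drives a reconstruction loss on (d, s, l)
```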
Multi-task learning is a typical method for utilizing multi-class labels. It has been observed in many prior works that joint learning of the main task and relevant side tasks can help improve performance in an aggregation manner [Hu, et al., 2017, Kingma and Ba, 2014, Li, et al., 2018, Peng, et al., 2017], whereas here the side variations are targeted for dispelling. Generative Adversarial Networks (GANs) have attracted increasing attention. Conventionally, under the two-player (i.e., generator and discriminator) formulation, the vanilla GANs [Goodfellow, et al., 2014, Yang, et al., 2018] are good at generating realistic images, but their potential for recognition remains to be developed. The typical method uses GANs as an image preprocessing step, similar to "denoising", and then uses these processed images for normal training and testing [Huang, et al., 2017], [Liu, et al., 2017], [Lu, et al., 2017], [Netzer, et al., 2011], [Tenenbaum and Freeman, 2000], [Tian, et al., 2018], [Tzeng, et al., 2017]. [Liu, et al., 2019] instead deploy the trained network directly as a feature extractor for predictions. Compared with the pixel-level GANs [Huang, et al., 2017], [Lu, et al., 2017], [Tian, et al., 2018], [Xie, et al., 2017], their feature-level competition results in much simpler training schemes and scales nicely to multiple attributes. Moreover, pixel-level methods usually cannot dispel task-relevant s; e.g., dispelling gender from identity cannot yield a verisimilar face image for subsequent network training [Yang, et al., 2019]. Besides, they usually focus on a single variation for a specific task. Actually, most GANs and adversarial domain adaptation methods [Cao, et al., 2018], [Li, et al., 2014], [Tishby and Zaslavsky, 2015] use a binary adversarial objective and are applied to no more than two distributions. It is worth noting that some GAN works, e.g., Semi-Supervised GAN [Kingma and Welling, 2013] and DR-GAN [Tian, et al., 2018], have claimed that they consider multiple side labels. Indeed, they add a new branch for multi-categorical classification, but their competing adversarial loss only confuses the discriminator between two distributions (real or generated), and no adversarial strategies are adopted between different categories in the auxiliary multi-categorical classifier branch. [Liu, et al., 2019] differ from them in two aspects: 1) the input of the semantic discriminator is a feature, instead of a real/synthesized image; 2) the goal of the encoder is to match or align the feature distributions between any two different attributes, instead of only the real/fake distributions, and there is no "real" class in the semantic discriminator. Fairness/bias-free classification also targets a representation that is invariant to a certain task-relevant (dependent) factor (i.e., bias) and hence makes the predictions fair [Edwards and Storkey, 2015]. As data-driven models trained using historical data easily inherit the bias exhibited in the data,
the fair VAE [Liu, et al., 2015] tackled the problem using a variational autoencoder structure [Kushwaha, et al., 2018] together with maximum mean discrepancy (MMD) regularization [Li, et al., 2018]. [Xie, et al., 2017] proposed to regularize the ℓ1 distance between the representation distributions of data with different nuisance variables to enforce fairness. These methods share the drawback that the cost used to regularize the representation is pairwise, which does not scale well to multiple task-irrelevant semantic variations. Latent variation disentangled representation is closely related to this work. It tries to separate the input into two complementary codes according to their correlation with the task, for image transforms in a single-label dataset setting [Bengio, 2009]. Early attempts [Simonyan and Zisserman, 2014] separate text from fonts using bilinear models. Manifold learning and VAEs were used in [Elgammal and Lee, 2004], [Kingma and Welling, 2013] to separate the digit from the style. "What-where" encoders [Zellinger, et al., 2017] combined reconstruction criteria with discrimination to separate the factors that are relevant to the labels. Unfortunately, these approaches cannot be generalized to unseen identities. Later works added the GAN objective to the VAE objective to relax this restriction using an intricate triplet training pipeline, and [Bao, et al., 2018, Hadad, et al., 2018, Hu, et al., 2018, Jiang, et al., 2017, Liu, et al., 2018] further reduced the complexity. Inspired by them, [Liu, et al., 2019] make their framework implicitly invariant to the unspecified l for better generality in a simple yet efficient way, although the core is to dispel s, and they do not target image analogies [Makhzani, et al., 2015].
5.4 Methodology. Deep Metric Learning Based Disentanglement for FER

Here a simple description of the intuition is given: a reference distance T is introduced to control the relative boundaries T − τ/2 and T + τ/2 for the positive and negative examples respectively, as shown in Figure 5.4 (d). The (N+1)-tuplet loss function in Eq. (5.1) can be rewritten as follows:
$$\mathcal{L}\left(x, x^{+}, \{x_{j}^{-}\}_{j=1}^{N-1};\, f\right) = \log\left(1 + \sum_{j=1}^{N-1} \exp\left(D(f, f^{+}) - \left(T - \frac{\tau}{2}\right) + \left(T + \frac{\tau}{2}\right) - D(f, f_{j}^{-})\right)\right)$$

$$= \log\left(1 + \sum_{j=1}^{N-1} \exp\left(D(f, f^{+}) - \left(T - \frac{\tau}{2}\right)\right) \cdot \exp\left(\left(T + \frac{\tau}{2}\right) - D(f, f_{j}^{-})\right)\right) \qquad (5.2)$$
Indeed, the exp(D(f, f⁺) − (T − τ/2)) term used to pull the positive examples together and the exp((T + τ/2) − D(f, f_j⁻)) term used to push the negative examples away have an "OR" relationship. A relatively large negative distance will make the loss function ignore a large absolute positive distance. One way to alleviate large intra-class variations is to construct an "AND" function for these two terms. The triplet loss can also be extended to incorporate M positive examples and N negative examples. Considering a multi-classification problem, the triplet loss and CCL only compare the query example with one negative example, which only guarantees the embedding vector of the query to be far from one selected negative class instead of every class. The expectation of these methods is that the final distance metrics will be balanced after a sufficient number of iterations. However, towards the end of the training, individual iterations may exhibit zero errors due to the lack of discriminative negative examples, causing the iterations to be unstable or slow to converge. The identity labels in FER databases largely facilitate hard-negative mining to alleviate the effect of inter-subject variations. In practice, for a query example, [Liu, et al., 2017] compose its negative set with all the different expression images of the same person. Moreover, randomly choosing one or a group of positive examples is the paradigm of conventional deep metric methods, but some extremely hard positive examples may distort the manifold and force the model to over-fit. In the case of spontaneous FER, the expression label may be erroneously assigned due to the subjectivity or varied expertise of the annotators [Barsoum, et al., 2016, Zafeiriou, et al., 2016]. Thus, an efficient online mining scheme for the M randomly chosen positive examples should be designed for datasets with large intra-class variation. [Liu, et al., 2017] find the nearest negative example and ignore those positive examples with a larger distance; Algorithm 1 shows the details. In summary, the new loss function is expressed as follows:
$$\mathcal{L}\left(\{x_{i}^{+}\}_{i=1}^{M}, \{x_{j}^{-}\}_{j=1}^{N};\, f\right) = \frac{1}{M^{*}} \sum_{i=1}^{M^{*}} \max\left(0,\, D(f_{i}^{+}, c^{+}) - \left(T - \frac{\tau}{2}\right)\right) + \frac{1}{N} \sum_{j=1}^{N} \max\left(0,\, \left(T + \frac{\tau}{2}\right) - D(f_{j}^{-}, c^{+})\right) \qquad (5.3)$$
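To make the structure of Eq. (5.3) and the online positive mining concrete, the following is a minimal NumPy sketch, not the authors' implementation: it assumes Euclidean distance for D(∙,∙) and takes the positive center c⁺ as the mean of the positive embeddings; all names and toy data are illustrative.

    import numpy as np

    def tuplet_clusters_loss(f_pos, f_neg, T, tau):
        # f_pos: (M, d) positive embeddings; f_neg: (N, d) negative embeddings.
        c_pos = f_pos.mean(axis=0)                     # positive center c+
        d_pos = np.linalg.norm(f_pos - c_pos, axis=1)  # D(f_i+, c+)
        d_neg = np.linalg.norm(f_neg - c_pos, axis=1)  # D(f_j-, c+)
        # Online positive mining: ignore positives farther than the nearest negative.
        mined = d_pos[d_pos <= d_neg.min()]
        if mined.size == 0:                            # keep at least the easiest positive
            mined = d_pos[[d_pos.argmin()]]
        pos_term = np.maximum(0.0, mined - (T - tau / 2)).mean()   # 1/M* sum over M*
        neg_term = np.maximum(0.0, (T + tau / 2) - d_neg).mean()   # 1/N  sum over N
        return pos_term + neg_term

    # Toy usage: 6 positives, 6 negatives in a 16-d embedding space.
    rng = np.random.default_rng(0)
    print(tuplet_clusters_loss(rng.normal(0.0, 1.0, (6, 16)),
                               rng.normal(3.0, 1.0, (6, 16)), T=4.0, tau=1.0))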
The simplified geometric interpretation is illustrated in Figure 5.4 (d). Only if the distances from the online-mined positive examples to the updated c⁺ are smaller than T − τ/2 and the distances from the negative examples to the updated c⁺ are larger than T + τ/2 can the loss reach a zero value. This is much more consistent with the principle used by many data clustering and discriminative analysis methods. One can see that the conventional triplet loss and its variations become special cases of the (N+M)-tuplet clusters loss under this framework. For a batch consisting of X queries, the number of input passes required to evaluate the necessary embedding feature vectors is X, and the total number of distance calculations is 2(N + M) ∗ X. Normally, N and M are much smaller than X. In contrast, the triplet loss requires $C_X^3$ passes and $2C_X^3$ distance calculations, and the (N+1)-tuplet loss requires (X + 1) ∗ X passes and (X + 1) ∗ X² distance calculations. Even for a dataset of moderate size, it is intractable to load all possible meaningful triplets into the limited memory for model training. By assigning different values for T and τ, [Liu, et al., 2017] define a flexible learning task with adjustable difficulty for the network. However, these two hyper-parameters need manual tuning and validation. In the spirit of adaptive metric learning for SVM [Li, et al., 2013], [Liu, et al., 2017] formulate the reference distance as a function T(∙,∙) related to each example pair instead of a constant. Since the Mahalanobis distance in Eq. (5.4) is itself quadratic and can be calculated automatically via a linear fully connected layer as in [Shi, et al., 2016], [Liu, et al., 2017] assume T(f₁, f₂) to be a simple quadratic form, i.e., $T(f_1, f_2) = \frac{1}{2} z^{t} Q z + \omega^{t} z + b$, where $z = [f_1^{t}, f_2^{t}]^{t} \in \mathbb{R}^{2d}$, $Q = \begin{bmatrix} Q_{f_1 f_1} & Q_{f_1 f_2} \\ Q_{f_2 f_1} & Q_{f_2 f_2} \end{bmatrix} \in \mathbb{R}^{2d \times 2d}$, $\omega^{t} = [\omega_{f_1}^{t}, \omega_{f_2}^{t}] \in \mathbb{R}^{2d}$, $b \in \mathbb{R}$, and $f_1, f_2 \in \mathbb{R}^{d}$ are the representations of two images in the feature space.
$$D(f_1, f_2) = \| f_1 - f_2 \|_{M}^{2} = (f_1 - f_2)^{T} M (f_1 - f_2) \qquad (5.4)$$
Due to the symmetry property with respect to f1 and f2, T(f1, f2) can be rewritten as follows:
$$T(f_1, f_2) = \frac{1}{2} f_1^{t} \tilde{A} f_1 + \frac{1}{2} f_2^{t} \tilde{A} f_2 + f_1^{t} \tilde{B} f_2 + c^{t}(f_1 + f_2) + b \qquad (5.5)$$
Disentanglement For Discriminative Visual Recognition 157 = Q f f = Q f f and B = Q f f = Q f f are both the d × d real symwhere A 1 2 2 1 1 1 2 2 metric matrices (not necessarily positive semi-definite), c = ω f1 = ω f2 is a d-dimensional vector, and b is the bias term. Then, a new quadratic formula H(f1, f2) = T(f1, f2) – D(f1, f2) is defined to combine the reference distance function and distance metric function. Substituting Eq. (5.4) and Eq. (5.5) to H(f1, f2), we get:
$$H(f_1, f_2) = \frac{1}{2} f_1^{t} (\tilde{A} - 2M) f_1 + \frac{1}{2} f_2^{t} (\tilde{A} - 2M) f_2 + f_1^{t} (\tilde{B} + 2M) f_2 + c^{t}(f_1 + f_2) + b \qquad (5.6)$$

$$H(f_1, f_2) = \frac{1}{2} f_1^{t} A f_1 + \frac{1}{2} f_2^{t} A f_2 + f_1^{t} B f_2 + c^{t}(f_1 + f_2) + b \qquad (5.7)$$
where $A = (\tilde{A} - 2M)$ and $B = (\tilde{B} + 2M)$. Supposing A is positive semi-definite (PSD) and B is negative semi-definite (NSD), A and B can be factorized as $L_A^{T} L_A$ and $L_B^{T} L_B$. Then H(f₁, f₂) can be formulated as follows:
$$H(f_1, f_2) = \frac{1}{2} f_1^{t} L_A^{T} L_A f_1 + \frac{1}{2} f_2^{t} L_A^{T} L_A f_2 + f_1^{t} L_B^{T} L_B f_2 + c^{t}(f_1 + f_2) + b$$

$$= \frac{1}{2} (L_A f_1)^{t} (L_A f_1) + \frac{1}{2} (L_A f_2)^{t} (L_A f_2) + (L_B f_1)^{t} (L_B f_2) + c^{t} f_1 + c^{t} f_2 + b \qquad (5.8)$$
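As a small illustration of Eq. (5.8), the sketch below evaluates H(f₁, f₂) from given factor matrices L_A, L_B, vector c and bias b. It is only a numerical check of the algebra; in the actual network these products would be realized by linear FC layers, and all values here are random placeholders.

    import numpy as np

    def H(f1, f2, L_A, L_B, c, b):
        # Eq. (5.8): H = 1/2 (L_A f1)'(L_A f1) + 1/2 (L_A f2)'(L_A f2)
        #              + (L_B f1)'(L_B f2) + c'f1 + c'f2 + b
        a1, a2 = L_A @ f1, L_A @ f2
        b1, b2 = L_B @ f1, L_B @ f2
        return 0.5 * a1 @ a1 + 0.5 * a2 @ a2 + b1 @ b2 + c @ f1 + c @ f2 + b

    d = 8
    rng = np.random.default_rng(1)
    print(H(rng.normal(size=d), rng.normal(size=d),
            rng.normal(size=(d, d)), rng.normal(size=(d, d)),
            rng.normal(size=d), 0.1))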
Motivated by the above, [Liu, et al., 2017] propose a general, computationally feasible loss function. Following the notations in the preliminaries and denoting (L_A, L_B, c)ᵀ as W:
$$\mathcal{L}\left(W, \{x_{i}^{+}\}_{i=1}^{M}, \{x_{j}^{-}\}_{j=1}^{N};\, f\right) = \frac{1}{M^{*}} \sum_{i=1}^{M^{*}} \max\left(0,\, -H(f_{i}^{+}, c^{+}) + \frac{\tau}{2}\right) + \frac{1}{N} \sum_{j=1}^{N} \max\left(0,\, H(f_{j}^{-}, c^{+}) + \frac{\tau}{2}\right) \qquad (5.9)$$
Given the mined N+M* training examples in a mini-batch, l(∙) is a label function: if the example x_k is from the positive set, l(x_k) = −1; otherwise, l(x_k) = 1. Moreover, the τ/2 can be simplified to the constant 1, and changing
it to any other positive value results only in the matrices being multiplied by corresponding factors. The hinge-loss-like function is:
$$\mathcal{L}\left(W, \{x_{i}^{+}\}_{i=1}^{M}, \{x_{j}^{-}\}_{j=1}^{N};\, f\right) = \frac{1}{N + M^{*}} \sum_{k=1}^{N+M^{*}} \max\left(0,\, l(x_k) \ast H(f_k, c^{+}) + 1\right) \qquad (5.10)$$
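A minimal sketch of Eq. (5.10), assuming the H(f_k, c⁺) values have already been computed (e.g., as in the previous sketch); the label function l(∙) is −1 for mined positives and +1 for negatives, and the toy numbers are illustrative:

    import numpy as np

    def tuplet_hinge_loss(h_pos, h_neg):
        # h_pos: H(f_k, c+) for the M* mined positives (l(x_k) = -1)
        # h_neg: H(f_k, c+) for the N negatives        (l(x_k) = +1)
        h = np.concatenate([h_pos, h_neg])
        l = np.concatenate([-np.ones_like(h_pos), np.ones_like(h_neg)])
        return np.maximum(0.0, l * h + 1.0).mean()   # 1/(N+M*) sum of hinges

    print(tuplet_hinge_loss(np.array([2.3, 1.7, 0.4]), np.array([-1.9, 0.2])))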
[Liu, et al., 2017] optimize Eq. (5.10) using the standard stochastic gradient descent with momentum. The desired partial derivatives of each example are computed as:
$$\frac{\partial \mathcal{L}}{\partial W^{l}} = \frac{1}{N + M^{*}} \sum_{k=1}^{N+M^{*}} \frac{\partial \mathcal{L}}{\partial X_{k}^{l}} \frac{\partial X_{k}^{l}}{\partial W^{l}} \qquad (5.11)$$

$$\frac{\partial \mathcal{L}}{\partial X_{k}^{l}} = \frac{\partial \mathcal{L}}{\partial X_{k}^{l+1}} \frac{\partial X_{k}^{l+1}}{\partial X_{k}^{l}} \qquad (5.12)$$
where $X_k^l$ represents the feature map of the example x_k at the l-th layer. Eq. (5.11) shows that the overall gradient is the sum of the example-based gradients. Eq. (5.12) shows that the partial derivative of each example with respect to the feature maps can be calculated recursively. So, the gradients of the network parameters can be obtained with the back-propagation algorithm. In fact, as a straightforward generalization of conventional deep metric learning methods, the (N+M)-tuplet clusters loss can easily be used as a drop-in replacement for the triplet loss and its variations, as well as in tandem with other performance-boosting approaches and modules, including modified network architectures, pooling functions, data augmentations and activation functions. The proposed two-branch FC-layer joint metric learning architecture with softmax loss and (N+M)-tuplet clusters loss is denoted as 2B(N+M)Softmax. The convolutional groups of the network are based on the inception FER network presented in [Mollahosseini, et al., 2016]. [Liu, et al., 2017] adopt the parametric rectified linear unit (PReLU) to replace the conventional ReLU for its good performance and generalization ability given limited training data. In addition to providing the sparsity benefits discussed in [Arora, et al., 2014], the inception layer also allows for improved recognition of local features. The locally applied smaller convolution filters seem to align with the way humans process emotions through the deformation of local muscles.
Combining the (N+M)-tuplet clusters loss and the softmax loss is an intuitive improvement to reach better performance. However, conducting them directly on the last FC layer is sub-optimal. The basic idea of building two-branch FC layers after the deep convolution groups is to combine the two losses at different levels of tasks. [Liu, et al., 2017] learn the detailed features shared within the same expression class with the expression classification (EC) branch, while exploiting semantic representations via the metric learning (ML) branch to handle the significant appearance changes between different subjects. The connecting layer embeds the information learned from the expression label-based detail task into the identity label-based semantic task, and balances the scale of the weights in the two task streams. This type of combination can effectively alleviate the interference of identity-specific attributes. The inputs of the connecting layer are the output vectors of the former FC layers, FC2 and FC3, which have the same dimension, denoted as Dinput. The output of the connecting layer, denoted as FC4 with dimension Doutput, is the feature vector fed into the second layer of the ML branch. The connecting layer concatenates the two input feature vectors into a larger vector and maps it into a Doutput-dimensional space:
$$FC_4 = P^{T}[FC_2; FC_3] = P_1^{T} FC_2 + P_2^{T} FC_3 \qquad (5.13)$$
where P is a 2Dinput × Doutput matrix, and P₁ and P₂ are Dinput × Doutput matrices. Regarding the sampling strategy, every training image is used as a query example in an epoch. In practice, the softmax loss is only calculated for the query example. The importance of the two loss functions is balanced by a weight α. During the testing stage, the framework takes one facial image as input and generates the classification result through the EC branch with the softmax loss function.
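The connecting layer of Eq. (5.13) and the α-weighted combination of the two losses can be sketched as follows (dimensions and values are illustrative; this is not the released implementation):

    import numpy as np

    def connecting_layer(fc2, fc3, P1, P2):
        # Eq. (5.13): FC4 = P^T [FC2; FC3] = P1^T FC2 + P2^T FC3
        return P1.T @ fc2 + P2.T @ fc3

    rng = np.random.default_rng(2)
    d_in, d_out = 256, 128
    fc4 = connecting_layer(rng.normal(size=d_in), rng.normal(size=d_in),
                           rng.normal(size=(d_in, d_out)),
                           rng.normal(size=(d_in, d_out)))
    print(fc4.shape)  # (128,)

    # The joint objective balances the EC and ML branches with a weight alpha:
    # total_loss = softmax_loss + alpha * tuplet_clusters_loss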
5.5 Adversarial Training Based Disentanglement

5.5.1 The Structure of Representations

For the latent variation encoding, [Liu, et al., 2019] choose l to be a vector of real values rather than a one-hot or class-ordinal vector, to enable the network to generalize to identities that are not present in the training dataset, as in [Bao, et al., 2018, Makhzani, et al., 2015]. However, as the semantic variations are human-named for a specific domain, this concern is removed for s. In theory, s can be any type of data (e.g., a continuous-value
scalar/vector, or a sub-structure of a natural language sentence) as long as it represents a semantic attribute of x under the framework. For simplicity, [Liu, et al., 2019] consider here the case where s is an N-dimensional binary variable for N to-be-controlled semantic variations. Multi-categorical labels are factorized into multiple binary choices, as sketched below. Domain adaptation can be seen as a special case of this model in which the semantic variation is a Bernoulli variable taking a one-dimensional binary value (i.e., s = {0, 1}) representing the domain.
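A tiny sketch of the factorization just mentioned, reading a multi-categorical semantic label as a one-hot binary code (the one-hot interpretation is an assumption of this sketch):

    import numpy as np

    def factorize_categorical(label, num_classes):
        # One multi-categorical variation -> num_classes binary choices.
        s = np.zeros(num_classes, dtype=np.float32)
        s[label] = 1.0
        return s

    # e.g., 5 lighting conditions as a 5-dimensional binary s
    print(factorize_categorical(2, 5))  # [0. 0. 1. 0. 0.]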
5.5.2 Framework Architecture

The model described in Figure 5.5 is proposed to achieve the objective, based on an encoder-decoder architecture with conditional adversarial training. At inference time, a test image is encoded into d and l in the latent space, and d can be used for the recognition task with the desired invariance property w.r.t. s. Besides, the user can choose the combination of (d, s, l) that is fed to the decoder for different image transforms.
5.5.3 Informative to Main-Recognition Task

The discriminative encoder Ed with parameters θ_Ed maps an input image to its discriminative representation d = Ed(x), which is informative for the main recognition task and invariant to some semantic attributes. By invariance, we mean that given two samples x₁, x₂ from the same subject class (y₁ = y₂) but with different semantic attribute labels (s₁ ≠ s₂), their d₁ and d₂ are expected to be the same. Given the obtained d, [Liu, et al., 2019] expect to predict its corresponding label y with the classifier Cd to model the distribution
Figure 5.5 The encoder-decoder architecture of the FLF framework: the input sample x is encoded into the discriminative representation d = Ed(x) and the latent variation l = El(x); d feeds the softmax classifier Cd (modeling p_Cd(y | d)) and the semantic discriminator Dis (modeling p_Dis(s | d)); l feeds the adversarial classifier Cl (modeling p_Cl(y | l)); and the decoder reconstructs the sample as x̃ = Dec(d, s, l).
p_Cd(y | d). The task of Cd and the first objective of Ed are to ensure the accuracy of the main recognition task. Therefore, [Liu, et al., 2019] update them to minimize:
$$\min_{E_d, C_d} \mathcal{L}_{C_d} = \mathbb{E}_{x, y \sim q(x,s,y)} \left[ -\log p_{C_d}(y \mid E_d(x)) \right] \qquad (5.14)$$
where the categorical cross-entropy loss is used for the classifier. The q(x, s, y) is the true underlying distribution from which the empirical observations are drawn.
5.5.4 Eliminating Semantic Variations

The discriminator Dis outputs the probabilities of an attribute vector, p_Dis(s | d). In the practical implementation, this is done by concatenating d and the binary attribute code s as input and producing values in [0, 1] with sigmoid units. Its loss depends on the current state of the semantic encoder and is written as:
$$\min_{Dis} \max_{E_d} \mathcal{L}_{Dis} = \mathbb{E}_{x, s \sim q(x,s,y)} \left[ -\log p_{Dis}(s \mid E_d(x)) \right] \qquad (5.15)$$
Concretely, Dis and Ed form an adversarial game, in which Dis is trained to detect an attribute of the data by maximizing the likelihood p_Dis(s | d), while Ed fights to conceal it by minimizing the same likelihood. Eq. 5.15 drives d to be marginally independent of s. Supposing that a semantic variation follows the Bernoulli distribution, the loss is formulated as −{s log Dis(d) + (1 − s) log(1 − Dis(d))}. The proposed framework is readily amenable to controlling multiple attributes by extending the dimension of the semantic variation vector. With N to-be-dispelled semantic variations, $\log p_{Dis}(s \mid d) = \sum_{i=1}^{N} \log p_{Dis}(s_i \mid d)$. Note that even with binary attribute values at the training stage, each attribute can be treated as a continuous variable during inference to choose how much a specific attribute is perceivable in the generated images. As discussed above, the semantic discriminator is essentially different from conventional GANs. The feature-level competition is also similar to the adversarial auto-encoder [Makhzani, et al., 2015], which matches an intermediate feature with a prior (Gaussian) distribution. Here, however, the competition is conditioned on another vector s, and the encoder is required to align the feature distributions between any two values of s, instead of only real/fake.
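For binary attributes, the Bernoulli term above is the usual binary cross-entropy; a minimal sketch (sigmoid outputs assumed, all numbers illustrative) is:

    import numpy as np

    def dis_loss(dis_out, s, eps=1e-8):
        # -{s log Dis(d) + (1 - s) log(1 - Dis(d))}, summed over N attributes.
        # Dis descends this loss; E_d ascends it in the alternating game.
        return -(s * np.log(dis_out + eps)
                 + (1.0 - s) * np.log(1.0 - dis_out + eps)).sum()

    # dis_out: sigmoid outputs for 3 attributes; s: binary attribute code
    print(dis_loss(np.array([0.9, 0.2, 0.6]), np.array([1.0, 0.0, 1.0])))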
5.5.5 Eliminating Latent Variation

To train the latent variation encoder El, [Liu, et al., 2019] propose a novel variant of adversarial networks, in which El plays a minimax game with a classifier Cl instead of a discriminator. Cl inspects the background latent variation l and learns to predict the class label correctly, while El tries to eliminate the task-specific factors d by fooling Cl into making false predictions.
$$\min_{C_l} \max_{E_l} \mathcal{L}_{C_l} = \mathbb{E}_{x, y \sim q(x,s,y)} \left[ -\log p_{C_l}(y \mid E_l(x)) \right] \qquad (5.16)$$
Since the ground truth of d is unobservable, [Liu, et al., 2019] use y here, which incorporates d and the main-task-relevant s. [Liu, et al., 2019] also use a softmax output unit and the cross-entropy loss in their implementations. In contrast to using three parallel VAEs [Makhzani, et al., 2015], the adversarial classifier is expected to alleviate the costly training pipeline and facilitate convergence.
5.5.6 Complementary Constraint

The decoder Dec is a deconvolution network that produces a new version of the input image given the concatenated codes (d, s, l). These three parts should contain enough information to allow the reconstruction of the input x. Herein, [Liu, et al., 2019] simply measure the similarity of the reconstruction with the self-regularized mean squared error (MSE):
$$\min_{E_d, E_l, Dec} \mathcal{L}_{rec} = \mathbb{E}_{x, s, y \sim q(x,s,y)} \left[ \, \| Dec(d, s, l) - x \|_{2}^{2} \, \right] \qquad (5.17)$$
This design contributes to variation separation in an implicit way, and makes the encoded features more inclusive of the image content.
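Putting Eqs. (5.14)-(5.17) together, one alternating training step can be sketched as below. The four loss terms are written as plain functions of model outputs; the update schedule in the comments follows the weighting described in Section 5.7.2, and all names and weights are illustrative, not the authors' code.

    import numpy as np

    EPS = 1e-8

    def loss_cd(p_y_given_d, y):   # Eq. (5.14): -log p_Cd(y|d)
        return -np.log(p_y_given_d[y] + EPS)

    def loss_dis(p_s_given_d, s):  # Eq. (5.15): -log p_Dis(s|d), Bernoulli form
        return -(s * np.log(p_s_given_d + EPS)
                 + (1 - s) * np.log(1 - p_s_given_d + EPS)).sum()

    def loss_cl(p_y_given_l, y):   # Eq. (5.16): -log p_Cl(y|l)
        return -np.log(p_y_given_l[y] + EPS)

    def loss_rec(x_rec, x):        # Eq. (5.17): MSE reconstruction
        return np.sum((x_rec - x) ** 2)

    # One alternating step (conceptual):
    #   1) update Cd, Dis and Cl by descending their own losses;
    #   2) update Ed by descending  loss_cd - alpha * loss_dis + beta * loss_rec;
    #   3) update El by descending -loss_cl + lam * loss_rec;
    #   4) update Dec by descending loss_rec.
    print(loss_cd(np.array([0.7, 0.2, 0.1]), 0))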
5.6 Experiments and Analysis

5.6.1 Deep Metric Learning Based Disentanglement for FER

For a raw image in a database, face registration is a crucial step for good performance. The bidirectional warping of the Active Appearance Model (AAM) [30] and a Supervised Descent Method (SDM) model called IntraFace
[45] are used to locate 49 facial landmarks. Then, face alignment is performed to reduce in-plane rotation, and the region of interest is cropped based on the coordinates of these landmarks to a size of 60×60. The limited number of images in FER datasets is a bottleneck for deep model implementation. Thus, an augmentation procedure is employed to increase the volume of training data and reduce the chance of over-fitting. [Liu, et al., 2017] randomly crop 48×48 patches, flip them horizontally and transfer them to grayscale images. All the images are processed with standard histogram equalization and linear plane fitting to remove unbalanced illumination. Finally, [Liu, et al., 2017] normalize them to zero mean and unit variance. In the testing phase, a single center crop of size 48×48 is used as input. Following the experimental protocol in [Mollahosseini, et al., 2016], [Yu and Zhang, 2015], [Liu, et al., 2017] pre-train their convolutional groups and the EC branch FC layers on the FER2013 database [Goodfellow, et al., 2013] for 300 epochs, optimizing the softmax loss using stochastic gradient descent with a momentum of 0.9. The initial learning rate, batch size, and weight decay parameter are set to 0.1, 128, and 0.0001, respectively. If the training loss increases by more than 25% or the validation accuracy does not improve for ten epochs, the learning rate is halved and the previous network with the best loss is reloaded. Then the ML branch is added and the whole network is trained with 204,156 frontal-view (−45° to 45°) face images selected from the CMU Multi-PIE [Gross, et al., 2010] dataset, which contains 337 people displaying disgust, happiness, and surprise as well as neutral expressions. The sizes of both the positive and negative sets are fixed to 3 images, and the weights of the two loss functions are set equally. [Liu, et al., 2017] select the training epoch with the highest accuracy as the pre-trained model. In the fine-tuning stage, the positive and negative set sizes are fixed to 6 images (for CK+ and SFEW) or 5 images (for MMI). For a query example, random search is employed to select the 6 (or 5) other images of the same expression that form the positive set. Identity labels are required for negative mining in this method: CK+ and MMI provide subject IDs, while SFEW needs manual labeling; in practice, an off-the-shelf face recognition method can be used to produce this information. When the query example lacks some expression images from the same subject, the corresponding expression images sharing the same ID with any other positive example are used. The tuplet size is set to 12, which means 12×(6+6) = 144 (or 12×(5+5) = 120) images are fed in each training iteration. [Liu, et al., 2017] use Adam [Kingma and Ba, 2014] for stochastic optimization, and other hyper-parameters such as the learning rate are tuned accordingly via cross-validation. The training-time preprocessing just described is sketched below.
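A minimal NumPy sketch of that augmentation pipeline, assuming the input is already a 60×60 aligned grayscale face (uint8); the illumination plane fitting step is omitted, and all names are illustrative:

    import numpy as np

    def augment(face60, rng):
        # Random 48x48 crop, random horizontal flip, histogram equalization,
        # then zero-mean/unit-variance scaling.
        top, left = rng.integers(0, 13, size=2)        # 60 - 48 + 1 = 13
        patch = face60[top:top + 48, left:left + 48]
        if rng.random() < 0.5:
            patch = patch[:, ::-1]                     # horizontal flip
        hist = np.bincount(patch.ravel(), minlength=256)
        cdf = hist.cumsum() / patch.size               # standard equalization
        patch = (cdf[patch] * 255.0).astype(np.float32)
        return (patch - patch.mean()) / (patch.std() + 1e-8)

    rng = np.random.default_rng(3)
    face = rng.integers(0, 256, size=(60, 60), dtype=np.uint8)
    print(augment(face, rng).shape)  # (48, 48)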
All the CNN architectures are implemented with the widely used deep learning tool Caffe [Jia, et al., 2014]. To evaluate the effectiveness of the proposed method, extensive experiments were conducted on three well-known, publicly available facial expression databases: CK+, MMI and SFEW. For a fair comparison, [Liu, et al., 2017] follow the protocol used by previous works [Mollahosseini, et al., 2016, Yu and Zhang, 2015]. Three baseline methods are employed to demonstrate the superiority of the novel metric learning loss and the two-branch FC-layer network, respectively: adding the (N+1)-tuplet loss or the (N+M)-tuplet clusters loss together with the softmax loss after the EC branch (denoted as 1B(N+1)Softmax and 1B(N+M)Softmax), and combining the (N+1)-tuplet loss with the softmax loss via the two-branch FC-layer structure (2B(N+1)Softmax). With randomly selected triplets, the loss failed to converge during the training phase. The extended Cohn-Kanade database (CK+) [Lucey, et al., 2010] includes 327 sequences collected from 118 subjects, covering 7 different expressions (i.e., anger, contempt, disgust, fear, happiness, sadness, and surprise). The label is only provided for the last frame (peak frame) of each sequence. [Liu, et al., 2017] select and label the last three images, obtaining 921 images (without neutral). The final sequence-level predictions are made by selecting the class with the highest probability over the three images. [Liu, et al., 2017] split the CK+ database into 8 subsets in a strictly subject-independent manner, and 8-fold cross-validation is employed: data from 6 subsets is used for training and the others are used for validation and testing. The confusion matrix of the proposed method evaluated on the CK+ dataset is reported in Table 5.1. It can be observed that the expressions of disgust and happiness are recognized perfectly, while the contempt expression is relatively harder for the network because of the limited training examples and subtle muscular movements. As shown in Table 5.3, the proposed 2B(N+M)Softmax outperforms the hand-crafted feature-based methods, sparse coding-based methods and the other deep learning methods in the comparison. Among them, 3DCNN-DAP, STM-Explet and DTAGN utilized temporal information extracted from sequences. Not surprisingly, it also clearly beats the baseline methods, benefiting from the combination of the novel deep metric learning loss and the two-branch architecture. The MMI database [Pantic, et al., 2005] includes 31 subjects with frontal-view faces among 213 image sequences which contain a full temporal pattern of expressions, i.e., from neutral to one of the six basic expressions and back as time goes on. It is especially favored by video-based methods that exploit temporal information. [Liu, et al., 2017] collect three frames in the middle of each image sequence and associate them with
Table 5.1 Average confusion matrix obtained from the proposed method on the CK+ database.

Actual \ Predicted | AN | CO | DI | FE | HA | SA | SU
AN | 91.1% | 0% | 0% | 1.1% | 0% | 7.8% | 0%
CO | 5.6% | 90.3% | 0% | 2.7% | 0% | 5.6% | 0%
DI | 0% | 0% | 100% | 0% | 0% | 0% | 0%
FE | 0% | 4% | 0% | 98% | 2% | 0% | 8%
HA | 0% | 0% | 0% | 0% | 100% | 0% | 0%
SA | 3.6% | 0% | 0% | 1.8% | 0% | 94.6% | 0%
SU | 0% | 1.2% | 0% | 0% | 0% | 0% | 98.8%
the labels, which results in 624 images in their experiments. [Liu, et al., 2017] divide the MMI dataset into 10 subsets for person-independent ten-fold cross-validation. The sequence-level predictions are obtained by choosing the class with the highest average score over the three images. The confusion matrix of the proposed method on the MMI database is reported in Table 5.2. As shown in Table 5.3, the performance improvements on this small database, achieved without causing overfitting, are impressive. The proposed method outperforms other works that also use static image-based features and can achieve comparable and even better results than the video-based approaches.

Table 5.2 Average confusion matrix obtained from the proposed method on the MMI database.

Actual \ Predicted | AN | DI | FE | HA | SA | SU
AN | 81.8% | 3% | 3% | 1.5% | 10.6% | 0%
DI | 10.9% | 71.9% | 3.1% | 4.7% | 9.4% | 0%
FE | 5.4% | 8.9% | 41.4% | 7.1% | 7.1% | 30.4%
HA | 1.1% | 3.6% | 0% | 92.9% | 2.4% | 0%
SA | 17.2% | 7.8% | 0% | 1.6% | 73.4% | 0%
SU | 7.3% | 0% | 14.6% | 0% | 0% | 79.6%
Table 5.3 Recognition accuracy comparison on the CK+ database [26] in terms of seven expressions, the MMI database [Pantic, et al., 2005] in terms of six expressions, and the SFEW database in terms of seven expressions.

Methods | CK+ | MMI
MSR [33] | 91.4% | N/A
ITBN [44] | 91.44% | 59.7%
BNBN [25] | 96.7% | N/A
IB-CNN [11] | 95.1% | N/A
3DCNN-DAP [23] | 92.4% | 63.4%
STM-Explet [24] | 94.19% | 75.12%
DTAGN [16] | 97.25% | 70.2%
Inception [28] | 93.2% | 77.6%
1B(N+1)Softmax | 93.21% | 77.72%
2B(N+1)Softmax | 94.3% | 78.04%
1B(N+M)Softmax | 96.55% | 77.88%
2B(N+M)Softmax | 97.1% | 78.53%

Methods | SFEW
Kim et al. [Kim, et al., 2016] | 53.9%
Ng et al. [Ng, et al., 2015] | 48.5%
Yao et al. [Yao, et al., 2015] | 43.58%
Sun et al. [Sun, et al., 2015] | 51.02%
Zong et al. [Zong, et al., 2015] | N/A
Kaya et al. [Kaya and Salah, 2016] | 53.06%
Mao et al. [Mao, et al., 2016] | 44.7%
Mollahosseini [Mollahosseini, et al., 2016] | 47.7%
1B(N+1)Softmax | 49.77%
2B(N+1)Softmax | 50.75%
1B(N+M)Softmax | 53.36%
2B(N+M)Softmax | 54.19%
The static facial expressions in the wild (SFEW) database [Dhall, et al., 2015] was created by extracting frames from film clips in the AFEW data corpus. There are 1,766 well-labeled images (958 for training, 436 for validation and 372 for testing), each assigned to one of 7 expressions. Different from the previous two datasets, it targets unconstrained facial expressions with large variations reflecting real-world conditions. The confusion matrix of the method on the SFEW validation set is reported in Table 5.4. The recognition accuracies for disgust and fear are much lower than the others, which is also observed in other works. As illustrated in Table 5.3, the CNN-based methods dominate the ranking list. With the augmentation of deep metric learning and the two-branch FC-layer network, the proposed method works well in the real-world environment setting. Note that Kim et al. [Kim, et al., 2016] employed 216 AlexNet-like CNNs with different architectures to boost the final performance. The network here performs about 25M operations, almost four times fewer than a single AlexNet. With this smaller size, the evaluation time in the testing phase takes only 5 ms on a Titan X GPU, which makes the method applicable for real-time applications. Overall, one can see that jointly optimizing the metric learning loss and the softmax loss successfully captures more discriminative expression-related features and translates them into a significant improvement in FER accuracy. The (N+M)-tuplet clusters loss not only inherits the merits of conventional deep metric learning methods, but also learns features in a more efficient and stable way.
Table 5.4 Average confusion matrix obtained from the proposed method on the SFEW validation set.

Actual \ Predicted | AN | DI | FE | HA | NE | SA | SU
AN | 66.24% | 1.3% | 0% | 6.94% | 9.09% | 5.19% | 10.69%
DI | 21.74% | 4.35% | 4.35% | 30.34% | 13.04% | 4.35% | 21.74%
FE | 27.66% | 0% | 6.38% | 8.51% | 10.64% | 19.15% | 27.66%
HA | 0% | 0% | 0% | 87.67% | 6.85% | 1.37% | 4.11%
NE | 5.48% | 0% | 2.74% | 1.37% | 57.53% | 5.48% | 27.4%
SA | 22.81% | 0% | 1.75% | 7.02% | 8.77% | 40.35% | 19.3%
SU | 1.16% | 0% | 2.33% | 5.81% | 17.44% | 0% | 73.26%
The two-branch FC layer can further boost performance. Some nice properties of the proposed method are verified by Figure 5.6, where the training loss of 2B(N+M)Softmax converges after about 40 epochs with a steadier decline and reaches a lower value than the baseline methods, as expected. The validation accuracy curves in Figure 5.6 likewise show that the proposed method outperforms the baseline methods during training.
Figure 5.6 The training losses and validation accuracies of different methods on the SFEW validation set.
Figure 5.7 t-SNE visualization of images in Extended YaleB. The original images (b) cluster according to their lighting environments, while the discriminative representation learned by our framework (a) clusters mainly by identity.
5.6.2 Adversarial Training-Based Disentanglement

To illustrate the behavior of the Feature-level Frankenstein (FLF) framework, [Liu, et al., 2019] quantitatively evaluate the discriminative representation with the desired invariance property on three different recognition tasks, and also offer qualitative evaluations by visually examining the perceptual quality of conditional face generation. As the usual metrics (e.g., log-likelihood of a set of validation samples) are not meaningful for perceptual generative models [Sun, et al., 2017], [Liu, et al., 2019] measure the information associated with the semantic variations s or the main-task label y that is contained in each representation part, to evaluate the degree of disentanglement as in [Liu, et al., 2015, Makhzani, et al., 2015]. In all experiments, [Liu, et al., 2019] utilize the Adam optimization method [Kingma and Ba, 2014] with a learning rate of 0.001 and beta of 0.9 for training the encoders-decoder network, the discriminator and the classifiers. [Liu, et al., 2019] use a variable weight α for the discriminator loss coefficient: α is initially set to 0 and the model is trained as a normal auto-encoder; then α is linearly increased to 0.5 over the first 500,000 iterations to slowly encourage the model to produce invariant representations. This scheduling turned out to be critical in their experiments; without it, they observed that Ed was affected too strongly by the loss coming from the discriminator, even for low values of α. All the models were implemented using TensorFlow.
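The α warm-up just described can be written as a one-line schedule (a sketch; the function name is illustrative):

    def alpha_schedule(step, warmup=500_000, alpha_max=0.5):
        # alpha = 0 at the start (plain auto-encoder), growing linearly
        # to alpha_max over the first `warmup` iterations.
        return alpha_max * min(step / warmup, 1.0)

    print(alpha_schedule(0), alpha_schedule(250_000), alpha_schedule(1_000_000))
    # 0.0 0.25 0.5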
For the lighting-tolerant classification task, [Liu, et al., 2019] use the Extended Yale B dataset [Georghiades, et al., 2001]. It comprises face images of 38 subjects under 5 different lighting conditions, i.e., front, upper left, upper right, lower left, or lower right. [Liu, et al., 2019] aim to predict the subject identity y using d. The semantic variable s to be purged here is the lighting condition, while the latent variation l does not have practical meaning in this dataset setting. [Liu, et al., 2019] follow the two-layer Ed structure and the train/test split of [Li, et al., 2018, Liu, et al., 2015]. For training, 190 samples are utilized, and all remaining 1,096 images are used for testing. The numerical results of recognition using Ed and Cd are shown in Table 5.5. [Liu, et al., 2019] compare with the state-of-the-art methods that use MMD regularization etc. to remove the effects of lighting conditions [Li, et al., 2018, Liu, et al., 2015]. The advantage of their framework in factoring out lighting conditions is shown by the improved accuracy of 90.1%, while the best baseline achieves an accuracy of 86.6%. Although lighting conditions can be modeled very well with a Lambertian model, [Liu, et al., 2019] choose to use a generic neural network to learn invariant features, so that the proposed method can readily be applied to other applications. In terms of removing s, their framework can filter the lighting conditions, since the accuracy of classifying s from d drops from 56.5% to 26.2% (halved), as shown in Table 5.5. Note that 20% is the chance performance for the 5-class illumination problem, reached when s is totally dispelled. This can also be seen in the visualization of two-dimensional embeddings of the original x: the original images are clustered based on the lighting conditions. The clustering based on CNN features is almost entirely by identity, but is still affected by the lighting and results in a "black center".
Table 5.5 Classification accuracy comparisons on Extended YaleB.

Method | (y | d) | (s | d) | (y | l) | (s | l)
Original x as d | 78.0% | 96.1% | − | −
Li | 82% | − | − | −
Louizos | 84.6% | 56.5% | − | −
Daniel | 86.6% | 47.9% | − | −
Proposed | 90.1% | 26.2% | 8.7% | 30.5%
Once the lighting variations are removed via FLF, the images are distributed almost solely according to the identity of each subject. Table 5.5 shows the classification accuracy comparisons. The accuracy of classifying y or s from l is expected to be low, while a better discriminative representation d yields a higher accuracy in classifying y and a lower accuracy in predicting s. *Following the setting in [Liu, et al., 2015], [Liu, et al., 2019] utilize a logistic regression classifier for the accuracy of predicting s, and for predicting y from the original x. The to-be-dispelled s represents the source dataset (i.e., domain) on DIGITS and the lighting condition on Extended YaleB; both are main-task-irrelevant semantic variations. [Liu, et al., 2019] evaluate the desired makeup-invariance property of the learned discriminative representation on three makeup benchmarks. In detail, [Liu, et al., 2019] train their framework on the CelebA dataset [Liu, et al., 2018], a face dataset with 202,599 face images from more than 10K subjects and 40 different binary attribute labels. [Liu, et al., 2019] adapt Ed and Cd from VGG-16 [Perarnau, et al., 2016], and the d extracted in the testing stage is directly utilized for the open-set recognitions [Liu, et al., 2017], without fine-tuning on the makeup datasets, as in the VGG baseline method. The PR 2017 dataset [Sharmanska, et al., 2012] collected 406 makeup and non-makeup images of 203 females from the Internet. The TCSVT 2014 dataset [Guo, et al., 2013] incorporates 1,002 face images. The FAM dataset [Hu, et al., 2013] involves 222 males and 297 females, with 1,038 images belonging to 519 subjects in total. It is worth noting that all these images were acquired under uncontrolled conditions. [Liu, et al., 2019] follow the protocol provided in [LeCun, et al., 1998], and the rank-1 average accuracies of FLF and the state-of-the-art methods are reported in Table 5.6 as a quantitative evaluation. The performances of [LeCun, et al., 1998], the VGG baseline and FLF benefit from the large-scale training dataset CelebA. Note that the CelebA dataset used by FLF and the baseline, and the even larger MS-Celeb-1M database [Guo, et al., 2016] used in [LeCun, et al., 1998], already incorporate several makeup variations. With the prior information about the makeup recognition datasets, [Liu, et al., 2019] systematically enforce the network to be invariant to the makeup-related attributes, which incorporate both id-relevant variations (e.g., hair color) and id-irrelevant variations (e.g., smiling or not). Dispelling the id-relevant attributes usually degrades the recognition accuracy on the original CelebA dataset, but achieves better generalization on the makeup face recognition datasets. Since these attributes are very likely to change for the subjects in the makeup face recognition datasets, FLF can extract more discriminative features with better generalization ability.
Table 5.6 Comparisons of the rank-1 accuracy and TPR@FPR=0.1% on three makeup datasets.

PR2017: Methods | Acc | TPR
[Sharmanska, et al., 2012] | 68.0% | −
[LeCun, et al., 1998] | 92.3% | 38.9%
VGG | 82.7% | 34.7%
Proposed | 94.6% | 45.9%

TCSVT2014: Methods | Acc | TPR
[Louizos, et al., 2015] | 82.4% | −
[LeCun, et al., 1998] | 94.8% | 65.9%
VGG | 84.5% | 59.5%
Proposed | 96.2% | 71.4%

FAM: Methods | Acc | TPR
[Hu, et al., 2013] | 62.4% | −
[Kushwaha, et al., 2018] | 82.6% | −
VGG | 80.8% | 48.3%
Proposed | 91.4% | 58.6%
Table 5.7 Summary of the 40 face attributes provided with the CelebA and LFWA dataset. We expect the network to learn to be invariant to the makeup-related attributes for our makeup face recognition task. *We noticed a degradation of recognition accuracy on the CelebA dataset when dispelling these attributes.

Att.Id Attr.Def | Att.Id Attr.Def | Att.Id Attr.Def | Att.Id Attr.Def
1 5'O Shadow | 11 Blurry | 21 Male | 31 Sideburns
2 Arched Eyebr | 12 Brown Hair* | 22 Mouth Open | 32 Smiling
3 Attractive | 13 Bushy Eyebr | 23 Mustache | 33 Straight Hair
4 Eyes Bags | 14 Chubby | 24 Narrow Eyes | 34 Wavy Hair
5 Bald* | 15 Double Chin | 25 No Beard | 35 Earrings
6 Bangs | 16 Eyeglasses | 26 Oval Face | 36 Hat
7 Big Lips | 17 Goatee | 27 Pale Skin | 37 Lipstick
8 Big Nose | 18 Gray Hair* | 28 Pointy Nose | 38 Necklace
9 Black Hair* | 19 Makeup | 29 Hairline | 39 Necktie
10 Blond Hair* | 20 Cheekbones | 30 Rosy Cheeks | 40 Young*
By utilizing the valuable side labels (both main-task and attribute labels) in CelebA in a controllable way, [Liu, et al., 2019] achieve more than a 10% improvement over the baseline, and outperform the state of the art by ≥ 5.5% w.r.t. TPR@FPR=0.1% on all datasets. [Liu, et al., 2019] also conduct open-set identification experiments on CelebA with an ID-independent 5-fold protocol. In Table 5.7, [Liu, et al., 2019] show which 18 attributes can increase the generalization in CelebA, while 6 attributes degrade the accuracy in CelebA yet improve the performance in makeup face recognition. The accuracy of FLF on CelebA after dispelling these 18 kinds of attributes is significantly better than its baselines. The VGG baseline does not utilize the attribute labels, and the 19-head model is a typical multi-task learning framework which can be distracted by task-irrelevant s. Conversely, [Liu, et al., 2019] can flexibly change the main task to attribute recognition and dispel the identity information. As shown in Table 5.9, FLF outperforms the previous methods with a relatively simple backbone following the standard evaluation protocol of the CelebA and LFWA [Liu, et al., 2018] benchmarks. To further verify the quality of the semantic variation dispelling, [Liu, et al., 2019] show some of the conditionally generated images in Figure 5.8, given input samples from the test set of CelebA. Without changing the attribute vector s, the three complementary parts maintain most of the information needed to reconstruct the input samples. Benefiting from the information encoded in the latent variation vector l, the background can be well maintained. [Liu, et al., 2019] are able to change any semantic attribute incorporated in s, while keeping d and l for identity-preserved attribute transforms, which achieves higher naturalness than the previous pixel-space IcGAN [Peng, et al., 2017]. The methods commonly used in vision to assess the visual quality of generated images (e.g., the Markovian discriminator [Jayaraman, et al., 2014]) could be applied on top of this model for better texture, although that is not the focus here.
Table 5.8 Face recognition accuracy on the CelebA dataset.

Methods | Rank-1 accuracy
VGG | 85.4%
19-head (1ID+18attr) | 81.1%
FLF | 92.7% (↑22.7%)
Figure 5.8 Using MI-FLF to swap some face attributes while keeping ID and background. Given an input image (left), we show its reconstruction without attribute changes and with hair color or expression changes.
The 5-hour training on a K40 GPU is 3× faster than the pixel-level IcGAN [Peng, et al., 2017], without the subsequent training on generated images for recognition, and the inference time in the testing phase is the same as for VGG. The disguised faces in the wild (DFW) dataset [Lample, et al., 2017] is a recently released benchmark with 11,157 images from 1,000 subjects. The mainstream methods usually choose CelebA as the pre-training dataset, despite the fact that DFW has a slightly larger domain gap with CelebA than the makeup datasets. In Table 5.10, [Liu, et al., 2019] show that FLF largely improves the VGG baseline, by 18% and 20.9% w.r.t. GAR@1%FAR and GAR@0.1%FAR, respectively. It can also be used as a pre-training scheme (FLF+MIRA) to complement the state-of-the-art methods for better performance. The face recognition performance is compared in Table 5.10.
Table 5.9 Face attribute recognition accuracy on the CelebA and LFWA datasets. The two datasets are trained and tested separately.

Methods | Backbone | CelebA | LFWA
[Liu, et al., 2018] | AlexNet | 87.30% | 83.85%
[Louizos, et al., 2015] | VGG-16 | 91.20% | –
[Liu, et al., 2018] | InceptionResNet | 87.82% | 83.16%
[He, et al., 2018] | ResNet50 | 91.81% | 85.28%
FLF | VGG-16 | 93.26% | 87.82%
Table 5.10 Face recognition on the DFW dataset.

Methods | GAR@1%FAR | GAR@0.1%FAR
VGG | 33.76% | 17.73%
FLF | 51.78% (↑18.02%) | 38.64% (↑20.91%)
MIRA | 89.04% | 75.08%
FLF+MIRA | 91.30% (↑2.26%) | 78.55% (↑3.47%)
5.7 Discussion

5.7.1 Independent Analysis

The three complementary parts are expected to be uncorrelated with each other. The s is marginally independent of d and l, since its short code cannot incorporate the other information. [Liu, et al., 2019] learn d to be discriminative for the main recognition task and marginally independent of s by maximizing the certainty of making main-task predictions and the uncertainty of inferring the semantic variations given d. Given l, minimizing the certainty of making main-task (y) predictions makes l marginally independent of d and of some of the task-dependent s. Considering the complexity of the framework, [Liu, et al., 2019] do not strictly require the learned l to be marginally independent of the task-irrelevant s. A ground truth label of l also does not exist in the datasets to supervise d to be marginally independent of the latent variation l. Instead, [Liu, et al., 2019] limit the output dimensions of Ed and El as an information bottleneck to implicitly require d and l to incorporate little unexpected information [Kingma, et al., 2014, Theis, et al., 2015]. Additionally, a reconstruction loss is utilized as the complementary constraint, which prevents d and l from containing nothing.
5.7.2 Equilibrium Condition

Several trade-off parameters constrained between 0 and 1 are used to balance the judiciously selected loss functions. The El is trained to minimize $(-\mathcal{L}_{C_l} + \lambda \mathcal{L}_{rec})$, where λ weighs the relevance of the latent representation to the class label against the quality of the reconstruction.
The Ed is updated by minimizing $(\mathcal{L}_{C_d} - \alpha \mathcal{L}_{Dis} + \beta \mathcal{L}_{rec})$. The $\mathcal{L}_{rec}$ works as a complementary constraint, so β is usually given a relatively small value; [Liu, et al., 2019] omit this term for simplicity when analyzing the function of α. The objective of semantic variation dispelling can be formulated as:
$$\min_{E_d, C_d} \max_{Dis} \; \mathbb{E}_{x,s,y \sim q(x,s,y)} \left[ -\log p_{C_d}(y \mid E_d(x)) + \alpha \log p_{Dis}(s \mid E_d(x)) \right] \qquad (5.18)$$
[Liu, et al., 2019] explain how the tasks of preserving d and eliminating s are balanced in the game under non-parametric assumptions (i.e., assuming a model with infinite capacity). Two scenarios are discussed, in which s is dependent on or independent of y. Considering that both Cd and Dis use d, which is transformed deterministically from x, [Liu, et al., 2019] substitute x with d and define a joint distribution $q(d, s, y) = \int_x q(x, s, y)\, p(d \mid x)\, dx$. Since the Ed is a deterministic transformation, $p(d \mid x)$ is merely a delta function denoted by δ(·); then $q(d, s, y) = \int_x q(x, s, y)\, \delta(E_d(x) = d)\, dx$, which depends on the transformation defined by the Ed. Intuitively, d absorbs the randomness in x and has an implicit distribution of its own. [Liu, et al., 2019] equivalently rewrite Eq. 5.18 as:
$$\min_{E_d, C_d} \max_{Dis} \; \mathbb{E}_{d,s,y \sim q(d,s,y)} \left[ -\log p_{C_d}(y \mid d) + \alpha \log p_{Dis}(s \mid d) \right] \qquad (5.19)$$
To analyze the equilibrium condition of the new objective Eq. 5.19, [Liu, et al., 2019] first deduce the optimal Cd and Dis for a given Ed, and then prove global optimality. For a given fixed Ed, the optimal Cd outputs $p_{C_d}^{*}(y \mid d) = q(y \mid d)$, and the optimal Dis corresponds to $p_{Dis}^{*}(s \mid d) = q(s \mid d)$. [Liu, et al., 2019] use the fact that the objective is functionally convex w.r.t. each distribution, and by taking the variations, the stationary points for $p_{C_d}$ and $p_{Dis}$ are obtained as functions of q(d, s, y). The optimal $p_{C_d}^{*}(y \mid d)$ and $p_{Dis}^{*}(s \mid d)$ given in Claim 1 are both functions of the encoder Ed. Thus, by plugging $p_{C_d}^{*}$ and $p_{Dis}^{*}$ into Eq. 5.19, it becomes a minimization problem only w.r.t. the Ed with the following form:
$$\min_{E_d} \; \mathbb{E}_{q(d,s,y)} \left[ -\log q(y \mid d) + \alpha \log q(s \mid d) \right] = \min_{E_d} \; H(q(y \mid d)) - \alpha H(q(s \mid d)) \qquad (5.20)$$
where H(q(y | d)) and H(q(s | d)) are the conditional entropies of the distributions q(y | d) and q(s | d), respectively. As we can see, the objective consists of two conditional entropies with different signs. Minimizing the first term increases the certainty of predicting y based on d. In contrast, minimizing the second term with the negative sign amounts to maximizing the uncertainty of inferring s based on d, which essentially filters any information about the semantic variations out of the discriminative representation.
• For the cases where the attribute s is entirely independent of the main recognition task, the two terms can reach their optima simultaneously, leading to a win-win equilibrium. For instance, with the lighting effects on a face image removed, we can better identify the subject. With sufficient model capacity, the optimal equilibrium solution is the same regardless of the value of α.
• We may also encounter cases where the two objectives compete with each other. For example, learning a representation with a task-dependent semantic variation dispelled may harm the original main-task performance. Hence the optima of the two entropies cannot be achieved at the same time, and the relative strengths of the two objectives in the final equilibrium are controlled by α.
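The trade-off in Eq. (5.20) can also be checked numerically. The sketch below estimates the two conditional entropies from per-sample predictive distributions (each row is a q(∙ | d) for one sample; all numbers are toy values):

    import numpy as np

    def cond_entropy(p):
        # Mean entropy of the per-sample conditional distributions.
        return -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))

    def eq_5_20(p_y_given_d, p_s_given_d, alpha):
        # H(q(y|d)) - alpha * H(q(s|d))
        return cond_entropy(p_y_given_d) - alpha * cond_entropy(p_s_given_d)

    p_y = np.array([[0.9, 0.1], [0.8, 0.2]])  # confident main-task predictions
    p_s = np.array([[0.5, 0.5], [0.5, 0.5]])  # s fully dispelled: maximal entropy
    print(eq_5_20(p_y, p_s, alpha=0.5))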
5.8 Conclusion

Extracting a feature representation that is not only informative for the main recognition task but also invariant to specific notorious factors is an important objective in visual recognition. This chapter systematically summarized the possible factors and introduced two practical solutions to achieve the disentanglement in a controllable manner. Specifically, [Liu, et al., 2017] derive the (N+M)-tuplet clusters loss and combine it with the softmax loss in a unified two-branch FC-layer joint metric learning CNN architecture to alleviate the attribute variations introduced by different identities in FER. Efficient identity-aware
negative-mining and online positive-mining schemes are employed. After evaluating the performance on posed and spontaneous FER datasets, [Liu, et al., 2017] show that the proposed method outperforms the previous softmax loss-based deep learning approaches in its ability to extract expression-related features. More appealingly, the (N+M)-tuplet clusters loss function has a clear intuition and geometric interpretation for generic applications. [Liu, et al., 2019] present a solution to extract a discriminative representation inheriting the desired invariance in a controllable way, without paired semantic transform examples or latent labels. Its recognition stage does not need generated images as training data. As a result, [Liu, et al., 2019] show that the invariant representation is learned and that the three parts are complementary to each other. Considering both the labeled semantic variations and the unlabeled latent variation can be a promising direction of development for many real-world applications.
References

Arora S., Bhaskara A., Ge R., and Ma T. 'Provable bounds for learning some deep representations', In International Conference on Machine Learning, 2014, P.584-592.
Baltrušaitis T., Mahmoud M., and Robinson P. 'Cross-dataset learning and person-specific normalisation for automatic action unit detection', In 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2015, P.1-6.
Bao J., Chen D., Wen F., Li H., and Hua G. 'Towards open-set identity preserving face synthesis', In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, P.6713-6722.
Barsoum E., Zhang C., Ferrer C.C., and Zhang Z. 'Training deep networks for facial expression recognition with crowd-sourced label distribution', In Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, P.279-283.
Bengio Y. 'Learning deep architectures for AI', Foundations and Trends in Machine Learning, 2009, 2, (1), P.1-127.
Cao J., Katzir O., Jiang P., Lischinski D., Cohen-Or D., Tu C., and Li Y. 'Dida: Disentangled synthesis for domain adaptation', arXiv preprint, 2018, arXiv:1805.08019.
Che T., Liu X., Li S., Ge Y., Zhang R., Xiong C., and Bengio Y. 'Deep verifier networks: Verification of deep discriminative models with deep generative models', arXiv preprint, 2019, arXiv:1911.07421.
Chopra S., Hadsell R., and LeCun Y. 'Learning a similarity metric discriminatively, with application to face verification', In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, P.539-546.
Dhall A., Ramana Murthy O., Goecke R., Joshi J., and Gedeon T. 'Video and image based emotion recognition challenges in the wild: Emotiw 2015', In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, P.423-426.
Ding S., Lin L., Wang G., and Chao H. 'Deep feature learning with relative distance comparison for person re-identification', Pattern Recognition, 2015, 48, (10), P.2993-3003.
Dong H., Neekhara P., Wu C., and Guo Y. 'Unsupervised image-to-image translation with generative adversarial networks', arXiv preprint, 2017, arXiv:1701.02676.
Edwards H., and Storkey A. 'Censoring representations with an adversary', arXiv preprint, 2015, arXiv:1511.05897.
Elgammal A., and Lee C.-S. 'Separating style and content on a nonlinear manifold', In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
Fu Y., Hospedales T.M., Xiang T., and Gong S. 'Learning multimodal latent attributes', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36, (2), P.303-316.
Georghiades A.S., Belhumeur P.N., and Kriegman D. 'From few to many: Illumination cone models for face recognition under variable lighting and pose', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23, (6), P.643-660.
Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., and Bengio Y. 'Generative adversarial nets', In Advances in Neural Information Processing Systems, 2014, P.2672-2680.
Goodfellow I.J., Erhan D., Carrier P.L., Courville A., Mirza M., Hamner B., Cukierski W., Tang Y., Thaler D., and Lee D.-H. 'Challenges in representation learning: A report on three machine learning contests', In International Conference on Neural Information Processing, 2013, P.117-124.
Gross R., Matthews I., Cohn J., Kanade T., and Baker S. 'Multi-PIE', Image and Vision Computing, 2010, 28, (5), P.807-813.
Guo G., Wen L., and Yan S. 'Face authentication with makeup changes', IEEE Transactions on Circuits and Systems for Video Technology, 2013, 24, (5), P.814-825.
Guo Y., Zhang L., Hu Y., He X., and Gao J. 'Ms-celeb-1m: A dataset and benchmark for large-scale face recognition', In European Conference on Computer Vision, 2016, P.87-102.
Hadad N., Wolf L., and Shahar M. 'A two-step disentanglement method', In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, P.772-780.
He K., Fu Y., Zhang W., Wang C., Jiang Y.-G., Huang F., and Xue X. 'Harnessing synthesized abstraction images to improve facial attribute recognition', In International Joint Conferences on Artificial Intelligence, 2018, P.733-740.
Disentanglement For Discriminative Visual Recognition 181 Horiguchi S., Ikami D., and Aizawa K. ‘Significance of softmax-based features over metric learning-based features’, In International Conference on Learning Representations, 2016. Hu G., Hua Y., Yuan Y., Zhang Z., Lu Z., Mukherjee S.S., Hospedales T.M., Robertson N.M., and Yang Y. ‘Attribute-enhanced face recognition with neural tensor fusion networks’, in In Proceedings of the IEEE International Conference on Computer Vision, 2017, P.3744-3753. Hu J., Ge Y., Lu J., and Feng X. ‘Makeup-robust face verification’, in ‘Book Makeuprobust face verification’ IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, P.2342-2346. Hu Q., Szabó A., Portenier T., Favaro P., and Zwicker M. ‘Disentangling factors of variation by mixing them’, in In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, P.3399-3407. Huang R., Zhang S., Li T., and He R. ‘Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis’, In Proceedings of the IEEE International Conference on Computer Vision, 2017, P.2439-2448. Jain S., Hu C., and Aggarwal J.K. ‘Facial expression recognition with temporal modeling of shapes’, In 2011 IEEE International Conference on Computer Vision Workshops, 2011, P.1642-1649. Jayaraman D., Sha F., and Grauman K. ‘Decorrelating semantic visual attributes by resisting the urge to share’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, P.1629-1636. Jha A.H., Anand S., Singh M., and Veeravasarapu V. ‘Disentangling factors of variation with cycle-consistent variational auto-encoders’, In European Conference on Computer Vision, 2018, P.829-845. Jia Y., Shelhamer E., Donahue J., Karayev S., Long J., Girshick R., Guadarrama S., and Darrel, T. ‘Caffe: Convolutional architecture for fast feature embedding’, in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, P.675-678. Jiang B., Valstar M.F., and Pantic M. ‘Action unit detection using sparse appearance descriptors in space-time video volumes’, In Face and Gesture, 2011, P.314-321. Jiang H., Wang R., Shan S., Yang Y., and Chen X. ‘Learning discriminative latent attributes for zero-shot classification’, In Proceedings of the IEEE International Conference on Computer Vision, 2017, P.4223-4232. Kaya H., and Salah A.A. ‘Combining modality-specific extreme learning machines for emotion recognition in the wild’, Journal on Multimodal User Interfaces, 2016, 10, (2), P.139-149. Kim B.-K., Roh J., Dong S.-Y., and Lee S.-Y. ‘Hierarchical committee of deep convolutional neural networks for robust facial expression recognition’, Journal on Multimodal User Interfaces, 2016, 10, (2), P.173-189. Kingma D.P., and Ba J.J. ‘Adam: A method for stochastic optimization’, arXiv preprint 2014 arXiv:1412.6980.
182 Recognition and Perception of Images Kingma D.P., Mohamed S., Rezende D.J., and Welling M. ‘Semi-supervised learning with deep generative models’, in In Advances in Neural Information Processing Systems, 2014, P.3581-3589. Kingma D.P., and Welling M. ‘Auto-encoding variational bayes’, arXiv preprint arXiv:1312.6114 (2013). Kushwaha V., Singh M., Singh R., Vatsa M., Ratha N., and Chellappa R. ‘Disguised faces in the wild’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, P.1-9. Lample G., Zeghidour N., Usunier N., Bordes A., Denoyer L., and Ranzato M.A. ‘Fader networks: Manipulating images by sliding attributes’, In Advances in Neural Information Processing Systems, 2017, P.5967-5976. LeCun Y., Bottou L., Bengio Y., and Haffner P. ‘Gradient-based learning applied to document recognition’, Proceedings of the IEEE. 1998, 86, (11), P.2278-2324. Li Y., Song L., Wu X., He R., and Tan T. ‘Anti-makeup: Learning a bi-level adversarial network for makeup-invariant face verification’, In Thirty-Second AAAI Conference on Artificial Intelligence, 2018, P.23-39. Li Y., Swersky K., and Zemel R. ‘Learning unbiased features’, arXiv preprint 2014 arXiv:1412.5244. Li Y., Tian X., Gong M., Liu Y., Liu T., Zhang K., and Tao D. ‘Deep domain generalization via conditional invariant adversarial networks’, In Proceedings of the European Conference on Computer Vision, 2018, P.624-639. Li Y., Wang R., Liu H., Jiang H., Shan S., and Chen X. ‘Two birds, one stone: Jointly learning binary code for large-scale face image retrieval and attributes prediction’, In Proceedings of the IEEE International Conference on Computer Vision, 2015, P.3819-3827. Li Z., Chang S., Liang F., Huang T.S., Cao L., and Smith J.R. ‘Learning locally-adaptive decision functions for person verification’, In Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, P.3610-3617. Liu H., Tian Y., Yang Y., Pang L., and Huang T. ‘Deep relative distance learning: Tell the difference between similar vehicles’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, P.2167-2175. Liu M.-Y., Breuel T., and Kautz J. ‘Unsupervised image-to-image translation networks’, In Advances in Neural Information Processing Systems, P.700-708. Liu X. ‘Research on the technology of deep learning based face image recognition’, Thesis, 2019. 148p. Liu X., Fan F., Kong L., Diao Z., Xie W., Lu J., and You J. ‘Unimodal Regularized Neuron Stick-breaking for Ordinal Classification’, Neurocomputing, 2020. Liu X., Ge Y., Yang C., and Jia P. ‘Adaptive metric learning with deep neural networks for video-based facial expression recognition’, Journal of Electronic Imaging, 2018, 27, (1), P.13-22 Liu X., Guo Z., Li S., Kong L., Jia P., You J., and Kumar B. ‘Permutation-invariant feature restructuring for correlation-aware image set-based recognition’, In Proceedings of the IEEE International Conference on Computer Vision, 2019, P.4986-4996.
Disentanglement For Discriminative Visual Recognition 183 Liu X., Guo Z., You J., and Kumar B. ‘Attention Control with Metric Learning Alignment for Image Set-based Recognition’, arXiv preprint arXiv:1908.01872, 2019. Liu X., Guo Z., You J., Kumar B.V. ‘Dependency-Aware Attention Control for Image Set-Based Face Recognition’, IEEE Transactions on Information Forensics and Security, 2019, 15, P.1501-1512. Liu X., Han X., Qiao Y., Ge Y., Li S., and Lu J. ‘Unimodal-uniform constrained wasserstein training for medical diagnosis’, In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, P.274-298. Liu X., Ji W., You J., Fakhri G., and Woo J. ‘Severity-Aware Semantic Segmentation with Reinforced Wasserstein Training’, in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, P.274-298. Liu X., Kong L., Diao Z., and Jia P. ‘Line-scan system for continuous hand authentication’, Optical Engineering 2017, 56, (3), P.331. Liu X., Kumar B.V., Ge Y., Yang C., You J., and Jia P. ‘Normalized face image generation with perceptron generative adversarial networks’, In 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis, 2018, P.1-8. Liu X., Kumar B.V., Jia P., and You J. ‘Hard negative generation for identity-disentangled facial expression recognition’, Pattern Recognition, 2019, 88, P.1-12. Liu X., Li S., Kong L., Xie W., Jia P., You J., and Kumar B. ‘Feature-level frankenstein: Eliminating variations for discriminative recognition’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, P.637-646. Liu X., Li S., Guo Z., Lu J., You J., Xing F., Fakhri G., and Woo J. ‘Adversarial Unsupervised Domain Adaptation under Covariant and Label Shift: Infer, Align and Iterate’, In European Conference on Computer Vision, 2020. Liu X., Li Z., Kong L., Diao Z., Yan J., Zou Y., Yang C., Jia P., and You J. ‘A joint optimization framework of low-dimensional projection and collaborative representation for discriminative classification’, In 2018 24th International Conference on Pattern Recognition, 2018, P.1493-1498. Liu X., Vijaya Kumar B., Yang C., Tang Q., and You J. ‘Dependency-aware attention control for unconstrained face recognition with image sets’, In Proceedings of the European Conference on Computer Vision, 2018, P.548-565. Liu X., Vijaya Kumar B., You J., and Jia P. ‘Adaptive deep metric learning for identity-aware facial expression recognition’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, P.20-29. Liu X., Zou Y., Che T., Ding P., Jia P., You J., and Kumar B. ‘Conservative wasserstein training for pose estimation’, In Proceedings of the IEEE International Conference on Computer Vision, 2019, P.8262-8272 Liu X., Zou Y., Kong L., Diao Z., Yan, J., Wang J., Li, S., Jia P., and You J. ‘Data augmentation via latent space interpolation for image classification’, In 2018 24th International Conference on Pattern Recognition, 2018, P.728-733.
184 Recognition and Perception of Images Liu X., Zou Y., Song Y., Yang C., You J., and K Vijaya Kumar B. ‘Ordinal regression with neuron stick-breaking for medical diagnosis’, In Proceedings of the European Conference on Computer Vision, 2018, Р.308-329. Liu Y., Wang Z., Jin H., and Wassell I. ‘Multi-task adversarial network for disentangled feature learning’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, P.3743-3751. Liu Y., Wei F., Shao J., Sheng L., Yan J., and Wang X. ‘Exploring disentangled feature representation beyond face identification’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, P.2080-2089. Liu Z., Luo P., Wang X., and Tang X. ‘Deep learning face attributes in the wild’, in ‘Book Deep learning face attributes in the wild’ In Proceedings of the IEEE International Conference on Computer Vision, 2015, P.3730-3738. Louizos C., Swersky K., Li Y., Welling M., and Zemel R. ‘The variational fair autoencoder’, arXiv preprint arXiv:1511.00830, 2015. Lu Y., Kumar A., Zhai S., Cheng Y., Javidi T., and Feris R. ‘Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, P.5334-5343. Lucey P., Cohn J.F., Kanade T., Saragih J., Ambadar Z., and Matthews I. ‘The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression’, In IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, 2010, P.94-101. Maaten L.v.d., and Hinton G.J. ‘Visualizing data using t-SNE’, Journal of Machine Learning Research, 2008, 9, (Nov), P.2579-2605. Makhzani A., Shlens J., Jaitly N., Goodfellow I., and Frey B. ‘Adversarial autoencoders’, arXiv preprint arXiv:1511.05644. 2015. Mao Q., Rao Q., Yu Y., and Dong M. ‘Hierarchical Bayesian theme models for multipose facial expression recognition’, IEEE Transactions on Multimedia. 2016, 19, (4), P.861-873 Mathieu M.F., Zhao J.J., Zhao J., Ramesh A., Sprechmann P., and LeCun Y. ‘Disentangling factors of variation in deep representation using adversarial training’, In Advances in Neural Information Processing Systems. 2016, P.5040-5048. Mollahosseini A., Chan D., and Mahoor M.H. ‘Going deeper in facial expression recognition using deep neural networks’, In 2016 IEEE Winter Conference on Applications of Computer Vision, 2016, P.1-10. Mollahosseini A., Hasani B., Salvador M.J., Abdollahi H., Chan D., and Mahoor M.H. ‘Facial expression recognition from world wild web’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, P.58-65. Netzer Y., Wang T., Coates A., Bissacco A., Wu B., and Ng A.Y. ‘Reading digits in natural images with unsupervised feature learning’, NIPS Workshop on Deep Learning and Unsupervised Feature Learning. 2011.
Disentanglement For Discriminative Visual Recognition 185 Ng H.-W., Nguyen V.D., Vonikakis V., and Winkler S. ‘Deep learning for emotion recognition on small datasets using transfer learning’, In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. 2015, P.443-449. Pantic M., Valstar M., Rademaker R., and Maat L. ‘Web-based database for facial expression analysis’, In IEEE International Conference on Multimedia and Expo, 2005, 5p. Peng P., Tian Y., Xiang T., Wang Y., Pontil M., Huang T. ‘Joint semantic and latent attribute modelling for cross-class transfer learning’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40, (7), P.1625-1638. Peng X., Yu X., Sohn K., Metaxas D.N., and Chandraker M. ‘Reconstruction-based disentanglement for pose-invariant face recognition’, In Proceedings of the IEEE International Conference on Computer Vision, 2017, P.1623-1632. Perarnau G., Van De Weijer J., Raducanu B., and Álvarez J.M. ‘Invertible conditional gans for image editing’, arXiv preprint arXiv:1611.06355. 2016. Sharmanska V., Quadrianto N., and Lampert C.H. ‘Augmented attribute representations’, in ‘Book Augmented attribute representations’, In European Conference on Computer Vision, P.242-255. Shen J., Zafeiriou S., Chrysos G.G., Kossaifi J., Tzimiropoulos G., and Pantic M. ‘The first facial landmark tracking in-the-wild challenge: Benchmark and results’, In Proceedings of the IEEE International Conference on Computer Vision Workshops 2015, P.50-58. Shi H., Yang Y., Zhu X., Liao S., Lei Z., Zheng W., and Li S.Z. ‘Embedding deep metric for person re-identification: A study against large variations’, In European Conference on Computer Vision, 2016, P.732-748. Simonyan K., and Zisserman A. ‘Very deep convolutional networks for large-scale image recognition’, arXiv preprint arXiv:1409.1556, 2014. Sohn K. ‘Improved deep metric learning with multi-class n-pair loss objective’, in In Advances in neural information processing systems, 2016, P.857-1865. Sun B., Li L., Zhou G., Wu X., He J., Yu L., Li D., and Wei Q. ‘Combining multimodal features within a fusion network for emotion recognition in the wild’, In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, P.497-502. Sun Y., Chen Y., Wang X., and Tang X. ‘Deep learning face representation by joint identification-verification’, In Advances in Neural Information Processing Systems 2014, P.1988-1996. Sun Y., Ren L., Wei Z., Liu B., Zhai Y., and Liu S. ‘A weakly supervised method for makeup-invariant face verification’, Pattern Recognition 2017, 66, P.153-159. Tenenbaum J.B., and Freeman W.T. ‘Separating style and content with bilinear models’, Neural Computation, 2000, 12, (6), P.1247-1283. Theis L., Oord A.v.d., and Bethge M. ‘A note on the evaluation of generative models’, arXiv preprint arXiv:1511.01844, 2015. Tian Y.-L., Kanade T., and Cohn J.F. ‘Facial expression analysis’, In Handbook of Face Recognition eds. Stan Z. Li and Anil K. Jain. Springer, 2005, P.247-275.
186 Recognition and Perception of Images Tian Y., Peng X., Zhao L., Zhang S., and Metaxas D.N. ‘CR-GAN: learning complete representations for multi-view generation’, arXiv preprint arXiv:1806.11191. 2018. Tishby N., and Zaslavsky N. ‘Deep learning and the information bottleneck principle’, In 2015 IEEE Information Theory Workshop, 2015, P.1-5. Tzeng E., Hoffman J., Saenko K., and Darrell T. ‘Adversarial discriminative domain adaptation’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, P.7167-7176. Tzimiropoulos, G., and Pantic M. ‘Optimization problems for fast aam fitting in-the-wild’, In Proceedings of the IEEE International Conference on Computer Vision, 2013, P.593-600. Wang Q., Zuo W., Zhang L., and Li P. ‘Shrinkage expansion adaptive metric learning’, In European Conference on Computer Vision 2014, P. 456-471. Xiao T., Hong J., and Ma J. J.a.p.a.: ‘Dna-gan: Learning disentangled representations from multi-attribute images’, arXiv preprint arXiv:1711.05415, 2017. Xie Q., Dai Z., Du Y., Hovy E., and Neubig G. ‘Adversarial invariant feature learning’, in ‘Book Adversarial invariant feature learning’ arXiv preprint arXiv:1605.09782. 2017, P.585-596. Yang C., Liu X., Tang Q., and Kuo C.-C. ‘Towards Disentangled Representations for Human Retargeting by Multi-view Learning’, arXiv preprint arXiv:1912.06265. 2019. Yang C., Song Y., Liu X., Tang Q., and Kuo C.-C. ‘Image inpainting using blockwise procedural training with annealed adversarial counterpart’, arXiv preprint arXiv:1803.08943, 2018. Yao A., Shao J., Ma N., and Chen Y. ‘Capturing au-aware facial features and their latent relations for emotion recognition in the wild’, In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, P.451-458. Yi D., Lei Z., Liao S., and Li S.Z. ‘Learning face representation from scratch’, arXiv preprint arXiv:1411.7923. 2014. Yu Z., and Zhang C. ‘Image based static facial expression recognition with multiple deep network learning’, In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction 2015, P.435-442. Yüce A., Gao H., and Thiran J.-P. ‘Discriminant multi-label manifold embedding for facial action unit detection’, In 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2015, P.1-6. Zafeiriou S., Papaioannou A., Kotsia I., Nicolaou M., and Zhao G. ‘Facial Affect In-The-Wild’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, P.36-47. Zellinger W., Grubinger T., Lughofer E., Natschläger T., and Saminger-Platz S. ‘Central moment discrepancy (cmd) for domain-invariant representation learning’, arXiv preprint arXiv:1702.08811. 2017.
Disentanglement For Discriminative Visual Recognition 187 Zhang L., Tjondronegoro D., and Chandran V.J.N. ‘Random Gabor based templates for facial expression recognition in images with facial occlusion’, Neurocomputing 2014, 145, P.451-464. Zhang X., Zhou F., Lin Y., and Zhang S. ‘Embedding label structures for finegrained feature representation’, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, P.1114-1123. Zong Y., Zheng W., Huang X., Yan J., and Zhang T. ‘Transductive transfer lda with riesz-based volume lbp for emotion recognition in the wild’, In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, P.491-496.
6 Development of the Toolkit to Process the Internet Memes Meant for the Modeling, Analysis, Monitoring and Management of Social Processes Margarita G. Kozlova*, Vladimir A. Lukianenko and Mariia S. Germanchuk V.I. Vernadsky Crimean Federal University, Taurida Academy, Faculty of Mathematics and Computer Science Theory, Simferopol, Russia
Abstract
Complex networks are characterized by high-dimensional arrays: graphs containing tens of thousands of vertices and more. Such networks model various applied problems, in particular problems of social environment research. The most vivid dynamics in a social network is produced by the circulation of Internet meme images. Projecting the problematics of social network research onto Internet memes calls for more precise definitions of notions, statements of problems, and the development of methods, algorithms and technologies covering the whole complex of problems relating to distribution, management and prediction in the social network. The chapter covers in depth the developed algorithms for recognizing the text on meme images; the construction of a meme base and its tagging; the design of a system that detects new memes; and the modeling of the meme "life cycle" in the social network. On the basis of the presented technology, a system of social polling and forecasting is developed that is to work in real time on an ever-expanding base of Internet meme classes.
Keywords: Internet meme, social network, image recognition, meme life cycle
*Corresponding author: [email protected]
6.1 Introduction
The analysis of social networks is being actively developed all over the world [Gubanov et al., 2010]. Social networks are a source of data on the life and interests of real people, and heightened interest in Internet data is shown by many companies and research centers. Specialists use the bulk data of social networks to model economic, political and social processes and to develop mechanisms, tools and technologies for influencing these processes [Gubanov et al., 2010]. Within the context of these approaches it is possible to speak of a convergent science of memetics. Memetics as a science has only recently taken shape, but its methods and algorithms are already used in other branches of science. Its origin in the social sciences accounts for the ambiguity of its concepts, definitions and terms; the formalization of memetics problems is possible only if the application environment is limited [Poliakov, 2009]. The adopted definitions should lead to an intelligible statement of the problems, methods and algorithms of the specific application. We will use the following hierarchy of concepts: 1) social environment; 2) social networks; 3) network community; 4) social Internet networks; 5) Internet memes. The dynamics of processes in social Internet networks reflects the specifics of processes in the social environment as a whole, and memes and Internet memes make this dynamics visible. Let us give the definition of a social network from [Gubanov et al., 2010]: "On the qualitative level, a social network is a social structure consisting of numerous agents (subjects, individual or collective: individuals, families, groups, organizations) and a set of relations defined on them (the totality of relations between agents, e.g., acquaintance, cooperation, communication)." The network communities of online social networks may embrace either the whole network or part of it. Communities form according to geographical, professional and other thematic criteria, producing localized interactive communities of members of the social network that use and generate a common information resource (for example, a resource relating to the development, storage, processing, distribution and monitoring of Internet memes). Such a resource reflects the processes of network self-development, resource mobilization and evolution. Communications of various kinds (publications, messages, account information such as personal registration data, interests, groups and demography), as well as information about the relationships of agents,
represented in the form of a social graph that reflects the interaction of agents processing Internet memes, all arise from the active actions of network agents. Of special interest are Internet memes as communication units combining verbal and non-verbal information in the form of a picture and a text. Internet memes rely on the commitment of Internet users: they are meant to surprise, to provoke interest, and to drive the recipient to spread the obtained information over the network. There is a great variety of definitions of the Internet meme, depending on the author and the application environment. The term "meme" is credited to Richard Dawkins (within the scope of the reductionist approach, which studies memes as replicators of cultural information, elementary units interacting with each other); such an approach is applied, in particular, by [Hofstadter, 1985]. Memes have also been identified through biological analogies, such as viruses of the mind [Brody, 2007]. In general, a meme is defined as a mechanism for the distribution of an idea; however, such loose definitions are not constructive. A number of works apply the semiotic approach to meme analysis. On this approach [according to Bart, 1989], the meme is regarded as a myth, and the form of the myth-concept as a loose definition of reality understood within a dedicated community; the meme as myth may present the views of different groups. Such approaches serve the analysis of the ideological functions of memes but do not allow studying the mechanisms of their emergence and distribution. The agent approach, based on the spreading of viruses in vivo, is applied by [Rashkoff, 2003]: media viruses (memes) change the perception of real events and allow state, corporate and other structures to steer events within communication media. A linguistic approach, which studies the form rather than the meme itself, is also in use; here the comic effect, the emotional impact and the combination of verbal and non-verbal elements promoting the communicative effect are of interest. Internet memes known as demotivators (demotivational posters [Ukhova, 2014]), which have the form of an image in a black frame with a caption below, became widespread in social networks. Demotivators serve different purposes: to structure social Internet environments and to organize communication in social networks. The Russian-language network VKontakte (more than 510 million registered users as of January 2020), with its rules of self-expression, is the most illustrative one. Wall posting, ratings, and the number of friends, followers and visitors create the virtual image of a particular user. The specific network language, with jargon, slang and different levels of user reflection, results in
the clustering of the network according to the degree to which demotivators are understood. The predictable interpretation of a demotivator meme depends on the structural components of the image and on the Internet community's adoption of a unique language. This language is visual and symbolic; it is primarily oriented toward a target audience with a high level of cultural interpretation (or toward the basic consumption needs of a community). The work of [Lutovina, 2016] describes the functions of demotivators and the objectives pursued by users, regarded from the points of view of both the consumer and the creator of demotivators. The politically charged Internet meme is a means of expressing political opinion and an indicator of political standing (17 political memes fall within the 100 most popular English-language Internet memes) [Kanashina, 2017]. The information packed into the flow of Internet memes is meant to spread influence within network communities. To study the process of meme distribution and its visualization, it is necessary to consider the semantic position, the textual aspect, word combinations, words and concepts as elements of the linguistic world-image [Zhang & Zakharov, 2020]. The notion of semantic fields is used in linguistics for this purpose [Kutuzov & Kuzmenko, 2017], [Zakharov, 2019]. The tags corresponding to memes are visualized in the form of a tag cloud, and changes within the tag cloud of selected communities characterize the process of Internet meme distribution. Finer analysis requires other approaches as well: the enormous volumes of information call both for simple algorithms able to work in real time and for a hierarchy of algorithms solving the problems of more precise analysis. The geographical reference of a meme to its distribution agent is possible in geosocial, i.e., location-based, networks; the corresponding tools use the GPS data of a mobile device and require access to information about the device's geoposition. Network adjacency is used to search for friends in the social network, to make recommendations and to obtain user behavioral patterns. A survey of approaches to network adjacency, and of applications using it to distribute social content, is given in [Namiot & Makarychev, 2020]; applications of this type may be used to study the process of meme distribution. The objective of this work is to develop a component of an intellectualized system designed to extract the text from the images of Internet memes, to classify them, to use various network metrics to correct the forecast of meme distribution, and to determine whether a given Internet meme belongs to a dedicated class. The Internet memes used in this work may be classed as viral images consisting of the image of some pattern together with a text. The
Russian-language social network VKontakte was selected as the model distribution environment.
6.2 Modeling of Internet Memes Distribution
Considering the meme as an information product, we will treat it as a system of signs, symbols, images and text. Following the principle of minimalism, we reduce the meme to its conceptual core: the minimal meme object in the form of a meme image, a specifically localized melody, a video, a symbol, signs of various types, etc. Such memes represent particular ideas, practices, technologies (actions), subjective views, productive and destructive ideological principles, stereotypes, and so on. To carry out structural modeling of the process of meme conversion (creation, transmission, reconstruction, usage, storage), we restrict ourselves to memes in the form of a local image (sign, symbol, pattern) with a minimal text. The meme structure is determined by the connection of its core with the network environment (the external environment); the network structure (a graph together with values assigned to its elements) determines the meme's usability. The mechanisms of meme distribution are viral in nature and thus well adapted to the spread of destructive ideology, so the control of distribution processes is a topical task; problems arise in all elements of Internet meme distribution systems. The terminology adopted in the theoretical study of memes mirrors the mathematical theory of information coding [Viner, 1958], but unlike it is not strictly formalized: it features incompleteness, uncertainty and redundancy, and is often incorrect. Coding and decoding here are not based on an alphabet and an unambiguous definition of signs but operate with signs that are understood ambiguously. In fact, no real coding appears to take place, and consequently what happens is not decoding but rather restoration of the sense of the meme image message; moreover, the sense depends on many parameters, in particular on the meme consumer. It is worth noting that the philosophical, philological and sociological approaches are not technological ones, since quantitative and qualitative measurable criteria and their evaluations are missing. An Internet meme is a container (framework) used to deliver the needed information to the user; such information serves consumer control. It is a part of mass culture that forms the reflexive (mindless) consumer. As young people are the main users of the Internet, meme technology is primarily focused on this active
audience. The development of creative art-object memes may resist destructive Internet memes. Meme creation and distribution is embodied in the meme product, by means of which social activity is accomplished. There are two types of social activity [Osgood et al., 1957]: communicative and strategic. Communicative activity aims at the achievement of mutual understanding, while strategic activity is associated with consumer manipulation in pursuit of one's own interests. Autogenous memes (those subject to the conversion process) are adapted like viruses: they arouse interest and fill an unresisting environment when ideologies and strong dedicated factors are missing. Strong virus memes are used by different groups to promote their interests and conceptions. The resource-based view of the Internet requires singling out the data flow on the network. Visual Internet memes thrive on this data flow: they are light, virus-like, and require little effort for clip perception. Streamed meme information is not turned into knowledge. Internet memes combine internally incompatible images that already exist on the network; importantly, the consumer is nevertheless left with the emotional feeling of acquiring knowledge. Memes are carriers of social viruses. The theory of viral information distribution is based on the use of network resources and on mathematical apparatus for modeling the process and finding out the ways information spreads. N. Hristakis and J. Fauler [Hristakis & Fauler, 2011] introduced the notion of "social contagion". The study of viruses in social networks is based, in particular, on the modeling of wave fronts. The launch of a virus is determined by the number of friends who liked the meme as well as by the number of memes; in other words, virus distribution depends on the number of consumers who repost it and on the frequency with which the meme occurs in the information space of a particular consumer or of the whole Internet community. A meme becomes viral if it is supported by an opinion leader who considers it topical, or when it becomes "bothering". Just as in neural networks, its distribution depends on threshold quantities and activation functions. An Internet meme has a life cycle, but it is practically impossible to determine at which stage the meme, like an economic unit, currently is; each position requires its own approach and the development of relevant mathematical models. Forecasting meme distribution is a difficult task, yet without solving it the control of meme flows is problematic. Owing to the nonlinearity of the Internet meme distribution problem, test memes play a certain role. Therefore, one of the tasks is to develop a base of memes and a base of knowledge reflecting all processes of Internet meme technology.
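Since the spread mechanism appeals to threshold quantities and activation functions, a minimal sketch of one such diffusion mechanism, a linear-threshold cascade on a small friendship graph, is given below. The graph, weights and thresholds are illustrative assumptions, not data from this study; C# is used to match the code elsewhere in the chapter.

using System;
using System.Collections.Generic;
using System.Linq;

class ThresholdCascade
{
    // One linear-threshold cascade: an agent activates (reposts the meme)
    // when weight * (number of already-active friends) reaches its threshold.
    static HashSet<int> Spread(Dictionary<int, List<int>> friends,
                               double[] threshold, double weight,
                               IEnumerable<int> seeds)
    {
        var active = new HashSet<int>(seeds);
        bool changed = true;
        while (changed)                          // iterate until the wave front stops
        {
            changed = false;
            foreach (var v in friends.Keys)
            {
                if (active.Contains(v)) continue;
                double influence = weight * friends[v].Count(active.Contains);
                if (influence >= threshold[v]) { active.Add(v); changed = true; }
            }
        }
        return active;
    }

    static void Main()
    {
        var friends = new Dictionary<int, List<int>>
        {
            [0] = new List<int> { 1, 2 },
            [1] = new List<int> { 0, 2, 3 },
            [2] = new List<int> { 0, 1, 3 },
            [3] = new List<int> { 1, 2, 4 },
            [4] = new List<int> { 3 }
        };
        var thresholds = new[] { 0.3, 0.3, 0.5, 0.6, 0.3 };
        var reached = Spread(friends, thresholds, weight: 0.34, seeds: new[] { 0 });
        Console.WriteLine("Agents reached by the meme: " + string.Join(", ", reached));
    }
}

Here an agent reposts the meme once the combined influence of its already-active friends reaches its personal threshold, which reproduces the wave-front behavior described above.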
Thus, to ensure the analysis, modeling and functioning of Internet memes, it is necessary to employ precedent-based factors, machine learning technology and big data processing technology. The analogy method is used within the genetic approach in the theory of Internet memes (see the survey of [Poliakov, 2009]). The dynamics of Internet meme distribution may be studied on the basis of specialized methods and tools for analyzing the behavior of network communities. Many memes on a network reflect events occurring on the net and in the real world, and many agents (both distributors/manufacturers and consumers) are connected with the memes. The meme conversion process is accompanied by informational data flows coming from the external social environment, either directly or as a result of the interpretation of external events by the network community. The agent component of the agent informational structure of the VK network is represented by account agents, which may be interconnected and respond jointly to external influence. Several memes may relate to one agent, and the agent's response may vary from one meme to another. This work proceeds from the provision that the model (training) sampling of memes must correspond to the complex structure of the network and be convenient for tracing the responses of network agents. Every meme has its own life cycle: the more that is known in the analyzed sampling about the moment of the meme's birth, the better for studying the dynamics of its distribution. Therefore, a set of newsworthy memes is distinguished. The most popular simple meme (an impulse meme) corresponds to a known event; bright events are accompanied by popular and widely distributed memes. An anticipated event (political, sporting or other) may be assigned a predictable meme which appears on the network. The request for such a meme must be restricted by several key words; moreover, such an impulse meme can be created beforehand and kept on standby, so that the beginning of its life cycle is known. For the algorithms tracing the distribution of a dedicated meme, the monitoring of the agent network is carried out only at given points in time. The spatiotemporal dynamics of the network at a specific time is characterized by the metric characterization of the meme's travel over the network (the arcs and vertices of the network graph). The generalized algorithm of analysis of Internet meme distribution and meme network formation consists of several stages: 1) composing key words peculiar to the dedicated class of memes on the basis of mention frequencies;
2) determining the time interval of the study and the parameters of monitoring discontinuity; 3) analysis of the most important test memes; 4) analysis of the graph structure of the network distribution of test memes and its visualization; 5) clustering of meme network communities with due regard to the dedicated test (experimental) memes; 6) analysis of the profiles of agents participating in the development and distribution of memes. A data group, whose integrity is analyzed in Kibana (Discovery) [www.elastic.co/kibana], is associated with every key word and image of a tested meme. Bot-network technologies are used for the distribution of Internet memes. Bots replicating content in the form of Internet memes make objective analysis of meme distribution over the network difficult; they create their own bot network. On the other hand, specialized bots may perform the whole volume of meme-analysis work when visiting social network communities. Bot agents work effectively according to strategies formulated for the target audience, and according to algorithms based on routing tasks in complex (in this case social) networks [Germanchuk et al., 2018], [Germanchuk et al., 2019]. Frequency analysis of Internet meme allocation, profiling of the relevant static and dynamic accounts, and visualization of graphs of the interconnections of agents, memes and users are used to analyze the meme and bot space in social networks. The graphic representation reflects account profiles consisting of different combinations of user agents, events, groups, pages and particular memes. The software often applied for the social network VKontakte comprises Elasticsearch, Kibana (Discovery, Visualize, Dashboard), a PHP script to download and process information through the VK API, and Tableau. Thus, there is an environment (in particular, the social network) for the distribution of Internet memes; the objects of this environment are memes with characteristics peculiar to the distribution environment. The task consists in developing and implementing tools to classify the objects of the targeted environment of Internet meme distribution and to forecast their spread. Owing to the openness of the viral meme images, there is no master list in which every meme is (or will be) described, and no Internet governance structure that indexes new memes according to some convention.
This means that a trained model cannot be taught every class which existed earlier or may appear in the future; therefore, adaptive models and algorithms for analysis and forecasting are required. The authors chose the social network VKontakte as the distribution environment. Records with a text and an image in the form of an Internet meme are considered the objects of this environment. For example, every Internet meme of interest will be classified as political or non-political. The expected system input is a reference to a meme in the social network VKontakte or a direct image with text. The final objective is a system which can recognize and correctly mark an Internet meme as well as forecast its distribution. Considering the specifics of Internet memes, this task cannot be solved by the usual algorithms. Solving it requires developing methods and algorithms, including a complex of algorithms for detection, recognition and classification of the image and the text, as well as for the usage of network metrics and clustering algorithms. To develop the specific tools for such a system, the following intermediate tasks have to be solved: 1) text extraction from the meme image; 2) classification of the text on the image; 3) usage of different network metrics to correct the classification of Internet memes and to develop a forecasting system of Internet meme distribution; 4) analysis of the data resulting from system usage; 5) clustering of the Internet meme base and its visualization; 6) study of the Internet meme life cycle.
6.3 Intellectualization of the System for Processing the Internet Meme Data Flow
Let us consider the task of meme life cycle research. The characteristics of the meme life cycle are important for studying the process of meme distribution. To collect information on the meme life cycle, it is necessary to obtain information about its topicality during a certain time interval. One way to obtain this information is to have it delivered by a search service on request (search services give priority to current information). Although the search results may not be the latest ones, they give an opportunity to form a true picture of meme distribution in
the Internet. The results of a Google search are used to obtain the meme sampling. To obtain Google search results automatically, it is necessary to consider the main principles of interaction with the Internet services to which Google search belongs. In the course of developing an Internet service, its creator or company may decide to provide a program interface (API) to ensure easy and friendly usage of the service by third-party developers. Unfortunately, Google search has no publicly visible API, so the search results have to be obtained from HTML pages. This part of the program runs as the following pipeline: 1) create the search query for Google; 2) get the HTML results page; 3) extract the information on images.
Creating a web query and obtaining the result by means of present-day software is not difficult. To get the search results in the form of images, the following URL query is used:
https://www.google.com/search?q={Query}&tbm=isch,
where {Query} is replaced by the desired query, such as "meme". The more difficult task is to extract the information on images from the received result. Here the main methods of creating web pages used in the contemporary Internet matter. The first method is the generation of pages on the server side and their sending to the client (web browser); it is the earlier way of generating pages and involves sending a larger data volume (e.g., in PHP). The second (later) method consists in creating a minimal HTML page on the server and then downloading the data to the client by means of additional queries; this method saves traffic and speeds up page loading. Google search uses a modified version of the first method: the querying client obtains a half-completed HTML document in which part of the information is filled in with the help of JavaScript (Figures 6.1, 6.2). Analysis of the query results showed that the main information on images (the reference to the image and the page from which it was obtained) is located at the end of the document in the form of a JavaScript object used later to display the page elements. Issuing the query and downloading the page is sketched below; after that, the chain of operations for extracting the information about images (listed after Figures 6.1 and 6.2) has to be carried out.
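For completeness, the first two pipeline steps (forming the URL and downloading the page) might be sketched as follows. This is a minimal illustration assuming the HtmlAgilityPack package for the HtmlDocument type used in Figure 6.1; the class and method names here are ours, not the project's.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class GoogleImageSearch
{
    static readonly HttpClient Client = new HttpClient();

    // Downloads the Google image-search results page for the given query
    static async Task<HtmlDocument> GetResultsPageAsync(string query)
    {
        // tbm=isch switches Google search to image mode
        var url = "https://www.google.com/search?q="
                  + Uri.EscapeDataString(query) + "&tbm=isch";

        // A browser-like User-Agent may be required for Google to return the full page
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.UserAgent.ParseAdd("Mozilla/5.0");
        var response = await Client.SendAsync(request);
        var html = await response.Content.ReadAsStringAsync();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;   // handed on to ParsePage from Figure 6.1
    }
}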
public IList<SearchImageBase> ParsePage(HtmlDocument doc)
{
    // The image metadata sits in a JavaScript array inside one of the last <script> tags
    var script = doc.DocumentNode.SelectNodes("//script").TakeLast(2).First().InnerText;
    var indexBegin = script.IndexOf('[');
    var indexEnd = script.LastIndexOf(']');
    var infoArray = script.Substring(indexBegin, indexEnd - indexBegin + 1);
    JArray jarray;
    try
    {
        // Index path into Google's data structure; brittle, tied to the current page layout
        jarray = JArray.Parse(infoArray)
            .SelectToken("[31][0][12][2]")
            .Value<JArray>();
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        throw;
    }
    return jarray.Select(e => GoogleImage.Parse(e.Value<JArray>()) as SearchImageBase)
                 .Where(e => e != null)
                 .ToList();
}

Figure 6.1 Selection of the necessary information on images.
public static GoogleImage Parse(JArray raw)
{
    var res = new GoogleImage();
    // Direct link to the image file and the address of the page it was found on
    res.Direct = raw.SelectToken("[1][3][0]")?.Value<string>();
    res.SourcePage = raw.SelectToken("[1][9].2003[2]")?.Value<string>();
    // Discard entries for which either link could not be extracted
    return (res.Direct == null || res.SourcePage == null) ? null : res;
}

Figure 6.2 Creation of objects with information on images.
Thus, the following chain of operations yields the information about images: 1) take the content of the script tag; 2) separate out the object with the data; 3) take the necessary array from it; 4) choose the information on the images. In the form of C# program code this looks as shown in Figures 6.1 and 6.2. In this way the information on all images returned for the search query is obtained. Further, the quantitative measures of a dedicated meme (or of a flow of Internet memes) staying in the information space of communities are traced. Let us now consider the implementation of the technology for studying the meme life cycle with the help of the developed web application FrontEnd.
The development of the software as a web application is motivated by the following reasons: 1) cross-platform operation: the web application is available on any device with any operating system, provided a web browser is installed; 2) a friendly user interface: contemporary interface-design technologies make it possible to implement a convenient, interactive, responsive and attractive user interface (UI) faster and more easily; 3) the opportunity to update the client side and the server side at any time convenient for the developer, and with less effort; 4) moving the computational load to the server-side web service: such heavy operations as web search and the storage and processing of bulk data are carried out on dedicated servers, which decreases the resource consumption of the end user's computer. Let us examine the structure of the FrontEnd web application. Within the scope of research carried out together with the social scientists of Crimean Federal University, a control sampling of images from the Internet was chosen in order to analyze the life cycle of Internet memes and to build a system of sociological surveys. This requires preliminary preparation of the available data using the developed software:
– to group the Internet memes by an expert group specializing in particular markers;
– to sample the Internet memes by the set markers;
– to create, save and execute expert queries, in the form of expressions of mathematical logic, for Internet meme sampling;
– to select areas on the image of an Internet meme so as to set the markers of that region.
The FrontEnd application is written in the TypeScript language, an extension of JavaScript. The use of plain JavaScript in rich applications increases the software development challenges, because this language features dynamic typing, which leads to unexpected errors when the code is executed. TypeScript differs from JavaScript in its definite static typing, its
support of full-blown classes (as in traditional object-oriented languages) and its support for module imports, all intended to increase development speed, simplify readability, refactoring and code reuse, and help find errors at the editing and compilation stage. Finally, TypeScript is compiled into JavaScript and can be executed both in the browser and on the NodeJS platform [https://nodejs.org/en/]. The ReactJS library [https://ru.reactjs.org/], which allows the creation of graphical user interfaces for web applications, was selected for UI design. React works with the Document Object Model (DOM) through a Virtual DOM, i.e., a virtual tree of web page elements: when changes are made in the Virtual DOM, the browser DOM is automatically updated to correspond to it. React allows designing interfaces in the paradigm of component-oriented programming: it is based on components, reusable independent blocks, each with its own state and functionality. The Redux library was chosen for convenient control of the application state. Redux describes itself as a predictable state container for JavaScript applications; it proposes thinking of the application as a state modified by a sequence of actions. React.js together with client-side Redux allows building an MVC application architecture. The MVC (model, view, controller) architecture assumes that the model is the single source of truth storing the whole state; the views are derived from the model and must be synchronized when the model changes its state. Applications written in React + Redux resemble a nondeterministic finite-state automaton. React.Router is used for application routing (routing inside the client-side application): the router determines which view is to be displayed to the user. With the help of the router it was possible to create a single-page application (SPA). This means that the web application uses one HTML document and implements client interaction by means of the dynamic loading of styles and scripts. The advantage of SPAs is that they are similar to native applications, except that they run within the web browser; page transitions are more seamless, which positively impacts the user experience (UX) and increases the speed of page response, because the application does not need to load a full HTML file. The interfaces are assembled by webpack, a JavaScript module build tool whose task is to bundle the JavaScript modules, CSS and HTML files into a single web application bundle. Additionally, the Babel.js tool is used; its task is to convert code written in the latest JavaScript standard (ECMAScript 6 is used in the
project) into JavaScript of an earlier standard supported by older browsers. The Bootstrap 4 library, designed by Twitter, is used to style the elements on the page; it simplifies styling because instead of writing custom CSS code one can use ready-made classes. The application manipulates the following entities. Area is a region marked on the image: a class containing information about the region, storing the image id in the base, the region id in the base, the region's coordinates and size, and the list of tags relating to it. Tag is a tag entity, represented by an object with the tag's id field in the base and its name. ImageInfo is the main entity uniting the other atomic entities; it represents an image in the application and aggregates such fields as id, image width and height, upload date, an array of image areas, an array of tags relating to the image, and a url (its location in the server file system). Query is a logical expression for sampling images from the database; the expression is represented by a JSON structure in the form of a tree, where each element has its type (logical operator or operand), a text value and an array of child elements. The components of the application are as follows. TagSuggestSelect is a select control with suggestions drawn from the database tags. This component is a wrapper over the Creatable component taken from the react-select library; before the component is created, the tags are converted from the Tag entity into TagSelectOption, which Creatable accepts as the array of possible values. Two methods are implemented in the component: onTagInputChange and onTagChange. The first is activated when the entry field changes, querying the API to resend the tags matching the search criterion. The second is activated on tag selection and carries out one of two actions: if the tag exists, it is selected from the list of existing ones; if not, it is added to the database. TList is a list of the entities passed to it, rendered in the form of a column. This block is used on the tag and query pages; via props it accepts an array of any objects with a name field and renders them. Moreover, a function may be passed to the list through props; it is activated when any element of the list is pressed. QueryBlockSuggestSelect is a suggestion-based select control for choosing the elements of the visual representation of a query for image sampling; it provides search by the names of tags, operators and saved queries.
QueryBlock is a block giving the visual presentation of an expert query. The parsing and JSON query rendering methods are implemented in it: the operators, tags and queries from the database are parsed in QueryBlockSuggestSelect, and a button for deleting an item from the query is added on the left. Modal is a popup window which overlays the main content of the page; through props it accepts the elements displayed in the title bar, body and footer of the modal window. ImageStrip is a block which displays images in the form of "tiles": it converts an array of ImageInfo entities into tiles and displays them on the screen. It also accepts the selected image through props and highlights its tile. If a deleteImage function is passed through props, a cross is rendered in the right corner of each tile, and the passed function is activated when this button is clicked. Additionally, clicking a tile activates the selectImage function, which may also be passed through props (Figure 6.3). ImageDetails is a block which displays the information on an image and carries out the main work with it. The upper part is the image itself, over which the ColoredRect component, used for manipulating the image areas, is rendered. The lower part is an area meant
Figure 6.3 Display of images in the form of tiles (ImageStrip).
for working with image tags: the left column is a TList with the image tags, and the right column is a TagSuggestSelect. Here one can both add existing tags and create new ones. ColoredRect is the block where the work with image areas is carried out. It accepts the array of image areas and displays them as semi-transparent grey rectangles listing the tags assigned to each area. A red circle in the upper-right corner deletes the area when clicked. When any area is clicked, the block is overlaid by a "veil" on which manipulations with the area's tags can be performed: tags existing in the base can be deleted or added, and new ones created. A new area is created by means of drag-and-drop. The block is based on the React-konva component, which wraps the HTML5 canvas for work with React and its data flow. Views. The user interface is divided into three views, each responsible for its part of the application's features. The first view is a form for loading images onto the server. The user clicks the "Select file" button and chooses an image located in local storage; the file is captured in the component state. After the "Upload image" button is clicked, the uploadImage action is dispatched and the image is sent to the server for storage. The second screen is the main window for working with the uploaded images. The leftmost column is a TList block, in which tags can be deleted and images filtered by a certain tag; there is also a button for displaying all images regardless of their tags. The middle column is the ImageStrip component. Clicking an image selects it and renders the rightmost component, ImageDetails, with the data on the selected image (Figure 6.4). The third view is the query page and consists of three blocks: TList, ImageStrip and QueryBlock. Here such operations with queries as creation, storage and search of images by a specific query can be carried out. The hardware-software part of the service (the back end) is written in the C# language, with MS SQL as the database system. The C# framework ASP.NET is used for the Web API implementation; the API is implemented in the REST style. ASP.NET Web API is currently the most up-to-date method of creating a REST service on the Microsoft technology stack. ASP.NET is responsible for creating the web server processing HTTP queries; moreover, it includes a great number of services ensuring routing inside the server-side application, the installation of controllers, and connection to the database.
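To make the above concrete, here is a minimal sketch of what one of the service's REST end points could look like in ASP.NET Core; the route, the DTO and the repository abstraction are illustrative assumptions rather than the project's actual code.

using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;

public record ImageInfoDto(int Id, int Width, int Height, string Url);

// Hypothetical data-access abstraction standing in for the EF Core context
public interface IImageRepository
{
    IEnumerable<ImageInfoDto> FindByTag(string tag);
}

[ApiController]
[Route("api/images")]
public class ImagesController : ControllerBase
{
    private readonly IImageRepository _repository;

    public ImagesController(IImageRepository repository) => _repository = repository;

    // GET api/images?tag=politics returns image metadata serialized as JSON
    [HttpGet]
    public ActionResult<IEnumerable<ImageInfoDto>> GetByTag([FromQuery] string tag)
        => Ok(_repository.FindByTag(tag));
}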
Figure 6.4 Display form of user interface.
Entity Framework Core (EF Core) is used to interact with the database. EF Core is an ORM library; ORM (Object-Relational Mapping) is a layer between the database and program code. Its task is to let the application work with the data not as tables but as classes of the object-oriented paradigm. EF Core achieves a higher level of abstraction when working with the objects and deals with the data regardless of the repository type. Additionally, the framework features an API to work with data and interact with the database. This API is universal for any DBMS supported by the framework; EF supports MS SQL Server, SQLite, PostgreSQL and MySQL. If it is decided to change the DBMS, the only required change is the database connection settings; the code interacting with the database stays the same (Figure 6.5).
The key entity in the application is the information on the image, which is stored in the table Images and has a one-to-many relationship with the table Areas and a many-to-many relationship with the table Tags. Moreover, the table Areas has a many-to-many relationship with the Tags table.
Controllers. The API consists of three controllers, each responsible for working with its entity. Every controller is divided into end points, each of which completes its task depending on the HTTP method and the requested URL. Practically all end points return data to the client in JSON format; however, some end points return files located on the server.
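To illustrate, a minimal sketch of how a client might interact with such a REST service from Python; the endpoint paths, port and JSON field names below are hypothetical, since the chapter does not list the exact routes:

import requests

BASE = "http://localhost:5000/api"  # hypothetical server address

# Upload an image as multipart form data, the usual shape of an ASP.NET upload endpoint.
with open("meme.jpg", "rb") as f:
    resp = requests.post(f"{BASE}/images", files={"file": f})
image = resp.json()  # JSON description of the stored ImageInfo entity

# Attach a tag to the uploaded image (hypothetical endpoint).
requests.post(f"{BASE}/images/{image['id']}/tags", json={"name": "politics"})

# Run a stored expert query and fetch the matching images as JSON.
matches = requests.get(f"{BASE}/queries/1/images").json()
print(len(matches), "images matched the query")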
Figure 6.5 Database diagram: the tables Images (Id, Width, Height, Path, Source, Comment), Areas (Id, ImageId, X, Y, Width, Height), Tags (Id, Name), Queries (Id, Name, ContentJson) and Trends (Id, Date), linked through the junction tables AreaTags (AreaId, TagId), ImageTags (ImageId, TagId) and ImageTrends (ImageId, TrendId).
6.4 Implementation of an Intellectual System for Recognition of the Internet Meme Data Flow
Since Internet memes are images containing a picture and text, it is necessary to build a recognition system (OCR) which detects the text in a noisy image and then extracts it. To extract the text from an Internet meme, the OCR must: 1) find the text; 2) pre-process the image that contains the text; 3) recognize the text.
Initially, the open-source text recognition software Tesseract, originally developed by Hewlett-Packard, was used to recognize the text. Tesseract recognizes text using the LSTM neural network [Hochreiter & Schmidhuber, 1997]. However, to get good recognition results it is necessary to first improve the quality of the image passed to Tesseract.
The images are processed in several steps. Checking the nesting of outlines and searching for child outlines makes it possible to detect and recognize both black text against a white background and vice versa. At this stage the outlines are merged into lines, and lines into text; the text lines are divided into words using the spacing. The second step is a two-stage text recognition process. At the first stage an attempt is made to recognize each word in turn; every word recognized by the classifier with a high confidence level is passed to an adaptive classifier as training data. The adaptive classifier is then able to recognize the remaining text more precisely.
However, it became clear during development that Tesseract cannot always extract text overlaid on an image, nor give satisfactory results when classifying the small phrases and word combinations peculiar to Internet memes. It therefore became necessary to design our own specialized OCR.
To solve the task of finding text in the image, the EAST algorithm (Efficient and Accurate Scene Text Detector) was selected [Tian et al., 2016]; it uses a fully convolutional neural network that makes decisions at the level of words and lines, and is distinguished by high precision and speed. The key component of the algorithm is a neural network model trained to directly predict the existence of text instances and their geometry in the source images. This model is a fully convolutional network adapted to text detection that produces per-pixel predictions of words and text lines. This excludes such sub-phases as candidate proposal, text area composition and word splitting; the subsequent processing stages include only thresholding and filtering of the predicted geometric shapes.
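As a sketch of this detector setup, the following Python fragment runs the pre-trained EAST model through OpenCV's DNN module and falls back on Tesseract for recognition. The model file name is an assumption (the publicly available frozen EAST graph), and the output layer names are those of that frozen graph; the score cut-offs are arbitrary:

import cv2
import pytesseract

# Load the publicly available frozen EAST graph (file path is an assumption).
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

img = cv2.imread("meme.jpg")
# EAST expects input dimensions that are multiples of 32.
blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
# Per-pixel text confidence scores and box geometry, as described in the text.
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

# A crude use of the score map: if enough pixels look like text,
# hand the whole image to Tesseract for recognition.
if (scores[0, 0] > 0.5).sum() > 20:
    print(pytesseract.image_to_string(img, lang="rus"))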
At the pre-processing stage, for images consisting mostly of text, it is necessary to keep only the text and remove the noise. The problem is that the text colour is unknown and the text may contain several tints of one colour. The pre-processing includes clustering of the image, building a mask that separates the text from the rest of the background, and determining the background colour.
A convolutional neural network is used as the symbol classification algorithm. Convolutional neural networks (CNN or ConvNet) belong to the class of deep neural networks typically applied to the analysis of visual patterns. CNNs are regularized versions of multi-layered perceptrons. Multi-layered perceptrons are usually fully connected networks, i.e., every neuron of one layer is connected with all neurons in the next layer; this "full connectivity" makes them prone to overfitting. Typical regularization consists of adding some form of weight-magnitude penalty to the loss function. CNNs take a different approach to regularization: exploiting the hierarchical structure of the data, more complex patterns are assembled from smaller and simpler ones. Thus, on the scale of connectivity and complexity, CNNs sit at the lower end.
A ConvNet can successfully capture the spatial and temporal dependencies of an image by means of the relevant filters. The architecture of a convolutional neural network fits image data better because of the lower number of parameters used and the possibility of reusing the weights. In other words, the network can be trained to better understand complex images [Schmidhuber, 2014].
The architecture of the convolutional neural network for letter classification is represented in Figure 6.6. The input data is an image in the RGB colour model, divided into three colour planes (red, green and blue). The role of the ConvNet is to convert the image into a form that is easier to process without losing the features that play a pivotal role in obtaining a good prediction.
A training sample of 59,567 objects divided into 37 classes (33 upper-case letters of the Russian alphabet and 4 lower-case letters «а», «б», «е», «ё») was used as the data; 1,814 fonts were used to create the sample. The sample is divided into a training part with 47,653 objects and a test part with 11,914 objects. The neural network was implemented in Python using the Keras library. The model of the convolutional neural network for letter classification was trained for 30 epochs, reaching an accuracy of 98.8% on the training data.
Figure 6.6 Architecture of the convolutional neural network for letter classification: three blocks of [3×3 convolution layer with linear activation → LeakyReLU activation layer with alpha = 0.1 → 2×2 pooling layer], taking a 50×50×3 input, followed by a flattening layer, a dense layer of dimension 128 with linear activation, and a softmax activation layer of dimension n = number of classes.
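A minimal Keras sketch of the architecture in Figure 6.6; the number of convolution filters and the optimizer are not stated in the chapter and are assumptions here:

from tensorflow.keras import layers, models

n_classes = 37  # 33 upper-case Russian letters + 4 lower-case letters

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation="linear",
                        padding="same", input_shape=(50, 50, 3)))  # filter count assumed
model.add(layers.LeakyReLU(alpha=0.1))
model.add(layers.MaxPooling2D((2, 2)))
for _ in range(2):  # two more conv/LeakyReLU/pooling blocks, as in Figure 6.6
    model.add(layers.Conv2D(32, (3, 3), activation="linear", padding="same"))
    model.add(layers.LeakyReLU(alpha=0.1))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation="linear"))
model.add(layers.Dense(n_classes, activation="softmax"))

model.compile(optimizer="adam",  # optimizer assumed
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30) on the 47,653 training objects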
At the stage of line recognition and of combining recognized letters into words, the centres of mass of the extracted outlines are found. The main lines making up the text are then built. If the upper and lower points of an outline lie above and below a text line, the outline is assigned to that line. If the distance between outlines is larger than the average distance multiplied by a specified coefficient, there is a white space between them. This not only defines the proper processing order of the symbols but also removes noise which could be left after image processing. Finally, the outlines are united into words according to the intervals between the letters. The pseudocode of the text recognition program is written as:

Algorithm
input:
    sample Xm = {x1, …, xm}
    k – number of clusters
    threshold – inter-symbol distance threshold
output:
    output_text – image text

1. Image clustering
2. Mask building
3. Letter contours extraction
lines = []
for cnt in contours:
    if moment of cnt not in lines then add cnt to a line
for line in lines:
    mean_dist = mean_distance_between_contours
    for i = 1 to len(line) − 1 do:
        if dist(cnt[i], cnt[i+1]) > mean_dist + threshold then
            separate cnt[i] and cnt[i+1] by a space
result = ""
for line in lines do:
    for cnt in line do:
        result += Classification(cnt)
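A compressed Python sketch of steps 1–3 and the word grouping, assuming OpenCV for the clustering and contour extraction; the rule for picking the text cluster (the brightest one) is an assumption:

import cv2
import numpy as np

def extract_text(img, k=3, threshold=5):
    # 1. Cluster the pixel colours (small k assumed) to isolate the text tint.
    data = img.reshape(-1, 3).astype(np.float32)
    _, labels, centers = cv2.kmeans(data, k, None,
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0),
        3, cv2.KMEANS_RANDOM_CENTERS)
    # 2. Build a mask from the brightest cluster (assumed to be the text).
    text_cluster = int(np.argmax(centers.sum(axis=1)))
    mask = (labels.reshape(img.shape[:2]) == text_cluster).astype(np.uint8) * 255
    # 3. Extract letter contours and order them left to right.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = sorted(cv2.boundingRect(c) for c in contours)
    # Insert a space wherever the gap exceeds the mean gap plus the threshold.
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2]) for i in range(len(boxes) - 1)]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0
    words = []
    word = [boxes[0]] if boxes else []
    for i, gap in enumerate(gaps):
        if gap > mean_gap + threshold:
            words.append(word)
            word = []
        word.append(boxes[i + 1])
    if word:
        words.append(word)
    return words  # each word is a list of letter boxes for the classifier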
The designed OCR highlights the text properly and classifies short words well. However, problems arose during development when classifying large volumes of text. Therefore, the final version of the program for text
extraction from the image includes two stages. At the first stage, the EAST algorithm extracts the text blocks. If there are numerous blocks forming a large group, Tesseract is used to recognize them; if there are few text blocks, the approach to text extraction described above is applied (Figures 6.7, 6.8). The pseudocode of the final algorithm for extraction and recognition of Internet meme text:

input: img
n = minimal number of words
textblocks = EAST(img)    ◃ join all text blocks that intersect
if len(textblocks) > n then
    txt = Tesseract(textblocks)
else
    txt = TEXTEXTRACTOR(textblocks)
show txt
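A sketch of this dispatch in Python, composing the fragments above; detect_text_blocks_east and classify_word_boxes are hypothetical names standing for the EAST detector and the letter classifier sketched earlier:

import pytesseract

def recognize_meme_text(img, n=10):  # n, the minimal number of words, is assumed
    blocks = detect_text_blocks_east(img)  # hypothetical: EAST boxes, overlaps merged
    if len(blocks) > n:
        # Many blocks forming a large group: Tesseract copes better with long text.
        return "\n".join(pytesseract.image_to_string(b, lang="rus") for b in blocks)
    # Few blocks: use the custom extractor tuned for short meme captions.
    return "\n".join(classify_word_boxes(extract_text(b)) for b in blocks)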
Figure 6.7 Main steps of the Russian text recognition algorithm.
Figure 6.8 Work of algorithm recognizing the English text.
Let us analyze the classification stage for the text extracted in the course of Internet meme recognition. The social network VKontakte was selected as the environment of Internet meme distribution. The classification objects are the records obtained at the text extraction stage; the records contain the text, the meme image and comments. The objects of the selected distribution environment are classified as "political" and "non-political"; the object class is determined by the class of the record text, the Internet meme text and the comments text. Thus, it is necessary to build a classifier which can determine whether a text belongs to the political class or not.
As VKontakte is mainly oriented toward the Russian-speaking population, the work initially concerned memes containing Russian text. A sample consisting of 63 political and 44 non-political memes was collected for this task; comment texts taken from VKontakte, with both political and non-political content, were used as additional objects. In total, 208 sentences were used for the analysis. A sample consisting of 16 political and 10 non-political memes was set aside for validation. Before classifier training, the data were normalized by a stemming algorithm and converted into vector form. Considering the small size of the training data, the support vector machine was selected as the classification algorithm [Vorontsov, 2011].
Word clouds were built for both classes (Figures 6.9 and 6.10). They show that the class of "political memes" contains images with the names of countries and regions as well as the names of political leaders, whereas the class of "non-political memes" has no clearly distinguished group of words, because the sentences of this class cover various themes. The classifier accuracy on the test sample was 90.25%; it will improve as the amount of training data grows.
Thirty-three political and 24 non-political English memes were also collected in the course of preparing this article; in total, 112 sentences were analyzed. A sample consisting of 11 political and 7 non-political memes was set aside for validation. The English memes were taken not from VKontakte but selected with the help of the Google search engine. The word clouds of both classes of English memes are shown in Figures 6.11 and 6.12. It is interesting that here the classifier accuracy on the test sample was 94.05%.
Let us consider the process of data collection and analysis by means of the system for processing and analyzing Russian Internet memes. Five groups of the social network VKontakte were selected for analysis; the designed toolkit extracted 43 political memes.
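A minimal scikit-learn sketch of the classifier described above (Russian stemming, vectorization, then a support vector machine); the placeholder texts, the TF-IDF vectorizer and the SVM parameters are assumptions:

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stemmer = SnowballStemmer("russian")

def normalize(text):
    # Stemming normalizes the word forms before vectorization, as in the chapter.
    return " ".join(stemmer.stem(w) for w in text.lower().split())

texts = ["пример политического текста", "пример бытовой шутки"]  # placeholder sentences
labels = [1, 0]  # 1 = political, 0 = non-political

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # SVM suits the small sample
clf.fit([normalize(t) for t in texts], labels)
print(clf.predict([normalize("новый политический мем")]))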
Figure 6.9 Cloud of Russian words specifying the objects of non-political class.
Figure 6.10 Cloud of Russian words specifying the objects of political class.
The lists of participants who assessed the selected records were obtained after the list of political memes; the cities where these participants live were put on another list. In total, 66,184 participants were gathered. From the list of cities a pie chart representing the most politically active regions of Russia was constructed. The most active regions are Moscow, Saint Petersburg, Yekaterinburg, Novosibirsk and Krasnodar. The cities where the rate of
Figure 6.11 Cloud of English words specifying the objects of non-political class.
Figure 6.12 Cloud of English words specifying the objects of political class.
occurrence is less than 0.4% were put in a separate category called "another". To unite the cities with the same political activity, the k-means algorithm was applied to the obtained data [Vorontsov, 2011]. The data were divided into five clusters. The first cluster contains the cities with minimal interest in political memes: Korolev, Pskov, Nizhnevartovsk,
Blagoveshchensk, Engels and Taganrog; Moscow also fell into the first cluster. The second cluster includes Yekaterinburg, Novosibirsk, Krasnodar, Rostov-on-Don, Nizhny Novgorod, Chelyabinsk, Perm and Samara. The third cluster is Saint Petersburg alone. The fourth cluster: Tomsk, Minsk, Saratov, Tyumen, Kaliningrad, Vladivostok, Yaroslavl and Irkutsk (Figure 6.13).
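The clustering step can be sketched with scikit-learn; each city's feature here is its share of politically active participants from Figure 6.13 (the last value, for a low-activity city, is an assumption), and the number of clusters is five, as in the text:

import numpy as np
from sklearn.cluster import KMeans

cities = ["Moscow", "Saint Petersburg", "Yekaterinburg", "Novosibirsk", "Pskov"]
activity = np.array([[20.1], [13.4], [3.6], [1.9], [0.2]])  # % shares; last value assumed

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(activity)
for city, label in zip(cities, km.labels_):
    print(city, "-> cluster", label)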
Figure 6.13 Frequency diagram for cities by political activity: Moscow 20.1%, Saint Petersburg 13.4%, Yekaterinburg 3.6%, Novosibirsk 1.9%, Krasnodar 1.7%, Rostov-on-Don 1.5%, Nizhny Novgorod 1.5%, Chelyabinsk 1.3%, Perm 1.3%, Samara 1.3%, Krasnoyarsk 1.2%, Voronezh 1.2%, Kazan 1.1%, Ufa 1.0%, Volgograd 1.0%, Omsk 1.0%, Tomsk 0.8%, Minsk 0.8%, Saratov 0.8%, Tyumen 0.7%, Kaliningrad 0.7%, Vladivostok 0.7%, Yaroslavl 0.6%, Irkutsk 0.5%, Izhevsk 0.5%, Tula 0.5%, Barnaul 0.5%, Tver 0.5%, Ryazan 0.4%, Ulyanovsk 0.4%.
6.5 Conclusion
The algorithms and components of an intellectualized system of Internet meme image processing were presented, including the implementation of an image information request system and of a technology for studying the meme life cycle (the front-end application). Algorithms and software for text extraction from images, text classification and data clustering were developed, and a comparative analysis of different text and image classification algorithms was carried out. The algorithms of text extraction from the image and their visualization are implemented:

– identification of text by means of the EAST algorithm;
– grouping of images with text;
– pre-processing of images with text;
– marking of text lines on the images;
– extraction of letter contours;
– letter classification;
– uniting of letters into text.
A system for processing and analyzing the Internet meme flow was developed. The disadvantage of the model is that it showed only 90% accuracy when determining whether a record belongs to the political memes or not. This is due to the small set of training data, the lack of image pattern classification and the mistakes made in the course of text extraction from the images. To enhance the quality of classification it is necessary to increase the amount of training data, improve the text extraction algorithm and implement an image recognition algorithm. The results are partially presented in [Germanchuk et al., 2018] and [Germanchuk et al., 2019]. In the course of further development, involving the use of network metrics, it is planned to implement algorithms which can forecast whether a meme will become popular or not.
Acknowledgment
This work was supported by the Russian Foundation for Basic Research, grant 20-011-31460\20, entitled «Political meme in virtual communications: creating a software product for monitoring the spread and impact (analysis of the Russian-language segment of the Internet)».
References
Bart R. Izbrannie raboty. Semiotika. Poetika [Selected texts. Semiotics. Poetics]. Translated from French; author, general editor and foreword by G.K. Kosikova. М.: Progress, 1989. 616 p.
Batura T. Modeli i metody analiza kompiuternykh sotsialnykh setei [Models and methods to analyze computer social networks] // Software Programs and Systems. No. 3(103), 2013. P. 130-137. doi:10.15827/0236-235Х
Beniamin V. Proizvedenie iskusstva v epokhu ego tekhnicheskoi vosproizvodimosti [The work of art in the age of its technical reproducibility]. М.: MEDIUM, 1996. 125 p.
Bernovsky М.М., Kuziurin N.N. Sluchainye grafy, modeli i generatory bezmasshtabnykh grafov [Random graphs, models and generators of scale-free graphs] // Proceedings of the Institute of System Programming of the Russian Academy of Sciences. 2012. V. 22. P. 419-432.
Brody R. Psikhicheskie virusy. Kak programmiruyut vashe soznanie [Psychic viruses. How your consciousness is programmed]. М.: Pokolenie, 2007. 304 p. http://www.uhlib.ru/psihologija/psihicheskie_virusy/index.php
Chzhan P., Zakharov V.P. Kompiuternaya vizualizatsiya russkoi yazykovoi kartiny mira [Computer visualization of the Russian linguistic world image] // International Journal of Open Information Technologies. 2020. DOI: 10.24412/FfNhqoIQLC0
Germanchuk М.S., Kozlova М.G., Lukianenko V.А. Problematika modelirovaniya protsessov rasprostranenia internet memov [Problems of modeling of Internet meme distribution processes] // Analysis, modeling, management and development of social and economic systems: collection of research papers of the XII International School-Symposium 'AMMD-2018', Simferopol-Sudak, September 14-27, 2018 / Ed. A.V. Sigal. Simferopol: IP Kornienko А.А., 2018. P. 136-139.
Germanchuk М.S., Kozlova М.G., Lukianenko V.А., Pivivar А.Ye. Razrabotka instrumentariya obrabotki i analiza potoka internet memov [Development of a toolkit for processing and analysis of the Internet meme flow] // Collection of research papers of the All-Russian Research and Training Conference MIKMO-2019 and the Taurian Scientific Conference School of Students and Young Professionals in Mathematics and Informatics / Ed. V.A. Lukianenko. Simferopol: IP Kornienko А.А., 2019. Edition 1. P. 121-127.
Germanchuk М.S., Kozlova М.G. Raspoznavanie, analiz i vizualizatsiya internet memov [Recognition, analysis and visualization of Internet memes] // Mathematical methods of pattern recognition: Scientific abstracts of the 19th All-Russian Conference with international participation, Moscow, 2019. М.: Russian Academy of Sciences, 2019. P. 351-355.
Germanchuk М.S., Kozlova М.G., Lukianenko V.А. Osobennosti razrabotki intellektualnoi sistemy obrabotki potoka internet memov [Peculiarities of development of an intellectual system for Internet meme flow processing] // Collected book 'Distance Learning Technologies', Proceedings of the IV All-Russian Research and Training Conference (with international participation) / Main editor V.N. Taran. Simferopol, 2019. P. 258-265. (RSCI)
Gorshkov S. Vvedeniye v ontologicheskoye modelirovaniye [Introduction to ontology-based modeling]. Ekaterinburg: Trinidata, 2016. 165 p. https://trinidata.ru/files/SemanticIntro.pdf
Gubanov D.А., Novikov D.А., Chkhartishvili А.G. Sotsialniye seti: modeli informatsionnogo vliyaniya, upravleniya i protivoborstva [Social networks: models of informational effect, management and confrontation]. М.: Fizmatlit, 2010. 228 p.
Hofstadter D.R. Metamagical Themas: Questing for the Essence of Mind and Pattern. Basic Books, 1985. 852 p. http://archive.org/details/MetamagicalThemas
Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. 9(8). P. 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
Kanashina S.V. Effekt obmanutogo ozhidaniya v internet memah kak osobaya kommunikativnaya strategiya [Effect of failed expectations in Internet memes as a special communicative strategy] // TSPU Bulletin, 2017. 10(187). P. 9-14. DOI: 10.23951/1609-624X-2017-10-9-14
Khristakis N., Fowler D. Sviazanniye odnoi setiyu: kak na nas vliyayut liudi, kotorykh my nikogda ne videli [Connected by one network: How people we never see affect us]. М.: United Press LLC, 2011. 361 p.
Kibana. Your window into the Elastic Stack. https://www.elastic.co/kibana
Kutuzov A., Kuzmenko E. WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models // Communications in Computer and Information Science. 2017. 661. P. 155-161. DOI: 10.1007/978-3-319-52920-2_15
Lutovinova O.V. Demotivator kak vid setevogo tvorchestva [Demotivator as a type of network art] // Bulletin of Volgograd State University. Series 2, Study of Language. 2016. V. 15. No. 3. P. 28-31. DOI: 10.15688/jvolsu2.2016.3.3
Namiot D.Ye., Makarychev I.P. Ob alternativnoi modeli otmetki mestopolozheniya v sotsialnykh setiakh [On an alternative model to identify the position in social networks] // International Journal of Open Information Technologies. ISSN: 2307-8162. V. 8, No. 2, 2020. P. 74-90.
Newman M. A Measure of Betweenness Centrality Based on Random Walks // Social Networks, 2005. 27. P. 39-54. http://dx.doi.org/10.1016/j.socnet.2004.11.009
Osgood Ch.E., Suci G.J., Tannenbaum P.H. The Measurement of Meaning. University of Illinois Press, 1957. 344 p.
Poliakov Ye.М. Kibernetika, memetika i teoriya massovoi kommunikatsii: obzor yestestvennonauchnykh podkhodov k problemam sotsiologii [Cybernetics, memetics and mass communication theory: survey of natural-science approaches to sociology problems] // Man. Community. Management. 2009. No. 3. P. 32-41.
Rashkoff D. Mediavirus. Kak pop-kultura taino vozdeistvuyet na nashe soznaniye [Rushkoff D. Media Virus! Hidden Agendas in Popular Culture]. М.: Ultra.Cultura, 2003. 368 p.
Rogushina Yu.V. Ispolzovaniye ontologicheskoi modeli pri semanticheskom poiske informatsionnykh obiyektov [The usage of an ontology-based model in the semantic search of information objects] // Designing Ontology. V. 5, No. 3(17), 2015. doi:10.18287/2223-9537-2015-5-3-336-356
Schmidhuber J. Deep Learning in Neural Networks: An Overview // Neural Networks. 2014. http://dx.doi.org/10.1016/j.neunet.2014.09.003
Sedakov A.A., Zhen M. Opinion dynamics game in a social network with two influence nodes // Bulletin of Saint Petersburg State University. Applied Mathematics. Informatics. Management Processes. 2019. V. 15. Edition 1. P. 118-125. https://doi.org/10.21638/11702/spbu10.2019.109
Social Network Analysis: How to guide. Open Government Licence v3.0. https://assets.publishing.service.gov.uk
Tian Z., Huang W., Tong H., He P., Qiao Y. Detecting Text in Natural Image with Connectionist Text Proposal Network. 2016. 9912. P. 56-72. DOI: 10.1007/978-3-319-46484-8_4
Ukhova L.V., Baslina Ye.Yu. Demotivatsionny poster kak zhanr setevogo yumora [Demotivational poster as a genre of network humour] // Yaroslavl Pedagogical Bulletin. 2014. No. 1. Volume 1 (Humanities). P. 135-140.
Viner N. Kibernetika i obshchestvo [Cybernetics and society]. М.: Foreign Literature, 1958. 200 p.
Vorontsov K.V. Matematicheskie metody obucheniya po pretsedentam. Kurs lektsii po mashinnomu obucheniyu [Mathematical methods of learning from examples. Course of lectures on machine learning]. 2011. http://www.machinelearning.ru/wiki/images/6/6d/Voron-ML1.pdf
Vu D. Generating Word Clouds in Python. 2019. https://www.datacamp.com/community/tutorials/wordcloud-python
Zakharov V.P. Funktsionalnost instrumentov korpusnoi lingvistiki. Strukturnaya i prikladnaya lingvistika [Functionality of instruments of corpus-based linguistics. Structural and applied linguistics]. SPb.: Publishing Office of Saint Petersburg University, 2019. V. 12. P. 81-95.
7 The Use of the Mathematical Apparatus of Spatial Granulation in the Problems of Perception and Image Recognition
Sergey A. Butenkov1, Vitaly V. Krivsha2* and Nataly S. Krivsha1
1 Southern Federal University, Academy of Engineering and Technology, Department of Mathematics, Taganrog, Russia
2 Southern Federal University, Academy of Engineering and Technology, Scientific Research Center of Supercomputing, Taganrog, Russia
*Corresponding author: [email protected]
Abstract
The current chapter presents new techniques and algorithms for the perception and recognition of images, based on a mathematical approach to spatial data granulation. All stages of image perception and recognition (Computational Intelligence problems) are performed with a very general and unified spatial granule model that optimally represents large amounts of data for an arbitrary space dimension and under conditions of information uncertainty. The main result is a technical implementation of L. Zadeh's Machine of Granular Computing. The presented techniques and algorithms may be adapted to large classes of image-like data, such as the spectral portraits of various signals, etc.
Keywords: Affine space, character recognition, fuzzy relation, granulation algorithm, matroid, perceptron, Shannon's entropy, vague regions
7.1 Introduction
There is increasing interest in image processing in diverse application areas, such as multimedia data processing, secured image data communication, biomedical imaging, automatic biometrics, remote sensing, pattern recognition and so on. As a result, it has become extremely
important to provide a new and useful approach to digital image processing. We attempt to introduce a new engineering strategy into the whole scope of image processing and recognition, based on the theory of Information Granulation (IG) by Lotfi A. Zadeh [Zadeh, 1997] and Y. Yao [Yao, 2000].
The first section introduces the fundamentals of image processing techniques and also provides a window into the overall organization of the chapter. The second section deals with the main principles of human visual perception. In the third section the mathematical preliminaries of digital image representation are introduced. The fundamentals of IG theory are introduced in section 7.5. In section 7.6 the main theoretical and practical results of our granulation technique are introduced. Section 7.7 is devoted to entropy-preserving data granulation, and section 7.8 to greedy granulation algorithms. The applications of the new approach in digital image processing are considered in section 7.9.
The audience for this chapter will be undergraduate and graduate students in universities all over the world, as well as teachers, scientists, engineers and professionals in R&D and research labs, for their ready reference.
7.2 The Image Processing and Analysis Base Conceptions
Human perception has the capability to acquire, integrate, and interpret all the abundant visual information around us, with its variety of forms and shapes, colors and textures, motion and tranquility [Marr, 1982]. Automatic visual systems must achieve final results compatible with human perceptual abilities.
7.2.1 The Main Stages of Image Processing
The initial step towards creating an image analysis system is digital image acquisition using sensors in the optical (or another) wavelength domain. The two-dimensional image recorded by these sensors is a mapping of the three-dimensional visual world. The captured two-dimensional signals are sampled and quantized to yield digital images [Acharya & Ajoy, 2005]. Segmentation is the process that subdivides a digital image into a number of uniformly homogeneous regions. Each of these regions is a constituent part or object of the entire scene. In other words, in segmentation an image is divided into a set of regions that are connected and non-overlapping.
After extracting the set of segments, the next task is to define a set of meaningful features such as texture, color, and shape. These are important measurable entities which give measures of various properties of image segments; each segmented region in a scene may be characterized by a set of such features. Finally, each segmented object is assigned to one of a set of meaningful classes. The problems of scene segmentation and object classification are two integrated areas of study in machine vision. Another practically useful aspect of image processing involves compression and coding of the visual information. The storage requirements of digital imagery are growing explosively beyond the capacity of classical databases. Compact representation of image data and their storage and transmission is a crucial and significant area of development today.
7.2.2 The Fundamentals of a New Hybrid Approach to Image Processing
This chapter considers a new approach to image processing and analysis, based on an advanced algebraic method of digital image modeling and processing. The new method is based on the representation of images as a set of compact geometric objects in the image data space; these objects are called granules, according to L. Zadeh's definition. The properties of such a representation are studied; in particular, the concept of complex rotation of a digital image is introduced.
7.2.3 How is this New Approach Different?
With the growth of diverse applications, it became a necessity to provide a fresh look at a novel image processing approach. To our knowledge, there is no other technique that covers all the stages of image processing within one framework:

1. We present a very general mathematical model for all the stages of image processing and analysis; it is not necessary to spend computing resources and processing time on data transformation between the processing stages.
2. The provided compact representation of digital images is very efficient for data storage and retrieval in databases.
3. We provide efficient processing algorithms from the greedy algorithms family for the initial data processing (data granulation).
4. There is no necessity for data compression/decompression operations.
5. The algebra of the developed mathematical model is very rich in data operations and relations.
6. At the final stage there is no need for special image visualization procedures.

Our new technique is a mathematically complete implementation of the Granular Computing Guiding Principle proposed by L. Zadeh [Zadeh, 1999].
7.3 Human Visual Perception Modeling
The main paradigm of Zadeh's Theory of Information Granulation is the mathematical modeling of the human ability to process and understand raw images corrupted by different kinds of distortion and misrepresentation [Zadeh, 1997]. In a common sense, each discrete image may be viewed as a model of a large-scale system, because even a small image usually contains more than a hundred thousand elements (pixels). Another image attribute is the high complexity and the lack of understanding of semantic relations both between single pixels and between groups of pixels (clusters or granules). According to the Theory of Information Granulation, the mentioned relations are the main information medium; their structure is naturally hierarchical, very badly formalized and mathematically inexpressible [Zadeh, 1979]. More generally, in the context of Granular Computing (GrC) a basic assumption is that an image (especially a natural image) may be interpreted as a complex constraint (relation). In this interpretation, a binary (bilevel) image is the canonical form of an image. According to Zadeh's Guiding Principle, the depth of explanation (understanding) of an image is a measure of the effort involved in transforming the image into its canonical (Euclidean) form (Zadeh's explanation) [Zadeh, 1999]. The perceptual attributes of wide families of corrupted and low-quality digital images may be classified in order to design very general mathematical formalisms.
7.3.1 Perceptual Classification of Digital Images
Different kinds of images are encountered in image processing and understanding practice. Some are very easily processed by
classical pattern recognition algorithms, but other kinds of images are not amenable to automatic processing. A very important property of each kind of information is its internal structure and complexity. According to [Butenkov, 2007], we consider the following classification of input images:

1. Structured images, for example Euclidean geometry objects (points, straight lines, etc.);
2. Weakly structured images; a frequent example is images of the previous kind, but corrupted and/or noised and overlapped;
3. Raw images, for example natural scene images [Butenkov, 2004].

Figure 7.1 presents an example of an artificial image of an enterprise objects scene. Well-known image processing techniques can be used only for the analysis of such well-structured images. We can see that the main reason why weakly structured object-background images arise is invisibility (camouflage). Figure 7.2 presents the
Figure 7.1 Well-structured scene images, designed for image understanding training.
Figure 7.2 Simplest weakly-structured (camouflaged) scene images.
Figure 7.3 Examples of raw scene images.
Figure 7.4 The different kinds of low-resolution digital image displaying [Butenkov, 2003].
scenes that can't be processed by automated image analysis systems [Pratt, 2007]. Figures 7.3 and 7.4 show visually distorted (raw) images resulting from object clutter and the extremely low resolution of the video sensor. This is one of the important problems, common to Automated Target Recognition (ATR) [Baldwin et al., 1997]. The problem of object-background separation has been stated by recent research in Gestalt psychology [Marr, 1982]. The main efforts in this domain have not produced a clear and global theory of perception; moreover, there are well-known optical illusions in which object and background perception switch. Nor do researchers provide a general, common algorithm for segmenting the object of interest.
7.3.2 The Vague Models of Digital Images
According to the Marr paradigm of vision [Marr, 1982], well-structured images may be easily decomposed into digital feature vectors. Also, the
original image must be presented as a unified vector in the real-valued Euclidean space [Butenkov et al., 2016]. One of the main problems of image regularization is the presence of uncertainty. There are two main uncertainty sources [Worboys, 1998]:

1. Image distortion and the resulting vagueness;
2. Knowledge distortion uncertainty.

Vagueness is very important in the initial stages of image processing because of the very complicated problem of assigning each pixel either to the object or to the background [Erwig & Schneider, 1997]. Uncertainty is usual in the last stages of image understanding, because different kinds of mathematical models are involved. The final results of image processing are closely related to the approximation of the original image (as one kind of regularization) [Butenkov, 2003]. The well-known approximation techniques provide an illusion of accuracy, because the original image information is replaced by the information of the feature vector [Worboys, 1998]. A one-dimensional representation of two-dimensional data in fact increases the uncertainty [Cooke & Bez, 1984]. In recent works [Winter, 1999], [Butenkov, 2003] et al., the new concept of vague regions and vague processing in Rn was introduced. These results may be used both for images and for GIS data. Mainly two kinds of spatial vagueness can be distinguished: uncertainty is traditionally equated with randomness and chance occurrence and relates either to a lack of knowledge about the position and shape of an object with an existing, real boundary, or to the inability to measure such an object precisely [Worboys, 1998]. The key role in data (digital image) uncertainty and vagueness management [Butenkov, 2004a] belongs to the new approach to data modeling called Information Granulation [Zadeh, 1997] or Granular Computing [Yao, 2000], [Bargiela & Pedrycz, 2002]. In the next section we introduce the fundamental mathematical definitions for the formalization of image processing.
7.4 Mathematic Modeling of Different Kinds of Digital Images
In the present chapter we don't consider the initial stages of image acquisition; the different facets of optics and sensors may be retrieved from
numerous tutorials [Acharya & Ajoy, 2005]. We consider visual data already acquired and transformed into digital images [Rosenfeld & Kak, 1982], [Pratt, 2007]. Such digital data models are the foundation of all processing algorithm design.
7.4.1 Images as a Special Kind of Spatial Data
From our point of view, digital images are a special kind of data in a multidimensional algebraic space [Butenkov, 2009]. The basic ideas of image modeling are constrained by the workings of the human visual system [Marr, 1982] and of human color vision [Wyszecki & Stiles, 1982]. Because the human eye contains a flat discrete grid of light sensors, the primary image model supposition is based on the concept of the picture plane as the Euclidean space R2 [Butenkov, 2009]. Figure 7.4 shows different manners of digital image displaying. The mathematical foundations of digital image modeling and processing are the theory of algebraic systems and graph theory [Butenkov & Zhukov, 2009]. The usual representation of a digital image is a rectangular matrix which contains all pixels of the original image, taking into account the specified kind of pixel neighborhood [Kong & Rosenfeld, 1989]. If the image dimensions are M and N, we can define the image model as the matrix of altitude function values
f[i, j] = k, i = 1…M, j = 1…N, k ∈ {1,…,W},
(7.1)
where W is the highest value of f(·,·) on the original image [Acharya & Ajoy, 2005]. A graph T = (V, E) consists of a set V of vertices (or nodes) and a set E ⊆ V × V of pairs of connected vertices. In an unordered graph the set E consists of unordered pairs (α, β); each pair (α, β) in an unordered graph is called a graph edge. A weighted graph is a triple T = (V, E, ω), where ω : E → R is a weight function defined on the edges. A valued graph is a triple T = (V, E, f), where f : V → R is a value function defined on the graph vertices. The most important graph property is connectivity: vertices which share a common edge are called connected [Kong & Rosenfeld, 1989]. A digital grid is a special kind of graph. Usually one works with a finite rectangular grid D ⊂ Z2, whose vertices are called pixels. The size of D is the total number of points in D. The set of pixels D can be related to the graph structure T = (V, E) by taking for V the domain D, and for E a certain subset of Z2 × Z2 defining
the graph connectivity. The usual choice is 4-connectivity, i.e., each point has edges to its horizontal and vertical neighbors. Another usual choice is 8-connectivity, where a point is additionally connected to its diagonal neighbors [Kong & Rosenfeld, 1989]. A digital gray-level image is a triple
I = (D, E, f),
(7.2)
where (D, E) is a graph (usually the digital grid described above) and f : D → N is a function assigning an integer value to each image pixel p ∈ D. The value f(p) is called the gray value or altitude (considering f as a topographic relief). A particular kind of digital image is the binary image GB = (B, E, f) with the value function f : D → {0,1}. Binary images may be viewed as the result of segmenting an image into object and background. This kind of digital image may be obtained from a gray-level image by a thresholding algorithm for a selected threshold value a ∈ N:
B = {p ∈ D | f(p) ≤ a}.
(7.3)
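In NumPy terms, the thresholding (7.3) is a one-line mask operation on the altitude matrix; the image and threshold value here are arbitrary toy data:

import numpy as np

f = np.random.randint(0, 256, size=(8, 8))  # toy gray-level image f[i, j]
a = 128                                     # selected threshold value
B = f <= a  # boolean mask of "black" pixels: the binary image of (7.3)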
The color digital images are compositions of several spectral-range intensity-level images (7.2) [Wyszecki & Stiles, 1982]. An image is thus defined by a set of regions that are connected and non-overlapping, so that each pixel in the image acquires a unique region label that indicates the region it belongs to. The set of objects of interest in an image, which are segmented, undergoes subsequent processing, namely scene description [Acharya & Ajoy, 2005]. Digital image segmentation involves formally partitioning an image into a set of homogeneous and meaningful regions, such that the pixels in each partitioned region possess an identical set of properties or attributes. These properties may include gray levels, contrast, spectral values, or textural properties. Each segmented region is usually called an object of interest; the image area that does not include objects of interest is called the scene background. A complete segmentation of an image I (7.2) involves identification of a finite set of objects (regions) I1, I2, …, IN such that:

1. I = I1 ∪ I2 ∪ … ∪ IN;
2. Ii ∩ Ij = ∅, ∀i ≠ j;
3. P(Ii) = true, ∀i;
4. P(Ii ∪ Ij) = false, i ≠ j.
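For a label-image representation of a segmentation, conditions 1 and 2 can be checked mechanically, as in the sketch below; conditions 3 and 4 depend on the chosen homogeneity predicate P and are omitted:

import numpy as np

def check_partition(label_img, n_regions):
    # Condition 1: the regions cover I, so every pixel carries a valid label.
    covered = np.isin(label_img, np.arange(n_regions)).all()
    # Condition 2: regions are disjoint automatically, since each pixel holds one label.
    return bool(covered)

labels = np.zeros((4, 4), dtype=int)
labels[2:, :] = 1                  # a toy two-region segmentation
print(check_partition(labels, 2))  # True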
There may exist a number of possible partitions (sets of objects), but the selection of an appropriate set of regions depends on the choice of the property P associated with a region [Baldwin et al., 1997]. In the next section we propose the concepts necessary for the design of image processing algorithms [Rogozov et al., 2013]. Most of the requirements 1–4 above may be formalized by means of mathematical topology [Kong & Rosenfeld, 1989].
7.4.2 Fundamentals of Topology and Digital Topology
In Euclidean space, point set topology has turned out to be an appropriate theory to characterize topological relationships between spatial objects. Therefore, we would like to transfer its well-known and desirable properties to an appropriate finite topology. An arbitrary pair of sets (X, T) can be defined as a topological space if the axioms below are satisfied:

1. X ∈ T, ∅ ∈ T;
2. U ∈ T, V ∈ T ⇒ U ∩ V ∈ T;
3. S ⊆ T ⇒ ∪U∈S U ∈ T.
The set T is called a topology for X. The elements of T are called open sets, and their complements in X closed sets. Point set topology mainly considers infinite point sets having the property that an arbitrarily small neighborhood of a point contains infinitely many other such points. This contradicts the nature of a discrete grid-based point, whose neighborhood contains at most a finite number of other points. Digital topology [Kong & Rosenfeld, 1989] is the study of the topological properties of discrete images and is therefore a possible candidate for modeling topological properties of crisp and fuzzy discrete spatial objects. The following short summary reveals its fundamental conceptions on a discrete domain compared to point set topology on a continuous domain. The underlying space of digital topology is the digital plane Z2. Let S ⊂ Z2; the points in S are then called black points, and the points in Z2 − S are termed white points. A possible way to process images with a minimal increase of uncertainty is data regularization [Yao, 1976], [Zadeh, 1999], [Winter, 1999], [Butenkov, 2003].
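The two usual connectivities differ only in the neighbour set of a pixel; a small helper makes the distinction concrete:

def neighbors(p, connectivity=4):
    """Grid neighbours of pixel p = (i, j) under 4- or 8-connectivity."""
    i, j = p
    n4 = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    if connectivity == 4:
        return n4
    # 8-connectivity adds the diagonal neighbours.
    return n4 + [(i - 1, j - 1), (i - 1, j + 1), (i + 1, j - 1), (i + 1, j + 1)]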
7.4.3 Regularity and the Digital Topology of Regular Regions
In different fields of spatial data processing (images, geoinformation systems data, etc.) we encounter kinds of data where the registered objects
are shapeless domains of data. The usual method of describing such objects is separating and measuring the domain border [Worboys, 1998]. Point set topology distinguishes different parts of an arbitrary point set A ⊂ X, namely its boundary ∂A, its interior A°, and its exterior A−, which are pairwise disjoint. The union of A and ∂A is the closure of A, denoted cl(A). An arbitrary domain A of a topological space is called regularly closed if the constraint below is satisfied:
A = cl(A°).
(7.4)
Sets of regularly closed domains will be called regularly closed objects [Butenkov, 2003]. To make a data set fully regularly closed we can perform the regularization operation:
reg(A) = cl(A°).
(7.5)
As a result of regularization, each open domain is replaced by the nearest closed domain, all holes are filled, etc. [Butenkov, 2004a]. The main problem is to choose the optimal alternative among all the kinds of digital topology for regular shape representation (see the paragraph above). According to [Butenkov, 2003], the interior and the boundary of a regular domain must be treated differently. We introduce the notation M/N topology for such objects, where M is the boundary topology and N the interior topology. A few examples of such objects are shown in Figure 7.5. For our purposes, only the 4/4 objects may be regularly closed according to (7.4), because for other kinds of topology the neighborhood relation between the domains does not hold. Also, the regular 4/4 objects do not suffer from the digital topology paradoxes [Kong & Rosenfeld, 1989]. Note that according to (7.4), on the one hand a regularly closed domain must be closed (e.g., ∂A ≠ ∅), but on the other hand its interior may be an
Figure 7.5 The different types of interior/exterior topology objects: (a) – 8/8 topology; (b) – 4/8 topology; (c) – 4/4 (regular) topology.
Figure 7.6 Vague regions and their regularization example.
empty set (e.g., a single point with the 4-neighbor topology is regularly closed). Thus, each set of points on an arbitrary grid may be covered by a set of regular domains and thereby presented as a regularly closed object [Butenkov, 2009]. Figure 7.6 shows an example of an arbitrary region on the picture plane and the result of domain regularization by a covering with 4/4 domains. In analytical geometry all 4/4 topology objects are called Cartesian domains. Data representation by a covering with Cartesian regions is the fundamental idea of the Theory of Information Granulation (TIG) introduced by Lotfi A. Zadeh [Zadeh, 1997]. Advances of TIG are used in our papers to provide a special kind of data granulation – spatial granulation [Butenkov et al., 2016].
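A rough digital illustration of the regularization (7.5): if the discrete interior is implemented as an erosion and the closure as a dilation by the same 4-neighbour structuring element, then reg(A) = cl(A°) becomes a morphological opening. This is only a sketch under that specific choice of operators; the chapter's own 4/4 definition differs in details (for instance, in how single points are treated):

import numpy as np
from scipy.ndimage import binary_opening, generate_binary_structure

A = np.zeros((12, 12), dtype=bool)
A[3:9, 3:9] = True  # a solid Cartesian domain
A[5, 9:11] = True   # a one-pixel-wide protrusion on its boundary

struct = generate_binary_structure(2, 1)     # 4-neighbour cross element
reg_A = binary_opening(A, structure=struct)  # erosion (interior), then dilation (closure)
# The thin protrusion vanishes; the bulk of the domain is preserved.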
7.5 Zadeh's Information Granulation Theory
Granularity is an intrinsic feature of an object itself [Bargiela & Pedrycz, 2002] and describes the vagueness of an object which certainly has an extent but inherently cannot have an exactly fixed boundary (e.g., between a nose and a cheek). At least four alternatives have been proposed as general design methods:

1. Exact models [Pratt, 2007], which transfer type systems and concepts for spatial objects with sharp boundaries to objects with vague boundaries;
2. Models based on rough sets [Yao, 1976], which work with lower and upper approximations of spatial objects;
3. Probabilistic models [Acharya & Ajoy, 2005], which predominantly model positional and measurement uncertainty;
4. Models based on fuzzy sets [Baldwin et al., 1997], which predominantly model fuzziness.
Traditional granular modeling approaches are mainly synthesis models that use one-dimensional features. These flat modeling approaches work very well in problem domains where there is little or no dependency between the input variables. A general view of Granular Computing (GrC) was given in papers by L. Zadeh [Zadeh, 1979] and by Y. Yao, who introduced the term "Granular Computing" [Yao, 2000]. The importance of the concept of a Cartesian granule derives in large measure from its role in what might be called encapsulation. More specifically, consider the granules Gx and Gy defined by a possibilistic constraint [Zadeh, 1997]. The Cartesian granule G+ = Gx × Gy encapsulates such one-dimensional granules in the sense that it is the least upper bound of the Cartesian granules which contain the Cartesian product of the original granules; thus the granule G+ can be used as an upper approximation of the original granules [Zadeh, 1999]. Figure 7.7 shows the Cartesian granulation (encapsulation) of a digital image [Butenkov et al., 2006]. The presented technique encapsulates regions of arbitrary shape by regular Cartesian granules (see Figure 7.8); as a result we have only 11 information granules, but that is enough to separate the object parts from the image background. Figure 7.9 shows the spatial representation of the encapsulated digital image. Clearly the encapsulated model implies an increase of entropy, so we must use an optimal granulation algorithm to minimize the loss of information [Butenkov & Zhukov, 2009]. The presented technique allows us to implement the most common visual perception paradigm [Marr, 1982]. We present a new approach to geometric reasoning and operations with spatial data under uncertainty. The approach is based on geometric regularization and uses an estimation of Shannon's entropy for the granulated data.
Figure 7.7 The example of histogram analysis: original image (a) and the primary grey levels histogram (b).
Figure 7.8 The example of digital image encapsulation (gray-level image object detection): original image (a), the successive steps of image part detection and encapsulation (b) – (d) and the final step of granules sorting (e).
Figure 7.9 3-D view of encapsulated image granules.
The main advantages of the developed granulation technique are:

1. The crisp granules obtained may be analytically presented and topologically regularized;
2. The granules are intrinsically related to rectangular grids and data representations;
3. The model is invariant under the group of translations in the digital space;
4. The dual models over the granules can represent objects of different dimensions (points, vectors and areal objects);
5. The data representation is compact and computationally light.

This approach has been successfully applied at different stages of image processing (feature selection, background/foreground separation, data compression, etc.).
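As a small illustration of the encapsulation underlying these advantages: in coordinates, the covering Cartesian granule of a data cloud is just the product of its per-axis projection intervals. A NumPy sketch for arbitrary dimension, with toy data:

import numpy as np

def encapsulate(points):
    """Cartesian granule +G covering a cloud of points (rows of `points`):
    the product of the [min, max] projection intervals on each axis."""
    points = np.asarray(points, dtype=float)
    return points.min(axis=0), points.max(axis=0)  # opposite granule corners

cloud = np.array([[1.0, 2.0], [3.5, 0.5], [2.0, 4.0]])
lo, hi = encapsulate(cloud)
print(lo, hi)  # [1.  0.5] [3.5 4. ]  -> granule [1, 3.5] x [0.5, 4]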
7.6 Fundamentals of Spatial Granulation
In this chapter we propose a new view of approaches to representing large digital images that makes use of Information Granulation Theory and Granular Computing methods. A framework of GrC can be improved by applying the inherited principles of Spatial Granulation. We examine the proposed framework from new perspectives of granular computing.
7.6.1 Basic Ideas of Spatial Granulation
Granular data models, as the name stipulates, are modeling constructs built at the level of information granules. Mappings between the granules express the relationships captured by such models. The granularity of information explicitly built into the construct offers interesting and useful features of the model, including its evident transparency and flexibility. Fuzzy rule-based systems (models) are typical and commonly encountered examples of granular models. These systems are highly modular and easily expandable fuzzy models composed of a family of conditional statements (rules) in whose conditions and conclusions fuzzy sets occur [Zadeh, 1999]. In general, we may talk about rules embracing information granules expressed in any other formalism. Such models support a principle of locality and a distributed nature of modeling, as each rule can be interpreted as an individual local descriptor of the data (problem) which is invoked by the fuzzy sets defined in the space of conditions (inputs). To provide common formalisms, both algebraic and geometric, we develop a common approach to Spatial Granulation in an abstract vector space [Butenkov, 2009], [Butenkov & Zhukov, 2009]. In Information Granulation Theory especially, spatially or temporally adjacent data values are thought to form abstract patterns, which can be regarded as ordered sets of real numbers.
7.6.2 Abstract Vector Space
Let K be an arbitrary number field and n ∈ N a number. An n-dimensional vector over the number field K is a tuple of n elements of K; the tuple elements are called the vector coordinates. The n-dimensional number space over the number field K is the set of all n-dimensional number vectors over this field. To cover the common data types we give a very general definition of an abstract vector space [Butenkov & Zhukov, 2009].
Definition 1. An arbitrary set of elements L is called a vector space over the number field K if:

a) There is an algebraic operation which, for a pair of elements a, b ∈ L, yields their sum c = a + b, c ∈ L.
b) There is another algebraic operation which, for an element a ∈ L and a number k ∈ K, yields c = ka, c ∈ L.
c) Both operations satisfy the following axioms:

I. For arbitrary elements a, b, c ∈ L:
a) a + b = b + a (commutativity);
b) (a + b) + c = a + (b + c) (associativity).
II. In L there is a zero element 0 such that a + 0 = a for all elements a ∈ L.
III. For every element a ∈ L there is an element −a, called the inverse of a, such that a + (−a) = 0.
IV. For arbitrary elements a, b ∈ L and arbitrary numbers k1 and k2 from the number field K:
a) k1(k2a) = (k1k2)a;
b) (k1 + k2)a = k1a + k2a;
c) k1(a + b) = k1a + k1b.
V. For each element a ∈ L the statement 1a = a holds.

According to the given axioms I–V, every element satisfying all the axioms is an abstract vector. In an m-dimensional vector space there are basis vectors (orts) e1, e2, …, em for the m axes. Each abstract vector V can be presented as the decomposition V = v1e1 + v2e2 + … + vmem, where the vi are the vector coordinates. If the basis vectors e1, e2, …, em are not provided with a metric for the current problem, this abstract space is an affine abstract space; otherwise it is a Euclidean (metric) abstract space.
For very important families of physical and scientific data the problem of the space metric is essential: most spatial data analysis and processing methods are based on the space metric. Because several quite different approaches to the measurement problem exist, many popular metrics are introduced coarsely and from incorrect basic assumptions. A very common approach to data modeling is affine modeling [Butenkov, 2009].
7.6.3 Abstract Affine Space

Affine space is very useful for studying common figure properties when an arbitrary coordinate transformation is required. In affine space we obtain very general geometric methods for data representation [Butenkov, 2009]. The main notion of our approach is the determinant in affine space.

Definition 1. The n-dimensional determinant in affine space is a function of n n-dimensional vectors, F(¹a, ²a, …, ⁿa) = |¹a, ²a, …, ⁿa|, that satisfies the following axioms:
a) F(¹a, ²a, …, ⁿa) is a linear function of each argument;
b) if there is a pair of linearly dependent vectors among ¹a, ²a, …, ⁿa, then F(¹a, ²a, …, ⁿa) = 0;
c) |¹e, ²e, …, ⁿe| = 1.

Note that we put the vector number on the left side of the symbol, so that it is not confused with a vector index. From the geometric point of view, the value of the n-dimensional determinant defines a certain measure correlated with the figure spanned by the n vectors of the determinant [Butenkov, 2000] (oriented area, oriented volume, etc.). In our papers a special case of geometric models is provided for affine space vectors; it is necessary for the geometric interpretation of the original data vectors.
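To make the geometric reading of Definition 1 concrete, the following minimal sketch evaluates such a determinant numerically; the function name and the example values are illustrative assumptions, not part of the original apparatus.

```python
# A minimal sketch of Definition 1: the n-dimensional determinant of n
# vectors as an oriented measure (area for n = 2, volume for n = 3).
import numpy as np

def oriented_measure(vectors):
    """F(1a, 2a, ..., na): determinant of n n-dimensional vectors,
    interpreted as the oriented area/volume of the figure they span."""
    m = np.column_stack(vectors)          # each vector is one column
    return np.linalg.det(m)

# Two plane vectors span a parallelogram of oriented area 1*3 - 2*0 = 3.
print(oriented_measure([np.array([1.0, 0.0]), np.array([2.0, 3.0])]))  # 3.0
```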
7.6.4 Cartesian Granules in an Affine Space

The common definition of a Cartesian information granule by L.A. Zadeh may be refined for Cartesian granules in an affine space.

Definition 2. Let ¹G, …, ⁿG be arbitrary information granules of dimension m = 1 for the numerical variables U1, …, Un; then the Cartesian product Gⁿ = ¹G × ⋯ × ⁿG is a Cartesian granule of dimension n.
A cloud of points A in affine space, with projections proj_{x1}A and proj_{x2}A on the different axes, must be covered by the Cartesian product of the projections ⁺G₂ of dimension n = 2. A similar technique is called information encapsulation by L.A. Zadeh in [Zadeh, 1979]. The next definition is provided in [Butenkov, 2009]:

Definition 3. The Cartesian granule ⁺G, defined as ⁺G = G_{x1} × G_{x2}, is the encapsulation of the original information granule G in the sense of the supremum of the family of Cartesian granules containing G.

From the geometric point of view, the Cartesian granule is a Cartesian product on the basic axes of the affine space. A spatial encapsulation example for n = 2 is presented in Figure 7.10. The Cartesian granule model (algebraic model) is introduced in our papers as a special case of the determinant. In Cartesian coordinates the model parameters are defined by the n affine space vectors ⁱx, i = 1, …, n:

$$ {}^{+}G_n(A) = \begin{vmatrix} \min({}^{i}x_1) & \min({}^{i}x_2) & \cdots & \min({}^{i}x_n) & \sigma(-1)^n \\ \max({}^{i}x_1) & \min({}^{i}x_2) & \cdots & \min({}^{i}x_n) & \sigma(-1)^n \\ \vdots & \vdots & & \vdots & \vdots \\ \max({}^{i}x_1) & \max({}^{i}x_2) & \cdots & \max({}^{i}x_n) & \sigma(-1)^n \end{vmatrix} \tag{7.6} $$
The basis vectors of the model (7.6) are the corners of the Cartesian granule, presented in Figure 7.11. The model calculation method (granule encapsulation) is based on the determinant properties [Butenkov, 2009] and uses the same model (7.6).

Figure 7.10 Example of a spatial encapsulation by the ⁺G₂ Cartesian granule.

Figure 7.11 Subsets of encapsulated Cartesian granules on the plane.

For the n = 2 affine space we can provide a common formula based on (7.6). We use the m points (ⁱx₁, ⁱx₂), i = 1, 2, …, m, of the original cluster A (see Figure 7.11) to calculate the parameters of the encapsulating granule ⁺G₂ by means of algebraic operations:
$$ {}^{+}G_2(A) = \begin{vmatrix} \min({}^{i}x_1) & \min({}^{i}x_2) & 1 \\ \max({}^{i}x_1) & \min({}^{i}x_2) & 1 \\ \max({}^{i}x_1) & \max({}^{i}x_2) & 1 \end{vmatrix}, \quad i = 1, 2, \ldots, m. \tag{7.7} $$
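The corner-and-determinant model (7.7) is easy to evaluate numerically. The sketch below is a hedged illustration: it assumes the granule is stored simply as its min/max parameters, and the helper names are ours, not the authors'.

```python
# A sketch of the encapsulation formula (7.7): the Cartesian granule +G2
# covering m points is described by its min/max corners, and the 3x3
# determinant below returns the granule area.
import numpy as np

def encapsulate_2d(points):
    """Return the corner parameters of +G2 for an m x 2 point array,
    together with the determinant-based area."""
    pts = np.asarray(points, dtype=float)
    x1_min, x2_min = pts.min(axis=0)
    x1_max, x2_max = pts.max(axis=0)
    corners = np.array([[x1_min, x2_min, 1.0],
                        [x1_max, x2_min, 1.0],
                        [x1_max, x2_max, 1.0]])   # rows of (7.7)
    area = np.linalg.det(corners)  # equals (x1_max-x1_min)*(x2_max-x2_min)
    return (x1_min, x1_max, x2_min, x2_max), area

cloud = np.random.rand(100, 2) * [4.0, 2.0]
print(encapsulate_2d(cloud))
```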
Similar formulas may be provided for an arbitrary space dimension n. Next, a refinement of the encapsulating granule (7.6) may be calculated by different criteria of granule quality. For the granulated space refinement we introduce the common formula, based on (7.6) and (7.7), as:
$$ {}^{+}G_n(A) = \begin{vmatrix} \min({}^{i}x_1) & \min({}^{i}x_2) & \cdots & \min({}^{i}x_n) & \sigma(-1)^n \\ \max({}^{i}x_1) & \min({}^{i}x_2) & \cdots & \min({}^{i}x_n) & \sigma(-1)^n \\ \vdots & \vdots & & \vdots & \vdots \\ \max({}^{i}x_1) & \max({}^{i}x_2) & \cdots & \max({}^{i}x_n) & \sigma(-1)^n \end{vmatrix}. \tag{7.8} $$
The presented determinant in affine space is related to a measure over the figure covering the original data cluster (area, volume, etc.). On the basis of this granule measure we provide a family of derivative measures on the granule models (7.6), organizing a collection of formulas for granule manipulation.
7.6.5 Granule-Based Measures in Affine Space

The measure concept is the basis of several mathematical methods and many different techniques [Bargiela & Pedrycz, 2002]. For a pair of granules ⁱG, ʲG ∈ G the associativity measure is obtained as AS(ⁱG, ʲG) = m(ⁱG ∩ ʲG)/m(ⁱG). For the same pair ⁱG, ʲG ∈ G the covering measure is obtained as CV(ⁱG, ʲG) = m(ⁱG ∩ ʲG)/m(ʲG). A very important granule measure is the similarity measure for a pair of n-dimensional granules ⁱGₙ, ʲGₙ ∈ G encapsulated by the n-dimensional granule ⁺Gₙ(ⁱGₙ, ʲGₙ) ∈ G; according to (7.8) it is obtained as:
$$ \mathrm{SIM}({}^{i}G_n, {}^{j}G_n) = \frac{m({}^{i}G_n) + m({}^{j}G_n)}{m\!\left({}^{+}G_n({}^{i}G_n, {}^{j}G_n)\right)}, \tag{7.9} $$

where each measure m(·) is computed by the determinant model (7.8), and the corners of the encapsulating granule ⁺Gₙ are built from the values min(ⁱxₖ, ʲxₖ) and max(ⁱxₖ, ʲxₖ) taken over both granules.
Note that for data granulation by the measure (7.9) we do not need a distance measure in the data space, in contrast to all the well-known data clustering techniques. Most such methods are defined only for Euclidean space, not for affine space [Acharya & Ajoy, 2005]. The proposed binary measure does not have the transitivity property, because transitivity is impossible for non-primitive, compound objects. Nevertheless, we can provide a correct fuzzy classification procedure by the fuzzy transitive closure technique [Tamura et al., 1971].
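Under the compact reading of (7.9) given above (sum of the two granule measures divided by the measure of the encapsulating granule), the similarity can be sketched as follows; this is an assumption-laden illustration, not the authors' implementation.

```python
# A minimal sketch of the similarity measure (7.9) for two 2-D granules,
# assuming SIM = (m(iG) + m(jG)) / m(+G); granules are (x1_min, x1_max,
# x2_min, x2_max) tuples.
def granule_area(g):
    return (g[1] - g[0]) * (g[3] - g[2])

def encapsulate(gi, gj):
    return (min(gi[0], gj[0]), max(gi[1], gj[1]),
            min(gi[2], gj[2]), max(gi[3], gj[3]))

def sim(gi, gj):
    return (granule_area(gi) + granule_area(gj)) / granule_area(encapsulate(gi, gj))

# Matched granules give the maximum value; distant granules values near 0.
print(sim((0, 1, 0, 1), (0, 1, 0, 1)))    # 2.0
print(sim((0, 1, 0, 1), (9, 10, 9, 10)))  # 0.02
```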
7.6.6 Fuzzy Spatial Relation Over the Granular Models

The geometric content of the fuzzy relation (7.9), based on the examination of different subsets of granules according to the main ideas of [Winter, 1999], is developed in our papers. The next figure shows the Euler diagram for the subsets of encapsulated granules in R².
Figure 7.12 The common positions of two granules: non-crossed, touched by side, crossed, contained, contains, matched.
In the current chapter we consider only plane topological relations between flat objects [Winter, 1999], because digital images are presented on the picture plane [Rosenfeld & Kak, 1982]; however, the same relations may be designed for an arbitrary space dimension. The full taxonomy matrix of topological relations, designed on the basis of the subset relations (Figure 7.11), may be defined as
$$ \begin{pmatrix} {}^{i}G_2 \cap {}^{j}G_2 & {}^{i}G_2 \cap ({}^{+}G_2 \setminus {}^{j}G_2) \\ {}^{j}G_2 \cap ({}^{+}G_2 \setminus {}^{i}G_2) & {}^{+}G_2 \setminus ({}^{i}G_2 \cup {}^{j}G_2) \end{pmatrix}, \tag{7.10} $$
The common approach to using the subsets of the encapsulation ⁺Gₙ may be propagated to an arbitrary space dimension [Butenkov, 2007]. The common diagram of two granules ⁱGₙ and ʲGₙ (7.6) is presented in Figure 7.12. The topology relations matrix (7.10) is used in our papers for scene image description and for the structural analysis of arbitrary objects [Butenkov, 2004].
7.7 Entropy-Preserved Granulation of Spatial Data

In the following we use this notation. Let h(g) be the brightness histogram and p(w), w = 0, …, W, be the probability mass function of the gray level, where W is the maximum brightness level.
In the context of recognition, the image foreground (object) is the set of elements (e.g., pixels) with brightness less than a defined value T (T ∈ [1, W−1]), while the background elements have brightness above this threshold [Butenkov, 2007]. The probabilities for an element to belong to the foreground and the background are
$$ p_f(w),\ 0 \le w \le T; \qquad p_b(w),\ T + 1 \le w \le W, \tag{7.11} $$
respectively. The cumulative probabilities are defined as

$$ P_f(T) = P_f = \sum_{w=0}^{T} p(w), \qquad P_b(T) = P_b = \sum_{w=T+1}^{W} p(w). \tag{7.12} $$
The Shannon entropy is parametrically dependent upon the threshold value T for the foreground and background:

$$ H_f(T) = -\sum_{w=0}^{T} p_f(w)\log p_f(w), \qquad H_b(T) = -\sum_{w=T+1}^{W} p_b(w)\log p_b(w). \tag{7.13} $$
The total digital image entropy will be
$$ H(T) = H_f(T) + H_b(T). \tag{7.14} $$
For various other definitions of the entropy in the context of thresholding and regularization, with some abuse of notation, we use the same symbols H_f(T) and H_b(T). For the particular case of a binary digital image, the total entropy of the scene image can be defined as the sum of the foreground and background entropies:

$$ H_2 = H_f + H_b = -p_f \log(p_f) - p_b \log(p_b) = -p_f \log\frac{p_f}{1 - p_f} - \log(1 - p_f). \tag{7.15} $$
The measure of data corruption may be found as the difference between the original image entropy (7.15) and the granulated image entropy. Let V be the total number of 2-D granules (7.6) over the M × N binary discrete image. We can calculate the total entropy of the granulated 2-D image as

$$ H_{G_2} = -\frac{\sum_{i=1}^{V} \eta_{{}^{i}G_2}}{M \times N}\, \log_2 \frac{\sum_{i=1}^{V} \eta_{{}^{i}G_2}}{M \times N} - \left(1 - \frac{\sum_{i=1}^{V} \eta_{{}^{i}G_2}}{M \times N}\right) \log_2\!\left(1 - \frac{\sum_{i=1}^{V} \eta_{{}^{i}G_2}}{M \times N}\right), \tag{7.16} $$
where

$$ \eta_{{}^{i}G_2} = \begin{vmatrix} \min({}^{i}x_1) & \min({}^{i}x_2) & \sigma \\ \max({}^{i}x_1) & \min({}^{i}x_2) & \sigma \\ \max({}^{i}x_1) & \max({}^{i}x_2) & \sigma \end{vmatrix} $$

is the area of each 2-D granule ⁺ⁱG₂ according to (7.6). Also, the total quality criterion for the global digital image granulation is the minimization of the difference:
$$ J = \min_{\{{}^{i}G_2\}_{i=1}^{V}} \left( H_2 - H_{G_2} \right). \tag{7.17} $$
On the basis of the total entropy criterion (7.17) we can design a number of common granulation algorithms [Krivsha et al., 2016].
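A minimal numerical sketch of the entropy bookkeeping (7.15)-(7.17) for a binary image follows; the representation of granules as min/max boxes and all names are our assumptions.

```python
# A sketch of (7.15)-(7.17): entropy of the source binary image minus the
# entropy of its granulated version, assuming 1 = foreground, 0 = background.
import numpy as np

def binary_entropy(p):
    """Shannon entropy of a Bernoulli(p) source, in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def granulation_loss(image, granules):
    """J = H2 - H_G2; granules is a list of (x1_min, x1_max, x2_min, x2_max)
    boxes approximating the foreground of an M x N binary image."""
    m, n = image.shape
    h2 = binary_entropy(image.mean())                       # (7.15)
    covered = sum((g[1] - g[0]) * (g[3] - g[2]) for g in granules)
    hg2 = binary_entropy(covered / (m * n))                 # (7.16)
    return h2 - hg2                                         # minimized in (7.17)
```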
7.8 Digital Images Granulation Algorithms

For user-oriented computing systems the first restriction is related to high computation speed. The main digital image processing operations must support a real-time computing mode, which means that the algorithms used must be very fast (of minimal complexity). An appropriate kind of very efficient algorithms are greedy algorithms with linear computational complexity. The presented granular computing technique admits greedy algorithms for image granulation [Rogozov et al., 2013].
7.8.1 Matroids and Optimal Algorithms

The procedural basis of most efficient greedy algorithms is the transformation of the original problem into a problem over a matroid. There is a very general theorem which proves that on a finite set with a non-negative weight function a greedy algorithm can be designed for the problem of the minimal weight subset. Many well-known information processing problems may be transformed into a similar problem, for example, the restricted clustering problem [Krivsha et al., 2016].

Definition. A matroid M = ⟨E, ε⟩ is a finite set E with cardinality |E| = n and a family of its subsets ε ⊂ 2^E that satisfies the following axioms:
M₁: ∅ ∈ ε;
M₂: A ∈ ε & B ⊂ A ⇒ B ∈ ε;
M₃: A, B ∈ ε & |B| = |A| + 1 ⇒ ∃e ∈ B \ A: A ∪ {e} ∈ ε.
Consider now the matroid M and a weight function w : E → R⁺. The standard optimization problem on the matroid is the search for an element X ∈ ε that satisfies
$$ w(X) = \max_{Y \in \varepsilon} w(Y), \quad \text{where } w(Z) := \sum_{e \in Z \subset E} w(e), \tag{7.19} $$
i.e., find the maximum weight subset in the restricted family [Rogozov et al., 2013]. For different applications we can change the meaning of the weight function w and reduce many well-known optimization problems to the maximization problem (7.19); a minimal greedy scheme for (7.19) is sketched below. Now consider two kinds of algorithms for granulated data clustering, "top-down" and "down-top" algorithms. Both are based on the information measure for the granules, calculated by means of the topological similarity relation in the data space, but they use different approaches to granule selection.
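As mentioned above, the standard greedy scheme for (7.19) can be sketched as follows; the independence oracle is a stand-in for membership in ε, and the toy matroid is purely illustrative.

```python
# A minimal greedy scheme for problem (7.19): sort the elements by weight
# and keep each one whose addition preserves independence.
def greedy_max_weight(elements, weight, is_independent):
    chosen = set()
    for e in sorted(elements, key=weight, reverse=True):
        if is_independent(chosen | {e}):   # matroid axioms guarantee optimality
            chosen.add(e)
    return chosen

# Toy uniform matroid: any subset with at most 2 elements is independent.
best = greedy_max_weight([1, 2, 3, 4], weight=lambda e: e,
                         is_independent=lambda s: len(s) <= 2)
print(best)  # {3, 4}
```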
7.8.2 Greedy Image Granulation Algorithms

We provide two greedy algorithms for digital image granulation. Both are greedy algorithms (with linear computational complexity), but the first one is a division algorithm and is efficient
for solid input data. The other is an agglomerative algorithm and is efficient for sparse raw data sets [Krivsha et al., 2016].

Algorithm 1 (top-down algorithm)
It divides the data axes into equal intervals (uniform mesh design) with optimization of the subsets of mesh cells.
Input data: set of raw data points DtSet.
Output data: set of granules GrSet.
C1. Divide the Cartesian axes into equal intervals.
C2. Find the set TmpDtSet of all subsets PartSetᵢ of the current partition.
C3. Initialize the output data set GrSet = TmpDtSet.
C4. Traverse all subsets PartSetᵢ of the output data set GrSet.
C4.1. If PartSetᵢ ≠ ∅, then compress the subset; else delete the subset from GrSet.
C5. Output the data set GrSet.
The presented algorithm is based on an initial partition of the raw data by a mesh into a set of artificial clusters and their optimization (compression) or deletion.

Algorithm 2 (down-top algorithm)
It performs a binary tree traversal based on thresholding of the information measure.
Input data: set of raw data points DtSet; threshold value β of the information criterion (7.17).
Output data: set of granules GrSet.
C1. Initialize the output data set GrSet = DtSet.
C2. Traverse all subsets GrSetᵢ of the output data set GrSet.
C2.1. Calculate the information criterion InfMeg(GrSetᵢ) for the subset GrSetᵢ.
C2.2. If InfMeg(GrSetᵢ) < β, then:
C2.2.1. Traverse the i-th axis of the Cartesian basis.
C2.2.2. Divide the i-th axis interval of the subset GrSetᵢ into two equal intervals.
C2.2.3. Define for each i-th axis two new subsets of the output set GrSetᵢ.
C2.2.4. Find the optimal division by the information criterion.
C2.2.5. Replace the subset GrSetᵢ of the original data set GrSet with the two new subsets.
C3. Output the result of combining for the set GrSet.

Both the 1st and the 2nd algorithms provide the output result as a set of granules (a partition set) ⁱG ∈ G, i = 1, 2, …, n, for the input set of data points. In the next step of data optimization we must aggregate the adjoining granules of the output data set. The next algorithm performs granule aggregation with linear computational complexity.

Algorithm 3 (granules aggregation)
The presented algorithm differs from non-granulated data clustering algorithms such as k-means, ISODATA and other popular algorithms, because the number of initial clusters may be calculated automatically, which is very useful for practical purposes.
Input data: set of granules GrSet; encapsulation threshold α.
Output data: set of granule clusters ClSet.
C1. Initialize the output set ClSet = GrSet.
C2. Calculate the cardinality of ClSet: CardSet = |ClSet|.
C3. Traverse all granules of ClSet:
C3.1. Take the current granule ⁱGr from ClSet.
C3.2. Examine the remaining granules of ClSet.
C3.2.1. Take the current granule ʲGr from the remaining granules of ClSet.
C3.2.2. Calculate the similarity measure SIM(ⁱGr, ʲGr) (7.9) between ⁱGr and ʲGr.
C3.2.3. If SIM(ⁱGr, ʲGr) > α, then:
C3.2.3.1. Encapsulate the granules ⁱGr and ʲGr into the granule ⁺Gr (7.8).
C3.2.3.2. Delete the granule ⁱGr from ClSet.
C3.2.3.3. Replace the granule ʲGr in ClSet with the encapsulated granule ⁺Gr.
C4. Output the set ClSet.

Figure 7.13 shows examples of point cloud granulation by the 1st and the 2nd algorithms; the final optimization of the output granule set was performed by the 3rd algorithm [Krivsha et al., 2016]. A compact sketch of Algorithms 1 and 3 is given after Figure 7.13.
Figure 7.13 (a) The result of the Algorithm 1 execution over the solid points cluster; (b) the same for the Algorithm 2 execution over the low-density cluster.
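As announced above, the following compact sketch illustrates Algorithms 1 and 3 under simplifying assumptions (granules stored as min/max boxes, a fixed uniform mesh, and the similarity measure as reconstructed in (7.9)); it is not the authors' registered software.

```python
# A self-contained sketch of Algorithm 1 (uniform mesh, then compression of
# the non-empty cells) and Algorithm 3 (aggregation of granules whose
# similarity exceeds a threshold alpha).
import numpy as np

def box_area(g):                      # g = (x1_min, x1_max, x2_min, x2_max)
    return (g[1] - g[0]) * (g[3] - g[2])

def box_encapsulate(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), max(a[3], b[3]))

def box_sim(a, b):                    # similarity (7.9), as reconstructed
    return (box_area(a) + box_area(b)) / box_area(box_encapsulate(a, b))

def top_down(points, cells=8):
    """Algorithm 1: mesh the bounding box (assumed non-degenerate) and
    compress each non-empty cell to the bounding granule of its points."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    idx = np.minimum(((pts - lo) / (hi - lo) * cells).astype(int), cells - 1)
    granules = []
    for cell in {tuple(c) for c in idx}:
        inside = pts[np.all(idx == np.array(cell), axis=1)]
        (x1, x2), (X1, X2) = inside.min(axis=0), inside.max(axis=0)
        granules.append((x1, X1, x2, X2))      # step C4.1: compression
    return granules

def aggregate(granules, alpha=0.5):
    """Algorithm 3: repeatedly encapsulate granule pairs with SIM > alpha."""
    out, merged = list(granules), True
    while merged:
        merged = False
        for i in range(len(out)):
            for j in range(i + 1, len(out)):
                if box_sim(out[i], out[j]) > alpha:
                    out[i] = box_encapsulate(out[i], out[j])  # C3.2.3.1
                    del out[j]                                # C3.2.3.2-3
                    merged = True
                    break
            if merged:
                break
    return out
```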
The implementation of the two classes of granulation algorithms was examined. The first class, the "top-down" algorithm, is very good for solid point sets because it yields fewer output granules than the other algorithm. With the type 2 algorithm ("down-top") we can work with sparse data point sets, but the computational complexity is higher than for the first type.
7.9 Spatial Granulation Technique Applications

Overall, Cartesian granule features open up a new and very exciting avenue in Granular Computing models [Butenkov, 2009]. The main practical results are related to a full set of techniques for complete image processing and analysis (visual perception), implemented as a large program library of a digital image processing software system [Rogozov et al., 2013].
7.9.1 Granulation of Graphical Databases

Because the field of graphic databases shows continuous growth of the volume and complexity of the stored graphic data (GD) [Pedrysz, 2007], there is a need to develop new approaches (and methods based on them) to the representation and processing of the GD which would considerably (by orders of magnitude) increase the productivity of graphic database management systems (GDMS). The new methods are based on the introduction of an indistinguishability relation over subsets of points of the GD, which leads to a radical enlargement of the units of
information representation [Zadeh, 1997]. This process is called spatial granulation [Butenkov et al., 2016]. We consider an approach to the creation of universal GDMS and their optimization from the point of view of spatial granulation.

The classical theory and methodology of database management systems (DBMS) is based on relational data models. However, standard relational models are not suitable for the GD owing to its multidimensionality. Vector and raster models of the GD are used as alternatives, but these models are weak substitutes for normal forms of data representation. Moreover, such a DBMS requires special conversion operations (vectorizing or rasterizing) of the basic data for storage and, respectively, the reverse operations for processing. As a result, a GDMS most often uses a three-contour structure containing, besides the internal contour (a relational DBMS), substantial means of GD processing at the input and output of the GDMS (see Figure 7.14). To simplify this structure it is necessary to have a mathematical apparatus for the uniform description of different types of the GD, allowing the avoidance of additional intermediate manipulations with the GD during storage. The theory of granulation provides such an apparatus [Rogozov et al., 2013]. The idea of granulation is based on the indistinguishability of some collections of GD items. In particular, for graphic databases indistinguishability is expressed in the strong topological connectivity of data tuples [Rosenfeld, 1979]. Figure 7.15 shows an example of granulation of graphic data (an image) within the theory of [Zadeh, 1997], [Zadeh, 1999].

Figure 7.14 Three-level structure of GDMS on the basis of relational DBMS.
Figure 7.15 Methodology of granulation of the image by L. Zadeh.
The application of geometric (more generally, topological) methods allows squeezing the GD during storage in the GDB by granulation, considerably reducing the volume of the stored data. The most important difference between granules and the elements of a relational DB is the existence of an inner algebraic pattern, i.e., representation by means of determinants of a special form. Such structures allow storing both vector and raster GD in a uniform form. As a result we can use a simple single-contour GDMS structure in which granulation plays the same role as data normalization in a relational DBMS (see Figure 7.16). The criterion (7.17) for optimizing the representation of the GD in such systems is described explicitly in the cited works. It is based on the Shannon entropy of the source image before granulation and allows minimizing the information losses in the course of granulation (enlargement) of the GD. The algebraic operations on granules introduced in the cited works allow a computing kernel of the granulated GDMS to be built, developing the algebra of spatial granules from [Butenkov & Zhukov, 2009].
Figure 7.16 Optimal structure of the granulated GDMS.
The offered methodology of creating granulated GDMS makes it possible to obtain both a significantly simplified system structure (see Figure 7.16) and essential (by orders of magnitude) compression of the volume of the stored GD. An example of data granulation (according to the structure of Figure 7.16) is examined in the next section.
7.9.2 Automated Target Detection (ATD) Problem

The target (object of interest) detection and recognition problem is the initial step of all image processing algorithms [Acharya & Ajoy, 2005]. The main advantage of the obtained estimations and granulation techniques is their non-probabilistic quality: we do not need large samplings of pixels to obtain accurate estimation values. As a result, the granulation-based algorithms are very appropriate for low-resolution and visually distorted images [Butenkov, 2007]. Figure 7.17 shows the intermediate outcomes of the automated detection of a camouflaged object. Note that the well-known results in ATD are usually related to moving target detection; our software detects a motionless object on a non-homogeneous background.

Figure 7.17 The successive stages (1)-(6) of Automated Target Detection for the camouflaged object. (1) – Source digital grey-level image. (2) – Binary image (the result of segmentation). (3) – The grid lines for the optimal image granulation. (4) – Algorithm 1 outline result (initial granulation). (5) – Algorithm 3 outline result (total granules encapsulation). (6) – The object capture frame for the original image.
7.9.3 Character Recognition Problem

The character recognition problem is a very important part of the pattern recognition domain [Baldwin et al., 1997]. According to TIG basic principles, the outline of a character recognition system is the granulation of the alphabet information [Baldwin et al., 1997]. The fuzzy recognition technique is a very fruitful kind of granulation technique [Pedrysz, 2007]. Our implementation of the fuzzy recognition technique is based on the common fuzzy similarity relation (7.9) for a pair of data granules. Figure 7.18 shows the consecutive stages of character image granulation and classification [Butenkov, 2007]. After granulation of the source images by Algorithms 1 and 3 we obtain the granulated images shown in Figure 7.19. On the basis of the relation model (7.9) we calculate the symmetric matrix of membership functions for each granulated image. Figure 7.20 displays the 2-D matrix representations, where the brightest cells correspond to the maximum membership values and the black cells to the zero membership value [Cooke & Bez, 1984]. At the final stage, each reference character image matrix is matched with the examined matrix [Baldwin et al., 1997].
Figure 7.18 The source character images for the classification.
Figure 7.19 The examples of granulated character images.
Figure 7.20 The fuzzy relation matrix graphic representations for the granulated images.

Figure 7.21 Corrupted character images examples.
The very useful advantages of the new technique are its noise tolerance and its scalability with respect to the examined image size. Figure 7.21 shows examples of noisy and corrupted images that are not recognized by popular software [Butenkov, 2000]. In our papers different approaches to the recognition problem are examined, from a special kind of artificial neural network (the granulated perceptron) to a fuzzy recognizing system [Baldwin et al., 1997].
7.9.4 Color Images Granulation in the Color Space

Color is an important attribute of visual information, related to the chromatic attributes of images. Human color perception is concerned with physical phenomena, neurophysiological effects and psychological behavior [Wyszecki & Stiles, 1982]. Color is one of the most widely used visual features in image segmentation. It has been successfully applied to compare images, because it has very strong correlations with the underlying objects in an image. Moreover, the color feature is robust to background complications, scaling, orientation, perspective, and size of an image. Although any color space can be used to compute a color histogram, the HSV (hue, saturation, value), HLS (hue, lightness, saturation), and CIE color spaces (such as CIELAB, CIELUV) have been found to produce better results compared to the RGB space. Since these color spaces are visually (or perceptually) uniform compared to RGB, they are found to be more effective for measuring color similarities between images [Acharya & Ajoy, 2005]. To apply the provided image models and algorithms to color processing we must propagate the Cartesian granule models to orthogonal curvilinear coordinates [Butenkov, 2009].
7.9.5 Spatial Granules Models for the Curvilinear Coordinates

A very important property of the similarity relation (7.9) is its affinity. The proposed relation is invariant under all nonsingular coordinate transformations and, moreover, under certain kinds of nonlinear (orthogonal) transformations. This is very useful for classification in the color model space [Wyszecki & Stiles, 1982].
Polar coordinates in the plane
In 2-D polar coordinates (ρ, φ) we have the following granule model for the limiting values of the polar radius [¹ρ, ²ρ] and the polar angle [¹φ, ²φ]:

$$ {}^{+}G_2^{Polar}(A) = \begin{vmatrix} {}^{1}\varphi & {}^{1}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{2} \\ {}^{2}\varphi & {}^{1}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{2} \\ {}^{2}\varphi & {}^{2}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{2} \end{vmatrix}. \tag{7.20} $$
The presented model (7.20) is useful for the processing and recognition of images of centrally symmetric objects.
Cylindrical coordinates
Extending the model (7.20) to 3-D cylindrical coordinates (ρ, φ, z) we obtain the following granule model, defined by the limiting values of the variables:

$$ {}^{+}G_3^{Cyl}(A) = \begin{vmatrix} {}^{1}\varphi & {}^{1}z & {}^{1}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{-2} \\ {}^{2}\varphi & {}^{1}z & {}^{1}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{-2} \\ {}^{2}\varphi & {}^{2}z & {}^{1}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{-2} \\ {}^{2}\varphi & {}^{2}z & {}^{2}\rho & \dfrac{{}^{1}\rho + {}^{2}\rho}{-2} \end{vmatrix}. \tag{7.21} $$
An example of the application of (7.21) is the hue-based cylindrical color model (Figure 7.22(a)).
Conical coordinates
By analogy with (7.21), the base granule equation in conical coordinates (ρ, φ, z) is defined by the limiting values of the variables: the polar radius [¹ρ, ²ρ], the polar angles [¹φ, ²φ] and the altitudes [¹z, ²z]:

$$ {}^{+}G_3^{Cone}(A) = \begin{vmatrix} {}^{1}\varphi & {}^{1}z & {}^{1}\rho & K \\ {}^{2}\varphi & {}^{1}z & {}^{1}\rho & K \\ {}^{2}\varphi & {}^{2}z & {}^{1}\rho & K \\ {}^{2}\varphi & {}^{2}z & {}^{2}\rho & K \end{vmatrix}, \quad K = \frac{\bigl(({}^{1}z)^2 + {}^{1}z\,{}^{2}z + ({}^{2}z)^2\bigr)\bigl(({}^{1}\rho\,{}^{2}z)^2 - ({}^{2}\rho\,{}^{1}z)^2\bigr)}{6({}^{1}z\,{}^{2}z)^2({}^{2}\rho - {}^{1}\rho)}. \tag{7.22} $$
An example of the application of (7.22) is the HSV color model (Figure 7.22(b)). A more detailed example of HSV model granulation is provided in the next section.
Figure 7.22 Different color space models in Cylindrical and Conical coordinates: (a) – cylindrical color space, (b) – conical color space [Wikipedia, 2019].

Spherical coordinates
A very common and applicable coordinate system in modeling is the spherical coordinates (ρ, φ, θ). We provide the granule model for this coordinate system, defined by the limiting values [¹ρ, ²ρ], [¹φ, ²φ] and [¹θ, ²θ], as:
$$ {}^{+}G_3^{Sphere}(A) = \begin{vmatrix} {}^{1}\varphi & {}^{1}\theta & {}^{1}\rho & L \\ {}^{2}\varphi & {}^{1}\theta & {}^{1}\rho & L \\ {}^{2}\varphi & {}^{2}\theta & {}^{1}\rho & L \\ {}^{2}\varphi & {}^{2}\theta & {}^{2}\rho & L \end{vmatrix}, \quad L = \frac{\bigl(({}^{1}\rho)^2 + {}^{1}\rho\,{}^{2}\rho + ({}^{2}\rho)^2\bigr)\bigl(\cos({}^{1}\theta) - \cos({}^{2}\theta)\bigr)}{3({}^{1}\theta - {}^{2}\theta)}. \tag{7.23} $$
The provided models (7.20)-(7.23) are used for color image data transformation in the appropriate color spaces and for color image segmentation [Wyszecki & Stiles, 1982].
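As a hedged numerical check of the curvilinear models, the sketch below verifies that the polar determinant model (7.20) reproduces the area of an annular sector computed by direct integration; names and values are illustrative.

```python
# Check that the determinant model (7.20) equals the annular sector area.
import numpy as np

def polar_granule_area(r1, r2, f1, f2):
    c = (r1 + r2) / 2.0                       # third column of (7.20)
    m = np.array([[f1, r1, c], [f2, r1, c], [f2, r2, c]])
    return np.linalg.det(m)

def sector_area(r1, r2, f1, f2):
    return 0.5 * (f2 - f1) * (r2**2 - r1**2)  # direct integration

print(polar_granule_area(1.0, 2.0, 0.0, np.pi / 3))  # 1.5707963...
print(sector_area(1.0, 2.0, 0.0, np.pi / 3))         # same value
```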
7.9.6 Color Histogram for Color Images Segmentation

The color histogram is the most commonly used color feature in the HSV color model [Wyszecki & Stiles, 1982]. It has been found to be very effective in characterizing the global distribution of colors in an image, and it can be used as an important feature for image characterization. To define color histograms, the common granulating approach may be used [Yao, 2000]. The color space is quantized (granulated) into a finite number of discrete granules, and each of these granules becomes a bin in the histogram model. The color histogram is then computed by counting the volume (number of pixels) in each of these granules by the main model equations (7.20)-(7.23). Figure 7.23 shows the color histogram for a satellite image of sea and shore. To examine the color histogram of Figure 7.23(b) we must calculate a non-linear color distribution function. To improve the operations over the color histogram we can transform the source distribution to the HSV color space. Figure 7.24 shows the seashore color histogram after granulation in conical coordinates according to the model (7.22).
Figure 7.23 Google Earth image of seashore near Barcelona (a) and its RGB color histogram (b).
Figure 7.24 Different projections of the granulated seashore image color histogram in Conical coordinates: (a) – lateral view, (b) – above view.
A very useful advantage of the model of Figure 7.24 is the closedness of the color model space: the formally unlimited values of the RGB color model may produce artifact color values [Wyszecki & Stiles, 1982]. Using the color histogram, we can separate the scene image into domains that have similar color distributions (color image segmentation).
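A minimal sketch of the granulated color histogram described in this section follows; the number of bins per axis and the input scaling are illustrative assumptions.

```python
# Quantize (granulate) the HSV color space into a fixed number of granules
# per axis and count the pixels that fall into each bin.
import numpy as np

def hsv_color_histogram(hsv_image, bins=(8, 8, 8)):
    """hsv_image: H x W x 3 float array with channels scaled to [0, 1]."""
    pixels = hsv_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=bins, range=((0, 1),) * 3)
    return hist / pixels.shape[0]             # normalized bin volumes

# Two images can then be compared by any distribution distance over the bins.
```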
7.10 Conclusions

In this chapter we have introduced the concepts, underlying principles, and applications of image processing by the spatial granulation principles, building on the common Granular Computing approach [Bargiela & Pedrycz, 2002], [Yao, 2000]. The main results of our work are based on the common point of view provided by L.A. Zadeh's Information Granulation and its expansion to the geometric interpretation of common data representations in Cartesian and affine coordinates as well as in curvilinear coordinates [Butenkov, 2009]. Common spatial data granulation models may be used for wide classes of engineering applications related to graphical databases, graphical data processing and recognition problems. The introduced common granule model allows the use of the fundamentals of determinant theory for manipulating granules in the data space: granule separation, comparison and encapsulation [Butenkov et al., 2016]. For the mentioned applied domains very efficient greedy granulation algorithms have been designed [Krivsha et al., 2016]. On the basis of the common granule models and the appropriate algorithms, broad kinds of applied digital image processing and analysis problems have been solved, and several families of software systems have been developed and registered for these problems. Because the new models are invariant under coordinate transformations and linear in the model inputs, the fundamental principles of spatial granulation may also be used as a theoretical basis for medium models, fractal structures, snow crystals and other physical problems [Butenkov & Zhukov, 2009], [Butenkov, 2009].
References

Acharya, T. and Ray, A.K. Image Processing: Principles and Applications. Wiley-Interscience, 2005. 309 p.
Baldwin, J.F., Martin, T.P. and Shanahan, J.G. "Fuzzy logic methods in vision recognition". In: Fuzzy Logic: Applications and Future Directions Workshop, London, UK, 1997. P.300-316.
Bargiela, A. and Pedrycz, W. Granular Computing: An Introduction. Kluwer Academic Publishers, Boston, 2002. 247 p.
Butenkov, S. "Fuzzy geometry features for the classification and clusterization". Artificial Intelligence Magazine, No. 3, Moscow, 2000. P.129-133.
Butenkov, S. "Mathematic models for the intelligent multidimensional data analysis". Proceedings of the International Conference on Mathematic System Theory MTS2009, Moscow, January 26-30, 2009. P.93-101.
Butenkov, S. "The intellectual data analysis development for the information granulation". Proceedings of the IV International Conference "Integrated Models and Soft Computing in Artificial Intelligence", Kolomna, May 28-30, 2007, v.1. P.188-194.
Butenkov, S. "Towards the unified representation of vague multidimensional data". Proceedings of the AIS'03 Conference, Moscow, State Scientific Publishing, 2003, v.1. P.85-91.
Butenkov, S. "Uncertainty management in spatial digital databases". In: Proceedings of the IEEE-sponsored Conference on Artificial Intelligence (AIS 2004), Divnomorskoye, Russia, September 2003. P.85-91.
Butenkov, S. and Krivsha, V. "Classification using fuzzy geometric features". In: Proc. IEEE International Conference on Artificial Intelligence Systems "ICAIS 2002", Divnomorskoe, Russia, 5-10 September 2002. Computer Press, Los Alamos, CA, USA. P.89-91.
Butenkov, S. and Zhukov, A. "Information granulation on the basis of algebraic systems isomorphism". Proceedings of the International Algebraic Conference in memoriam of A.I. Kostrkin, Nalchik, June 12-18, 2009. P.106-113.
Butenkov, S. "Granular Computing in image processing and understanding". Proceedings of the IASTED International Conference on AI and Applications "AIA-2004", Innsbruck, Austria, February 10-14, 2004. P.78-83.
Butenkov, S., Krivsha, V. and Al Dhouyani, S. "Granular Computing in computer image perception: basic issues and Glass Box models". In: Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, AIA 2006, Innsbruck, 2006. P.462-467.
Butenkov, S., Zhukov, A., Nagorov, A. and Krivsha, N. "Granular Computing models and methods based on the spatial granulation". XII International Symposium "Intelligent Systems", INTELS'16, 5-7 October 2016, Moscow, Russia. Elsevier Procedia Computer Science 103 (2017). P.295-302.
Cooke, D.J. and Bez, H.E. Computer Mathematics. Cambridge University Press, 1984. 374 p.
Erwig, M. and Schneider, M. "Vague regions". 5th Int. Symp. on Advances in Spatial Databases. LNCS 1262. Springer-Verlag, 1997. P.298-320.
Kong, T. and Rosenfeld, A. "Digital topology: introduction and survey". Computer Vision, Graphics, and Image Processing, 48, 1989. P.357-393.
Krivsha, N., Krivsha, V., Beslaneev, Z.O. and Butenkov, S.A. "Greedy algorithms for Granular Computing problems in spatial granulation technique". XII International Symposium "Intelligent Systems", INTELS'16, 5-7 October 2016, Moscow, Russia. Elsevier Procedia Computer Science 103 (2017). P.303-307.
Marr, D. Vision. San Francisco, W.H. Freeman and Co., 1982.
Pedrysz, W. "Granular Computing – the emerging paradigm". Journal of Uncertain Systems, vol. 1, No. 1, 2007. P.38-61.
Pratt, W.K. Digital Image Processing. John Wiley & Sons, New York, 2007. 279 p.
Rogozov, Y., Beslaneev, Z., Nagorov, A. and Butenkov, S. "Data models based on the Information Granulation theory". Proceedings of the V International Conference "System Analysis and Information Technologies" SAIT-2013, Krasnoyarsk, September 19-25, 2013. P.24-33.
Rosenfeld, A. "Fuzzy digital topology". Information and Control, 40, 1979. P.76-87.
Rosenfeld, A. and Kak, A.C. Digital Picture Processing. Second Edition, Vol. 1 and 2. Academic Press, New York, 1982. 332 p.
Tamura, S., Seihaky, H. and Kokichi, T. "Pattern classification based on fuzzy relations". IEEE Trans. on Systems, Man and Cybernetics, vol. SMC-1, No. 1, 1971.
Winter, S. "Location-based similarity measures of regions". In: Fritch, D. (Ed.) ISPRS Commission IV Symposium "GIS – Between Vision and Applications", Stuttgart, Germany, 1999. P.669-676.
Worboys, M. "Imprecision in finite resolution spatial data". GeoInformatica, 2(3), 1998. P.257-279.
Wyszecki, G. and Stiles, W.S. Color Science. Second Edition. McGraw-Hill, NY, 1982. 359 p.
Yao, Y. "Rough sets, neighborhood systems, and granular computing". Proc. of the 1999 IEEE Canadian Conference on Electrical and Computer Engineering, Edmonton, Canada, May 9-12, 1999. IEEE Press. P.1553-1558.
Yao, Y. "Granular computing: basic issues and possible solutions". Proceedings of the 5th Joint Conference on Information Sciences, 2000. P.186-189.
Zadeh, L.A. "From computing with numbers to computing with words – from manipulation of measures to manipulation of perceptions". IEEE Trans. on Circuits and Systems, v.45, 1, 1999. P.105-119.
Zadeh, L.A. "Fuzzy sets and information granularity". In: Advances in Fuzzy Set Theory and Applications, Gupta, N., Ragade, R. and Yager, R. (Eds.), North-Holland, Amsterdam, 1979. P.3-18.
Zadeh, L.A. "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic". Fuzzy Sets and Systems, 4, 1997. P.103-111.
8 Inverse Synthetic Aperture Radars: Geometry, Signal Models and Image Reconstruction Methods

Andon D. Lazarov¹* and Chavdar N. Minchev²

¹ Nikola Vaptsarov Naval Academy, Bulgaria; K.N. Toosi University of Technology, Aerospace Engineering Faculty, Tehran, Iran
² National Military University, Artillery and Air Defense Faculty, Shumen, Bulgaria
Abstract
In the present work the geometry and geometrical transformations, waveforms and signal models, and imaging algorithms applied in inverse synthetic aperture radar (ISAR) are thoroughly discussed. The emphasis is on the 3-D and 2-D geometry and kinematics of the ISAR scenario, highly informative waveforms with linear frequency, stepped frequency, and phase code modulation, ISAR signal models based on these waveforms, and non-parametric spectral-correlation and parametric image reconstruction algorithms. To verify the geometry and signal models and the imaging algorithms, results of numerical experiments are provided.

Keywords: Inverse Synthetic Aperture Radar (ISAR), 3-D geometry, 3-D to 2-D geometric transformation, ISAR waveforms, ISAR image reconstruction algorithms
*Corresponding author: [email protected]

8.1 Introduction

Inverse Synthetic Aperture Radar (ISAR) is a powerful tool with military and civilian applications including detection, tracking, imaging and automatic recognition of non-cooperative targets, battlefield awareness, development and maintenance of low-observable aircraft and objects with special aerodynamical characteristics, as well as radio astronomical planetary
surveillance, measurements and imaging. ISAR utilizes the range-Doppler principle to obtain a desired image of the object. The image range resolution is realized by applying wide bandwidth linear frequency modulation, stepped frequency modulation and Barker phase code modulation waveforms, whereas the cross-range resolution is achieved by using the Doppler frequency spectrum generated by the relative displacement of the target with respect to the radar system of observation. For the last two decades many studies disclosing the theory, methods and algorithms applicable in ISAR technology have been published. A targeted survey of algorithms used in ISAR image processing for ship recognition and an evaluation of the current state of ship image reconstruction algorithms is provided in [Kurowska, 2016]. An ISAR algorithm based on the product high-order phase function for imaging of complex non-uniformly rotating targets is presented in [Bao et al., 2011]. Methods for high-resolution radar imaging of ship targets at sea based on the general ship 3-D swaying geometry, and the effects of ship sailing velocity and 3-D rotation movements on the synthetic aperture radar (SAR) imaging mode, are discussed in [Yabiao et al., 2007]. A non-uniform rotational motion estimation and compensation approach for ISAR imaging of maneuvering targets by application of the particle swarm optimization approach is presented in [Liu et al., 2018]. In [Toumi, Khenchaf, 2016] an approach is suggested for aircraft target recognition using ISAR images reconstructed by inverse fast Fourier transform and multiple signal characterization methods. A new adaptive ISAR imaging method based on the modified Fourier transform after translational motion compensation of a not too severely maneuvering target is presented in [Wang, Xu, Wu, Hu, Chen, 2018]. Equivalent rotation maneuvering is described by two variables, the relative chirp rate of the linear frequency modulated signal and the Doppler focus shift. In an ISAR imaging system for targets with complex motion, the azimuth echo signals of the target are always modeled as multi-component quadratic frequency modulation (QFM) signals. A novel estimation algorithm called a two-dimensional product modified parameterized chirp rate-quadratic chirp rate distribution for QFM signal parameter estimation is proposed in [Qu, Qu, Hou, Jing, 2018]. In [Toumi, Khenchaf, 2015] the general problem of treatment and exploitation of ISAR information for recognition and identification of air targets based on radar images is considered, applying optical single-channel correlation and multi-path correlation methods. In [Gromek et al., 2016] a bistatic passive SAR configuration utilizing commercial DVB-T transmitters as illuminators of opportunity is analyzed and experimentally verified. Due to motion characteristics, SAR/ISAR images are blurred in most cases.
In [Xiang et al., 2015] a digital processing scheme is applied for estimating the radial velocity in SAR spotlight processing mode and improving SAR image quality. ISAR imaging of a ship target using the product high-order matched-phase transform, based on a multi-component cubic phase signal model to estimate the parameters of each component, is analyzed in [Wang, Jiang, 2009]. The concept of a distributed multi-channel imaging radar system, a set of transmitters and receivers sparsely located to form multiple signal-processing channels with the application of matched filtering of orthogonal signals to reconstruct a target's 2-D reflectivity function, is considered in [Weidong, Xiaoyong, Xingbin, 2008]. Vector analysis to study the illuminated area and Doppler frequency characteristics of airborne synthetic aperture radar (SAR) is applied in [Wang, Yang, 2014]. A mathematical description of a moving target's range profile and SAR geometry, without any limitations on the model order, and the effects of accelerations, jerks and higher orders of target motion on SAR images are analyzed in [Deng et al., 2012]. A conception of employing a sparse reconstruction-based technique for thermal imaging defect detection is proposed in [Roy et al., 2019]. The subject of the present research is the geometry, signal models and imaging algorithms applied in radar inverse aperture synthesis. The emphasis is on the 3-D and 2-D geometry and kinematics of the ISAR scenario, highly informative waveforms with linear frequency, stepped frequency and phase code modulation, ISAR signal models based on these waveforms, and non-parametric spectral-correlation and parametric image reconstruction algorithms. To verify the geometry and signal models and the imaging algorithms, results of numerical experiments are provided. The rest of the chapter is organized as follows. In section 8.2, the 3-D geometry and the 3-D to 2-D geometrical transformation are described. In section 8.3, 2-D ISAR signal models and reconstruction algorithms are presented. In section 8.4, 3-D ISAR signal models, nonparametric and parametric reconstruction algorithms are discussed. In section 8.5, concluding remarks are made.
8.2 ISAR Geometry and Coordinate Transformations

8.2.1 3-D Geometry of ISAR Scenario

Consider an object depicted in a 3-D regular grid of reference points in a Cartesian coordinate system O'XYZ, moving on a rectilinear trajectory
at a constant vector velocity V in a coordinate system Oxyz (Figure 8.2.1). The object geometric centre, the origin of the 3-D coordinate grid, and the origin of the coordinate system O'XYZ coincide. The reference points are placed at each node of the 3-D grid and describe the shape of the object. The current distance vector Rijk(p) = [xijk(p), yijk(p), zijk(p)]T, measured from the ISAR placed in the origin of the coordinate system Oxyz to the ijk-th point of the object space, is determined by the vector equation
$$ \mathbf{R}_{ijk}(p) = \mathbf{R}_{0'}(p) + \mathbf{A}.\mathbf{R}_{ijk}, \tag{8.2.1} $$
where R0′(p) = [x0′(p), y0′(p), z0′(p)]T is the current distance vector of the object geometric centre defined by the expression R0′(p) = R0′(0) + V.Tp.p, where p = 1, …, N is the index of the emitted pulse, N is the full number of emitted pulses, R0′(0) = [x0′(0), y0′(0), z0′(0)]T is the distance vector to the geometric centre of the object space at the moment of the first emitted pulse, p = 0, Tp is the pulse repetition period, V = [V.cosα, V.cosβ, V.cosδ]T is the vector velocity, Rijk = [Xijk, Yijk, Zijk]T is the distance vector to the ijk-th reference point in the coordinate system O'XYZ, Xijk = i(ΔX), Yijk = j(ΔY) and Zijk = k(ΔZ) are the discrete coordinates of the ijk-th reference point in the coordinate system O'XYZ; ΔX, ΔY and ΔZ are the spatial dimensions of the 3-D grid cell; cos α, cos β and $\cos\delta = \sqrt{1 - \cos^2\alpha - \cos^2\beta}$ are the guiding cosines, and V is the module of the vector velocity. The indexes i, j, k are uniformly spaced on the coordinate axes O'X, O'Y, and O'Z; point O′(0) with coordinates x0′(0), y0′(0), z0′(0) defines the initial position of the origin of the coordinate system O'XYZ.

Figure 8.2.1 Geometry of 3-D ISAR scenario.

The elements of the transformation matrix A in equation (8.2.1) are determined by the Euler expressions
a11 = cos ψ cos φ − sin ψ cos θ sin φ;  a12 = −cos ψ sin φ − sin ψ cos θ cos φ;  a13 = sin ψ sin θ;
a21 = sin ψ cos φ + cos ψ cos θ sin φ;  a22 = −sin ψ sin φ + cos ψ cos θ cos φ;  a23 = −cos ψ sin θ;
a31 = sin θ sin φ;  a32 = sin θ cos φ;  a33 = cos θ.   (8.2.2)
Angles ψ, θ and φ define the space orientation of the 3-D coordinate grid, where the object is depicted and are calculated by the expressions
$$ \psi = \arctan\left(-\frac{A}{B}\right); \tag{8.2.3} $$

$$ \theta = \arccos\frac{C}{[(A)^2 + (B)^2 + (C)^2]^{1/2}}; \tag{8.2.4} $$

$$ \varphi = \arccos\frac{V_x B - V_y A}{\{[(A)^2 + (B)^2][(V_x)^2 + (V_y)^2 + (V_z)^2]\}^{1/2}}, \tag{8.2.5} $$
where A, B and C are the components of the normal vector to a reference plane N. The reference plane is defined by the origin of the coordinate
system Oxyz, point O, the object's geometric centre at the moment p = 0, point O′(x0′(0), y0′(0), z0′(0)), the trajectory line's equation
$$ \frac{x - x_{0'}(0)}{V_x} = \frac{y - y_{0'}(0)}{V_y} = \frac{z - z_{0'}(0)}{V_z}, \tag{8.2.6} $$
and described by the matrix equation
$$ \det \begin{bmatrix} x & y & z \\ x_{0'}(0) & y_{0'}(0) & z_{0'}(0) \\ V_x & V_y & V_z \end{bmatrix} = 0. \tag{8.2.7} $$
The equation (8.2.7) can be rewritten as
$$ Ax + By + Cz = 0, \tag{8.2.8} $$
where
$$ A = V_z y_{0'}(0) - V_y z_{0'}(0); \quad B = V_x z_{0'}(0) - V_z x_{0'}(0); \quad C = V_y x_{0'}(0) - V_x y_{0'}(0). \tag{8.2.9} $$
The module of the position vector Rijk(p) of the ijk-th scattering point is defined by the equation

$$ R_{ijk}(p) = \left[(x_{ijk}(p))^2 + (y_{ijk}(p))^2 + (z_{ijk}(p))^2\right]^{1/2}. \tag{8.2.10} $$
The expression (8.2.10) can be used for modelling ISAR signals reflected from the 3-D object space and for calculating the time delay of the signal reflected from the object scattering points; a minimal numerical sketch follows.
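The sketch below evaluates (8.2.1), (8.2.2) and (8.2.10) numerically; all parameter values are illustrative assumptions.

```python
# Range to one scattering point at pulse p, given the initial centre
# position, the velocity vector and the Euler matrix A of (8.2.2).
import numpy as np

def euler_matrix(psi, theta, phi):
    """Transformation matrix A with the elements of (8.2.2)."""
    return np.array([
        [np.cos(psi)*np.cos(phi) - np.sin(psi)*np.cos(theta)*np.sin(phi),
         -np.cos(psi)*np.sin(phi) - np.sin(psi)*np.cos(theta)*np.cos(phi),
         np.sin(psi)*np.sin(theta)],
        [np.sin(psi)*np.cos(phi) + np.cos(psi)*np.cos(theta)*np.sin(phi),
         -np.sin(psi)*np.sin(phi) + np.cos(psi)*np.cos(theta)*np.cos(phi),
         -np.cos(psi)*np.sin(theta)],
        [np.sin(theta)*np.sin(phi), np.sin(theta)*np.cos(phi), np.cos(theta)]])

def point_range(p, r0, v, tp, a, r_ijk):
    """|R_ijk(p)| by (8.2.1) and (8.2.10)."""
    r = r0 + v * tp * p + a @ r_ijk
    return np.linalg.norm(r)

a = euler_matrix(0.3, 0.2, 0.1)
print(point_range(10, np.array([0.0, 5e4, 3e3]),
                  np.array([-200.0, 0.0, 0.0]), 1e-3, a,
                  np.array([1.0, 2.0, 0.0])))
```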
8.2.2 3-D to 2-D ISAR Geometry Transformation

The ISAR signal reflected from the 3-D object is depicted in a 2-D coordinate system defined by range and cross-range (azimuth) axes. It means that the three-dimensional object geometry is projected onto a 2-D signal registration plane defined by the geometric centre, the trajectory line and the
origin of the coordinate system Oxyz. Hence, the 2-D signal registration plane coincides with the reference plane, described by the coordinates of point O, point O' and equation (8.2.6). Two 2-D coordinate systems are built in the 2-D signal registration plane: a coordinate system OXpYk tracking the object geometric centre and a coordinate system O′X_p̂Y_k̂ firmly tied to the object (Figure 8.2.2).

Figure 8.2.2 Two 2-D coordinate systems built in the ISAR signal plane depicted in the coordinate system Oxyz.

The axis OYp = OO' (Yp) with steering vector R0′(p), tracking the object geometric centre, can be described by the equation
$$ \frac{x}{x_{0'}(p)} = \frac{y}{y_{0'}(p)} = \frac{z}{z_{0'}(p)}. \tag{8.2.11} $$
The axis OXp (Xp) with a steering vector R0(p) is orthogonal to both the steering vector R0′(p) and the normal vector to the reference plane N, and can be described by the equation
$$ \frac{x}{x_0} = \frac{y}{y_0} = \frac{z}{z_0}, \tag{8.2.12} $$
where x0, y0 and z0 are the coordinates of the steering vector R0(p). Based on the orthogonality between R0(p), R0′(p), and N, the following expressions for x0, y0, and z0 hold:
$$ x_0 = y_{0'}(p)C - z_{0'}(p)B, \quad y_0 = z_{0'}(p)A - x_{0'}(p)C, \quad z_0 = x_{0'}(p)B - y_{0'}(p)A. \tag{8.2.13} $$
The axes of the coordinate system firmly tied to the object are defined by the following equations. Assume a steering vector ex = V of the coordinate axis O′X_p̂ (X_p̂); then its equation can be written as
$$ \frac{x - x_{0'}(0)}{e_{x1}} = \frac{y - y_{0'}(0)}{e_{x2}} = \frac{z - z_{0'}(0)}{e_{x3}}, \tag{8.2.14} $$
where ex1 = Vx, ex2 = Vy, ex3 = Vz are the coordinates of the steering vector ex of the axis X_p̂. Comparing (8.2.6) and (8.2.14) it can be seen that the axis O′X_p̂ (X_p̂) and the trajectory line have the same equation. Assume a steering vector ey orthogonal to ex for the orientation of the coordinate axis O′Y_k̂ (Y_k̂). Based on the orthogonality of the vectors N, ey, and ex it can be written ey = N × ex; then for the coordinates ey1, ey2, ey3 of the steering vector ey the following equations hold:
$$ e_{y1} = B e_{x3} - C e_{x2}, \tag{8.2.15} $$

$$ e_{y2} = C e_{x1} - A e_{x3}, \tag{8.2.16} $$

$$ e_{y3} = A e_{x2} - B e_{x1}. \tag{8.2.17} $$
Accordingly, the three-dimensional geometry of the object defined by the 3-D distance vector Rijk = [Xijk, Yijk, Zijk]T is transformed into a two-dimensional one, defined by the 2-D distance vector R_p̂k̂ = [X_p̂k̂, Y_p̂k̂]T. The coordinates X_p̂k̂ and Y_p̂k̂ of the ijk-th scattering point in the coordinate system O′X_p̂Y_k̂ are calculated by scalar products as follows:
$$ X_{\hat{p}\hat{k}} = \frac{\mathbf{e}_x \cdot (\mathbf{A}.\mathbf{R}_{ijk})}{|\mathbf{e}_x|}, \tag{8.2.18} $$

$$ Y_{\hat{p}\hat{k}} = \frac{\mathbf{e}_y \cdot (\mathbf{A}.\mathbf{R}_{ijk})}{|\mathbf{e}_y|}. \tag{8.2.19} $$
The vector (A.Rijk), transformed to the coordinate system Oxyz, can be expressed as

$$ (\mathbf{A}.\mathbf{R}_{ijk}) = X_{\hat{p}\hat{k}}\,\mathbf{e}_x + Y_{\hat{p}\hat{k}}\,\mathbf{e}_y + s.\mathbf{N}, \tag{8.2.20} $$

where $s = \dfrac{\mathbf{N} \cdot (\mathbf{A}.\mathbf{R}_{ijk})}{|\mathbf{N}|}$ is the normal separation (distance) between the signal plane and the ijk-th scattering point, N.(A.Rijk) is the scalar product, and A is the transformation matrix.
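The projection (8.2.18)-(8.2.20) can be sketched numerically as follows; the function name and argument layout are our assumptions.

```python
# Split each rotated point A @ Rijk into in-plane coordinates and a
# normal separation s, given the steering vectors ex, ey and the normal N.
import numpy as np

def project_to_signal_plane(ar, ex, ey, n):
    """ar = A @ Rijk. Returns (X_pk, Y_pk, s)."""
    x = ex @ ar / np.linalg.norm(ex)      # (8.2.18)
    y = ey @ ar / np.linalg.norm(ey)      # (8.2.19)
    s = n @ ar / np.linalg.norm(n)        # normal separation in (8.2.20)
    return x, y, s
```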
Decomposition of the Rectilinear Movement into Translation and Rotation
By introducing the 2-D tracking coordinate system O'XY, the rectilinear movement of the object is decomposed into translation of the geometric centre along the OY axis and rotation of the object with respect to its geometric centre, point O'. The translation of the geometric centre is defined by the distance vector R0′(p) = R0′(0) + V.T.p. The rotation of the object around its geometric centre is defined by the apparent rotation of the coordinate system O′X_p̂Y_k̂ firmly tied to the object with respect to the tracking coordinate system O'XY, whose origin is translated to the geometric centre of the object, point O'. The translation does not change the steering vectors of the coordinate axes X and Y. The rotation angle θ(p) between the axes Y and Y_k̂ is defined by the normalized scalar product of their steering vectors R0′(p) and ey, i.e.
$$ \theta(p) = \arccos \frac{x_{0'}(p)e_{y1} + y_{0'}(p)e_{y2} + z_{0'}(p)e_{y3}}{\sqrt{(e_{y1})^2 + (e_{y2})^2 + (e_{y3})^2}\;\sqrt{[x_{0'}(p)]^2 + [y_{0'}(p)]^2 + [z_{0'}(p)]^2}}. \tag{8.2.21} $$
The position vector of the (p̂k̂)-th reference point in the tracking coordinate system OXY can be written as
$$ \mathbf{R}_{\hat{p}\hat{k}}(p) = \mathbf{R}_{0'}(p) + \mathbf{B}.\mathbf{R}_{\hat{p}\hat{k}}, \tag{8.2.22} $$
where R_p̂k̂(p) = [X_p̂k̂(p), Y_p̂k̂(p)]T is the position vector of the p̂k̂-th reference point in the coordinate system O′XpYk, R0′(p) = [0, R0′(p)]T is the distance vector in the coordinate system OXY,

$$ \mathbf{B} = \begin{bmatrix} \cos\theta(p) & -\sin\theta(p) \\ \sin\theta(p) & \cos\theta(p) \end{bmatrix} $$

is the rotation matrix, and R0′(p) is the time dependent module of the distance vector to the object geometric centre that defines the radial displacement of the geometric centre. From equation (8.2.22) it follows that the rectilinear movement of the object is decomposed into a radial translation of the object geometric centre defined by R0′(p) = [0, R0′(p)]T and a rotation of the object around its geometric centre defined by B.R_p̂k̂. In order to determine the position of the ISAR image in the frame, a stationary 2-D coordinate system of observation Ox′y′ is built in the reference plane. The axis Ox′ is the crossing line between the reference plane and the plane Oxy, defined by the equations A.x + B.y + C.z = 0 and z = 0, or by the equation x/B = −y/A. Based on the orthogonality of Oy′, Ox′, and N, the equation of the axis Oy′ can be written as
$$ \frac{x}{A.C} = \frac{y}{B.C} = \frac{-z}{A^2 + B^2}. \tag{8.2.23} $$
The angle α between the Ox′ axis and the vector velocity V is defined by the expression
$$ \alpha = \arccos \frac{B V_x - A V_y}{\sqrt{A^2 + B^2}\,\sqrt{V_x^2 + V_y^2 + V_z^2}}. \tag{8.2.24} $$
The azimuth angle Φ(p) between the line of sight OO’(p) and Ox’ axis is defined by the expression
$$ \Phi(p) = \arccos \frac{x_{0'}(p)B - y_{0'}(p)A}{\sqrt{A^2 + B^2}\,\sqrt{[x_{0'}(p)]^2 + [y_{0'}(p)]^2 + [z_{0'}(p)]^2}}. \tag{8.2.25} $$
The initial azimuth angle Φ(0) (for p = 0) is calculated by
$$ \Phi(0) = \arccos \frac{x_{0'}(0)B - y_{0'}(0)A}{\sqrt{A^2 + B^2}\,\sqrt{[x_{0'}(0)]^2 + [y_{0'}(0)]^2 + [z_{0'}(0)]^2}}. \tag{8.2.26} $$
Point O′(0) defines the initial position of the object geometric centre; the line OO' is the initial line of sight of the target, known in advance. The trajectory line O'B, located at the velocity angle α with respect to the axis Ox′, denotes the half synthetic aperture length calculated from the half number of emitted pulses (measurements) N/2, the linear velocity V, and the period of measurement (pulse repetition period) Tp, i.e.
$$ O'B = \frac{1}{2} N V T_p. \tag{8.2.27} $$
The initial azimuth angle Φ = Φ(0) at the moment p = 0 is defined between the line of sight OO' and the axis Ox′. The line O'B is the reference line that defines the half synthetic aperture length measured from the entrance point O' to the point B. The angle γ/2 between OO' and OB denotes the angular size of the half synthetic aperture length. It is important to define the size of the synthetic aperture length, or the number of measurement pulses, that guarantees an azimuth resolution equal to the range resolution defined by the frequency bandwidth of the emitted pulses. The effective
y’
B
Yˆ k
A
Θ
Y O’
γ/2
V
Φ
C
O
Θ X
-α
x’
X pˆ
Figure 8.2.3 The coordinate systems O’XY and O ′X pˆYkˆ built in stationary 2-D ISAR imaging plane Ox′y′.
272 Recognition and Perception of Images synthetic aperture length Lef = AO′ is defined by geometrical relationships in the triangle OCB, where the angle OBC is equal to α – Φ – γ/2, i.e.
ΔLef = O′B.sin(α – Φ – γ/2).
(8.2.28)
Based on the antenna diffraction theory, the angular dimension of the synthetic aperture's pattern, or the angular azimuth resolution, for a wavelength λ is calculated by
Δφ = λ/(2·Lef) = λ/[N·V·Tp·sin(α − Φ − γ/2)].  (8.2.29)
Then the linear azimuth resolution is calculated by
ΔL = OO′·Δφ = OO′·λ·cos(γ/2)/[N·V·Tp·sin(α − Φ − γ/2)].  (8.2.30)
The number of pulses necessary to achieve a linear azimuth resolution equal to the range resolution is calculated by
N = OO′·λ·cos(γ/2)/[ΔL·V·Tp·sin(α − Φ − γ/2)].  (8.2.31)
Applying the sine theorem in the triangle OO′B yields
OO′ = O′B·sin(α − Φ − γ/2)/sin(γ/2).  (8.2.32)
Substituting (8.2.27) and (8.2.32) into (8.2.30) gives
ΔL = λ·cos(γ/2)/[2·sin(γ/2)].  (8.2.33)
Equation (8.2.33) is used for determination of the half angular size of the synthetic aperture for a given linear azimuth resolution, i.e.
γ/2 = arctan[λ/(2·ΔL)].  (8.2.34)
The inverse value of the composite angle (α − Φ − γ/2) is the rotation angle necessary to orientate the object image in the frame so that its longitudinal axis is parallel to the Ox′ axis in the coordinate system of imaging Ox′y′.
2-D ISAR Geometry

The ISAR scenario can be depicted in a 2-D coordinate system of observation. The current position of the ij-th scattering point Pij with respect to the origin, point O of the coordinate system Oxy, is defined by the vector equation
Rij(p) = R0′(p) + A.Rij,
(8.2.35)
where R0′(p) = [x0′(p), y0′(p)]ᵀ is the current position vector of the object's geometric centre, point O′, defined in the coordinate system Oxy; Rij = [Xij, Yij]ᵀ is the position vector of the ij-th point in the object's coordinate system O′XY; A is the transformation matrix defined by the angle α; [.]ᵀ denotes transposition. Based on the geometry in Figure 8.2.4, the vector equation (8.2.35) can be rewritten in matrix form
| xij(p) |   | x0′(p) |   |  cos α   sin α | | Xij |
| yij(p) | = | y0′(p) | + | −sin α   cos α | | Yij |,  (8.2.36)

Figure 8.2.4 2-D ISAR geometry.
where Xij = i·ΔX and Yij = j·ΔY are the discrete coordinates of the ij-th scattering point, i and j are the scattering point's indices on the O′X and O′Y axes, and ΔX and ΔY are the discrete distances between two adjacent scattering points measured on the O′X and O′Y axes. The module of the position vector Rij(p) of the ij-th scattering point is defined by

Rij(p) = {[xij(p)]² + [yij(p)]²}^(1/2).  (8.2.37)
The expression (8.2.37) is used to model the 2-D ISAR signal, reflected from the object.
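The following Python sketch illustrates how equations (8.2.35)-(8.2.37) can be used to compute the point distances Rij(p) for a rectilinearly moving object; the grid size, the speed and the initial position are assumed example values, not prescribed by the text.

import numpy as np

I, J = 64, 64                 # grid points on O'X and O'Y
dX = dY = 0.5                 # grid steps, m
V, Tp = 400.0, 7.47e-3        # speed (m/s) and pulse repetition period (s)
alpha = np.pi                 # angle between Ox and the object axes

# Object point coordinates in O'XY, centred on the geometric centre O'
X = (np.arange(I) - I / 2) * dX
Y = (np.arange(J) - J / 2) * dY
Xij, Yij = np.meshgrid(X, Y, indexing="ij")

# Transformation matrix A defined by the angle alpha, as in eq. (8.2.36)
A = np.array([[np.cos(alpha), np.sin(alpha)],
              [-np.sin(alpha), np.cos(alpha)]])

def R_ij(p, r0=np.array([0.0, 5e4]), v=np.array([-400.0, 0.0])):
    """Module of the position vector, eq. (8.2.37), at pulse number p."""
    r0p = r0 + v * Tp * p                            # geometric centre at moment p
    xy = A @ np.stack([Xij.ravel(), Yij.ravel()])    # rotated point offsets
    return np.hypot(r0p[0] + xy[0], r0p[1] + xy[1]).reshape(I, J)

print(R_ij(0).min(), R_ij(100).min())   # nearest-point range at two moments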
8.3 2-D ISAR Signal Models and Reconstruction Algorithms

8.3.1 Linear Frequency Modulation Waveform

A wideband linear frequency modulation (LFM) waveform is used to illuminate moving objects and to extract an image of high quality. The narrow autocorrelation function of the LFM waveform provides a high range resolution; however, its sidelobe levels are very high, which leads to overlapping of the signals reflected from adjacent closely spaced scattering points. The sequence of pulse LFM waveforms is written by the expression
s(t) = Σ_{p=1..N} ap·rect[(t − p·Tp)/T]·exp{−j[ω·(t − p·Tp) + b·(t − p·Tp)²]},  (8.3.1)
where ap is the amplitude and p the index of the emitted LFM pulse waveform, ω = 2π·c/λ is the angular frequency, Tp is the pulse waveform's repetition period, T is the time width of an LFM pulse waveform, c = 3 × 10⁸ m/s is the speed of light, b = π·ΔF/T is the LFM rate, and ΔF is the LFM pulse bandwidth providing the size of the range resolution cell, i.e., ΔR = c/(2·ΔF).
The components of the current time are the slow time p·Tp and the fast time t̂, i.e., t = t̂ + p·Tp, which can be expressed as t̂ = t mod(Tp). Thus, for each p, (8.3.1) can be rewritten as
s(t) = ap·rect(t/T)·exp[−j(ω·t + b·t²)],
(8.3.2)
where rect(t/T) = 1 if 0 ≤ t/T < 1 and 0 otherwise is the rectangular function. For the emitted waveforms the fast time in discrete form can be expressed as t = (k − 1)·ΔT, where ΔT is the sample time width, k = 1, …, K is the LFM sample index, and K = T/ΔT is the total number of LFM samples.
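As a minimal sketch of one emitted pulse of (8.3.2) in discrete fast time, the following Python fragment can be used; the waveform numbers repeat those of the experiments below, while the LFM rate is taken in the assumed form b = π·ΔF/T.

import numpy as np

c, lam = 3e8, 3e-2            # speed of light, wavelength (m)
T, K = 1e-6, 300              # pulse width (s), samples per pulse
dT = T / K                    # sample time width
dF = 1.5e8                    # LFM bandwidth, Hz
b = np.pi * dF / T            # LFM rate (assumed form pi * dF / T)
w = 2 * np.pi * c / lam       # angular carrier frequency

t = np.arange(K) * dT         # fast time t = (k - 1) * dT
pulse = np.exp(-1j * (w * t + b * t**2))   # a_p = 1, rect = 1 inside the pulse
print(pulse.shape, abs(pulse[0]))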
8.3.2 2-D LFM ISAR Signal Model - Geometric Interpretation of Signal Formation

According to the Huygens principle, the 2-D ISAR signal can be presented as an algebraic sum of the partial signals reflected by the illuminated object's scattering points, i.e.,

S(t, p) = Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T)·exp{−j[ω·t̂ij + b·(t̂ij)²]},  (8.3.3)
where t̂ij = t − tij, tij = 2·Rij/c is the signal time delay from the ij-th scattering point measured at the p-th moment, Rij = Rij(p) is the module of the distance vector to the ij-th scattering point, I and J are the numbers of scattering points on the OX and OY axes, and

rect[(t − tij)/T] = 1 if 0 ≤ (t − tij)/T < 1, and 0 otherwise.
In the case of ISAR signal formation, the fast time is expressed as t = tij,min + (k − 1)·ΔT, where tij,min = 2·Rij,min/c is the signal time delay from the nearest scattering point and Rij,min = Rij,min(p) is the distance to the nearest scattering point, measured at the p-th moment. Denote tij,max = 2·Rij,max/c as the signal time delay from the furthest scattering point, where Rij,max = Rij,max(p) is the distance to the furthest scattering point, measured at the p-th moment. Denote kij,min = tij,min/ΔT as the index of the nearest range bin, kij,max = tij,max/ΔT as the index of the furthest range bin, and L = kij,max − kij,min as the object's relative length in the range direction; then the range bin index k accepts values in the interval k = 1, …, K + L. Demodulate the ISAR signal by multiplication with the complex conjugated emitted LFM pulse waveform, i.e.,
Ŝ(t, p) = S(t, p) × rect(t/T)·exp[−j(ω·t + b·t²)],  (8.3.4)
which yields

Ŝ(t, p) = Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T)·exp{−j[(ω + 2bt)·tij − b·(t̂ij)²]},  (8.3.5)
where ω(t) = ω + 2bt is the current angular frequency of the emitted LFM waveform, which in discrete form is expressed as ωk = ω + 2b·(k − 1)·ΔT. Equation (8.3.5) in discrete form can be rewritten as
Ŝ(k, p) = Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T)·exp{−j[ωk·tij − b·(tij)²]}.  (8.3.6)
The quadratic term b·(tij)² is the along-track uncompensated slow time phase and can be removed from the sum, i.e.,
Ŝ(k, p) = exp[j·b·(tij)²]·Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T)·exp(−j·ωk·tij).  (8.3.7)
In the case of the plane electromagnetic wave approximation, the distance to the ij-th scattering point, Rij(p), is described by
Rij(p) = R(p) + Xij sin θ(p) + Yij cosθ(p).
(8.3.8)
Then (8.3.7) can be rewritten as

Ŝ(k, p) = exp[j·b·(tij)²]·Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T) × exp{−j·2ωk·[R(p) + Xij·sin θ(p) + Yij·cos θ(p)]/c},  (8.3.9)
where Xij = i·ΔX, Yij = j·ΔY are the coordinates of the (i, j)-th scattering point. Due to the short coherent processing interval, θ(p) is small, i.e., cos θ(p) ≈ 1 and sin θ(p) ≈ θ(p). Denote θ(p) = p·Δθ, where Δθ is a uniform angular step. The term 2ωk·R(p)/c does not depend on the object's discrete coordinates (i, j) and can be placed in front of the sum. Then it can be written
Ŝ(k, p) = exp[jΦ(p)] × Σ_{i=1..I} Σ_{j=1..J} aij·rect(t̂ij/T)·exp[−j(2π·p·i/N + 2π·k·j/K)],  (8.3.9a)
where the following substitutions hold: Φ(p) = b·(tij)² − 2ωk·R(p)/c, N = c/(2·ΔX·Δθ·fk), fk = ωk/2π, K = c/(2·Δfk·ΔY), Δfk = Δωk/2π. The ISAR signal Ŝ(k, p) can be considered as a 2-D Fourier transform of the 2-D discrete reflectivity function, expressed as Σij aij·δ(X − Xij)·δ(Y − Yij) and multiplied by the exponential term exp[j(Φ(p) − Φk(p))] induced by the object's mass centre radial displacement and the along-track phase variation. Hence, the image reconstruction procedure in its essence extracts the 2-D discrete reflectivity function of the object from the 2-D ISAR signal Ŝ(k, p) [Lazarov, 2011].
8.3.3 ISAR Image Reconstruction Algorithm

Image reconstruction is a procedure for extracting the object's 2-D discrete reflectivity function. The following computational stages are applied.

1. Range alignment. The object travels through several range bins during the observation time due to the translational motion of the mass centre along the line of sight (Figure 8.3.1).
Figure 8.3.1 Traveling of the object through range bins in the case of rectilinear movement.
The common range alignment algorithm puts each reference point in its initial range bin and compensates all signal phases induced by the radial displacement through the bins. In contrast, in the present work the range alignment is performed by renumbering the real range bin addresses where the signals are registered, using numbers in the interval from k = 1 to k = K + L, i.e., for each p the number k = 1 is assigned to the first range cell where the ISAR signal is registered.

2. Radial and higher order motion compensation. The phases of the ISAR signal S(k, p) induced by radial and higher order motion are compensated by its multiplication with the complex conjugated exponential term exp[−jΦ(p)], i.e.,
S(k , p) = S(k , p).exp[− jΦ( p)].
(8.3.10)
The exponential term exp[−jΦ(p)] stands for the higher order motion compensation function commonly called an autofocusing function. The function Φ(p) can be expressed as a polynomial
Φ(p) = a1(p.Tp) + a2(p.Tp)2 + … + am(p.Tp)m,
(8.3.11)
where the coefficients are calculated iteratively using an entropy image quality evaluation function.
3. Range compression by discrete inverse Fourier transform of the phase corrected complex signal matrix, i.e.,
S(j, p) = (1/K)·Σ_{k=1..K+L} S(k, p)·exp(j·2π·k·j/K),  (8.3.12)
4. Azimuth compression and complex image extraction by discrete inverse Fourier transform of the range compressed ISAR signal, i.e.,
I(i, j) = (1/(N·K))·Σ_{p=1..N} S(j, p)·exp(j·2π·p·i/N).  (8.3.13)
As a conclusion, after radial and higher order motion compensation, the final image is extracted by applying a 2-D Fourier transform of the ISAR signal.
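A minimal Python sketch of stages 3 and 4, i.e., equations (8.3.12) and (8.3.13), applied to an already aligned and phase corrected signal matrix, might look as follows; the input matrix here is a random placeholder, and numpy's ifft normalization (1/n per axis) is used in place of the exact 1/K and 1/(N·K) factors of the text.

import numpy as np

def reconstruct(S):
    """Range compression (8.3.12) over k, then azimuth compression (8.3.13) over p."""
    S_range = np.fft.ifft(S, axis=0)      # inverse DFT along the range samples k
    return np.fft.ifft(S_range, axis=1)   # inverse DFT along the pulses p

# Placeholder corrected signal matrix of shape (K + L, N) = (364, 500)
S = np.random.randn(364, 500) + 1j * np.random.randn(364, 500)
image = np.abs(reconstruct(S))
print(image.shape)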
8.3.4 Correlation - Spectral ISAR Image Reconstruction

The ISAR signal S(k, p) is described in the coordinates p = 1, …, N, the number of emitted pulse waveforms, and k = 1, …, K + L, the number of samples in the range direction for each emitted pulse waveform. While the range alignment of an ISAR signal is performed just after its detection in the first range cell, the fast time t is defined by t = (k − 1)·ΔT, where k = 1, …, K + L is the index of the range cell where the ISAR signal is registered. In the case where the minimal time delay tij,min(p) is assigned to one and the same first scattering point of the object during the inverse aperture synthesis, in order to achieve high-quality imaging only a 2-D inverse Fourier transform needs to be applied in the range and cross-range (azimuth) directions, described by (8.3.12) and (8.3.13). In the case where the minimal time delay tij,min(p) is not assigned to one and the same scattering point, i.e., the nearest first scattering point changes its address during the inverse aperture synthesis, a special 2-D autofocusing image reconstruction procedure is applied to achieve high-quality imaging. If the first scattering point of the object changes its address in the ISAR signal record during the inverse aperture synthesis, the ISAR image becomes unfocused. This smeared image can be improved by applying a specially designed autofocusing technique. The procedure requires definition of the ISAR signal reflected by the first scattering point. Image extraction from ISAR data includes two operations - range compression and azimuth compression. Instead of the inverse Fourier transform, the range compression can be realized by correlation of the ISAR signal (8.3.6) with the complex conjugated emitted LFM waveform, in compliance with the expression
Ŝ(q, p) = Σ_{k=1..K+L} Ŝ(k, p)·exp{j[ω·(k − q)·ΔT + b·((k − q)·ΔT)²]},  (8.3.14)
where Ŝ(k, p) is the ISAR signal (8.3.3) redefined in (k, p) coordinates, q = 1, …, 2(K + L), k = 1, …, K + L. Azimuth compression, or final complex image extraction, is accomplished by an inverse Fourier transform of the range compressed ISAR data for each q-th row, i.e.,
I(q, p̂) = (1/N)·Σ_{p=1..N} S(q, p)·exp(−j·2π·p·p̂/N),  (8.3.15)

where p = 1, …, N, q = 1, …, 2(K + L) − 1.
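A minimal sketch of this correlation-spectral chain in Python, with the waveform parameters of Experiment 1 below and a random placeholder in place of a real ISAR record, might look as follows; note that np.correlate conjugates its second argument, which realizes the conjugated reference of (8.3.14).

import numpy as np

c, lam = 3e8, 3e-2
T, K, L, N = 1e-6, 300, 64, 500
dT = T / K
b = np.pi * 1.5e8 / T              # assumed LFM rate, pi * dF / T
w = 2 * np.pi * c / lam

t = np.arange(K) * dT
emitted = np.exp(-1j * (w * t + b * t**2))   # emitted LFM waveform replica

S = np.random.randn(K + L, N) + 1j * np.random.randn(K + L, N)  # placeholder

# Range compression by correlation for every pulse p, eq. (8.3.14)
S_rc = np.stack([np.correlate(S[:, p], emitted, mode="full")
                 for p in range(N)], axis=1)

# Azimuth compression for each q-th row, eq. (8.3.15)
image = np.fft.ifft(S_rc, axis=1)
print(image.shape)   # ((K + L) + K - 1) correlation lags by N pulses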
8.3.5 Phase Correction Algorithm and Autofocusing

Phases of higher order induced by motion of higher order (acceleration, jerk, etc.) are compensated by a special phase correction algorithm called autofocusing. After the ISAR signal from the first (nearest) scattering point is located in the range bin k = 1, the phase history of this signal, Φ0(p), is determined for each p based on the first 3-5% of the measurements. Φ0(p) is approximated by a polynomial of higher order, i.e.,
Φ(p) = a1·(p·Tp) + a2·(p·Tp)² + a3·(p·Tp)³ + … + am·(p·Tp)^m.  (8.3.16)
The coefficients a1, a2, …, am are calculated by using the standard minimum mean square error approach. Then the approximation function Φ(p) is extrapolated over the whole interval p = 1, …, N. Hence, the phase correction function is defined as
F(p) = exp[j·ΔΦ(p)/A],  (8.3.17)
where ΔΦ(p) = Φ(p) − Φ0(p) is the phase difference between the measured and approximated phase history of the first scattering point, and A is a correction coefficient determined, for example, in the interval 1-5 by an iteration procedure maximizing special contrast cost functions. The first cost function C1(A) = M(A) is defined by the full number of pixels in the object space, M(A), whose intensities exceed the value 0.8·max[S(p̂, k̂)]. In case the cost function C1(A) = M(A) has more than one maximum, an additional estimation of the cost function at the maxima is needed. The second cost function aims to define the column-wise distribution of the image pixels whose intensities exceed the value 0.8·max[S(p̂, k̂)], as follows:
C2(A) = (N0.2)²/(p̂max − p̂min),  (8.3.18)
where N0.2 is the number of columns containing pixels whose number is greater than 0.2·Pmax, Pmax is the maximum number of such pixels counted in some p̂-th column, and p̂min and p̂max are the minimal and maximal indices of the columns where these pixels are placed. The autofocusing procedure consists in multiplying the ISAR signal S(k, p) by the focusing function, according to the expression
Ŝ(k, p) = S(k, p)·exp[j·ΔΦ(p)/A].  (8.3.19)
In order to obtain the final focused image of the observed object, correlation range compression and FFT azimuth compression must be applied to the phase corrected ISAR data. If the quality of the image is not satisfactory, the autofocusing procedure is repeated by varying the values of the coefficient A.
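A minimal Python sketch of this iteration, assuming the signal matrix S (of shape (K + L, N)), the phase difference dPhi over the N pulses, and a reconstruct() routine from the previous steps are available, might be:

import numpy as np

def autofocus(S, dPhi, reconstruct, A_grid=np.arange(1.0, 5.1, 0.5)):
    """Try trial coefficients A and keep the image maximizing cost C1(A)."""
    best_A, best_cost, best_img = None, -1, None
    for A in A_grid:
        F = np.exp(1j * dPhi / A)                    # focusing function, eq. (8.3.17)
        img = np.abs(reconstruct(S * F[None, :]))    # eq. (8.3.19) plus compression
        cost = np.sum(img > 0.8 * img.max())         # first cost function C1(A)
        if cost > best_cost:
            best_A, best_cost, best_img = A, cost, img
    return best_A, best_img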
Numerical Experiments

To substantiate the properties of the 2-D model of the ISAR signal and to verify the correctness of the created image reconstruction procedure, including range alignment, range compression, azimuth compression, and phase correction, numerical experiments are carried out.
Experiment – 1

It is assumed that the object depicted in the 2-D coordinate system O′XY is moving rectilinearly in the plane z = constant in the 3-D Cartesian coordinate system of observation Oxyz. Assume the first (nearest) scattering point of the object does not change its address during the inverse aperture synthesis. In this case the inverse synthetic aperture, i.e., the trajectory of the object, intersects the traverse. To realize this scenario the trajectory parameters are as follows: the module of the vector velocity V = 400 m/s; the vector velocity guiding angle α = π. The initial Cartesian coordinates of the geometric centre: x00′(0) = 0 m, y00′(0) = 49749.37 m, z00′(0) = 5 × 10³ m. Parameters of the LFM emitted waveform: wavelength λ = 3 × 10⁻² m, pulse repetition period for phase history data registration in compliance with the pulse-Doppler principle of the phase measurement T̂p = 10⁻⁵ s, ISAR pulse repetition period for inverse aperture synthesis Tp = 7.47 × 10⁻³ s, LFM pulse time width T = 10⁻⁶ s, number of LFM samples K = 300, carrier frequency f = 10¹⁰ Hz, LFM sample time width ΔT = T/K, LFM waveform bandwidth ΔF = 1.5 × 10⁸ Hz, LFM rate b = 9.42 × 10¹⁴, number of LFM emitted pulses during inverse aperture synthesis N = 500. Modeling the ISAR signal requires that the whole geometry of the object be enclosed in a 2-D regular rectangular grid, described in the coordinate system O′XY (Figure 8.3.2). Uniform scattering points are placed at each node of the regular grid. The dimensions of the 2-D grid's cell are ΔX = ΔY = 0.5 m. The numbers of reference points of the grid on the axes X and Y are I = 64 and J = 64, respectively. The mathematical expectation of the intensities of the scattering points placed on the object is aij = 0.1. The mathematical expectation of the normalized intensities of the scattering points placed out of the object is aij = 0.001.
Figure 8.3.2 2-D geometrical model of the aircraft MiG-35 in a 64×64 pixel grid.
Quadrature components of the ISAR signal in 3-D isometric projection are presented in Figure 8.3.3 (a, b). The real and imaginary parts of the ISAR signal after correlation range compression are presented as 3-D isometric projections in Figure 8.3.4 (a, b). The final object image extracted by the FFT operation is presented in Figure 8.3.5 (a) as normalized intensities of the scattering points (isometric projection). To attain unambiguity of the image, the number of coefficients of the FFT procedure must be equal to the number of pulses selected among the full number of signals registered during the inverse aperture synthesis. In Figure 8.3.5 (b) the ultimate image is presented as a pseudo-color map (256 levels of grey scale). The image is depicted in a grid of 128×128 pixels. Experiment – 1 is carried out in the case where the first scattering point does not change its address during the inverse aperture synthesis. Therefore, the measured and approximated values of the phase histories are equal, which means
Figure 8.3.3 LFM ISAR signal: real part (a), and imaginary part (b).

Figure 8.3.4 LFM ISAR signal after correlation range compression: real part (a), and imaginary part (b).

Figure 8.3.5 3-D isometric projection of the object image after FFT azimuth compression (a), ISAR image of the aircraft MiG-35 in 2-D pseudo-color map view (b).

ΔΦ(p) = 0. The autofocusing procedure does not influence the quality of the image because the image proves to be focused.
Experiment – 2

In most practical cases the final image of the object is unfocused due to the change of the first scattering point's address during the inverse aperture synthesis. It depends on the initial position of the observed object with respect to the ISAR location. To illustrate the defocusing phenomenon in the image reconstruction procedure, the following parameters are used for the inverse aperture synthesis. The module of the vector velocity V = 400 m/s; the guiding angle of the vector velocity α = π; the object geometric centre's initial coordinates: x00′(0) = 36319.24 m, y00′(0) = 71280.52 m, z00′(0) = 5 × 10³ m. ISAR parameters: wavelength λ = 3 × 10⁻² m, pulse repetition period for phase history registration T̂p = 10⁻⁵ s in compliance with the pulse-Doppler principle of the phase measurements, ISAR pulse repetition period Tp = 1.325 × 10⁻² s, time duration of the emitted LFM pulse T = 10⁻⁶ s, carrier frequency f = 10¹⁰ Hz, LFM signal bandwidth ΔF = 1.5 × 10⁸ Hz, LFM rate b = 9.42 × 10¹⁴, number of emitted waveforms during the inverse aperture synthesis N = 512. The modeling of the ISAR signal is based on the object geometry depicted in Figure 8.3.2. The time dimension of the ISAR signal is greater than the time width of the LFM emitted waveform and depends on the linear dimensions of the object along the line of sight. The real and imaginary parts of the ISAR signal in 3-D isometric projection are presented in Figure 8.3.6 (a) and (b).
Figure 8.3.6 LFM ISAR signal: real part (a), and imaginary part (b).
The time duration of the ISAR signal is always greater than the time dimension of the emitted waveform. This requires zero padding in the columns of the 2-D received ISAR signal in order to reach the number of elements in the column with the maximum of elements. The real and imaginary parts of the ISAR signal after correlation range compression are presented as 3-D isometric projections in Figure 8.3.7 (a) and (b). The final image of the aircraft MiG-35 as normalized intensities of the scattering points (isometric projection) extracted by the FFT operation is presented in Figure 8.3.8 (a), and as a grey scale pseudo-color map in Figure 8.3.8 (b). As can be seen, the aircraft image is unfocused due to the altering of the address of the first scattering point during the inverse aperture synthesis. This requires an autofocusing procedure to be applied. After correlation range compression, the trajectory of the first scattering point of the object is tracked, and
Figure 8.3.7 LFM ISAR signal after correlation range compression: real part (a), and imaginary part (b).

Figure 8.3.8 Aircraft MiG-35 final image in 3-D isometric projection (a), and in pseudo-color map (b) after FFT azimuth compression.
Φ0(p) is defined. An unwrapping procedure is performed over the phase history Φ0(p) from the first scattering point. The unwrapped phase history Φ0(p) and the approximated and extrapolated curve Φ(p) are shown with a solid line and a dashed line in Figure 8.3.9 (a) and (b - in zoom mode). The phase difference ΔΦ(p) is depicted in Figure 8.3.10. The increase of ΔΦ(p) after, approximately, the moment p = 130 illustrates the address altering of the first scattering point. In order to obtain a focused image, the ISAR signal is multiplied with the phase correction function exp[jΔΦ(p)/A]. The coefficient A is used for additional image focusing. The optimal value of A is defined iteratively by estimation of the first, C1(A), and second, C2(A), cost functions, described in section 8.3.5 and shown in Figure 8.3.11 (a), (b). As can be seen, a focused image is obtained for A = 1. The two cost functions have a maximum for the focusing parameter A = 1. It means that additional phase correction, varying the values of the coefficient A, need not be applied. The ultimate reconstructed and focused image of the observed object is presented in Figure 8.3.12. Experiment – 2 is implemented in the case where the first scattering point alters its address during the inverse aperture synthesis. Therefore, the measured and approximated values of the phase histories are not equal, i.e., ΔΦ(p) ≠ 0. To enhance the quality of the image, an autofocusing procedure is required. The highest resolution of the image is obtained for the correction coefficient A = 1. The distinguishing feature of the range compression procedure is that conjugation of the correlation function (emitted waveform) length and the
Figure 8.3.9 Unwrapped phase history Φ0(p) and extrapolated curve Φ(p) (a), (b - in zoom mode).
Figure 8.3.10 Phase difference ΔΦ(p).

Figure 8.3.11 First cost function C1 = M(A) (a), and second cost function C2(A) (b).
length of the ISAR signal is not necessary. In contrast, range compression by FFT requires the number of FFT coefficients and the number of samples of the ISAR signal to be conjugated. The advantage of the algorithm is that the image quality evaluation and the autofocusing procedure are performed automatically. The results of the present section can be applied to build an autofocusing system for ISAR imaging.
Figure 8.3.12 Aircraft MiG-35 - final image, obtained after the autofocusing procedure.
8.3.6 Barker Phase Code Modulation Waveform

The ISAR emits a series of Barker's phase code modulated pulse waveforms, which can be described by the expression [Lazarov, 2012]

s(t) = Σ_{p=1..N} a·rect[(t − p·Tp)/Tb]·exp{−j[ω·(t − p·Tp) + π·b(t − p·Tp − k̂·Tk)]},  (8.3.20)
where a is the amplitude of the emitted pulses, ω = 2π·c/λ is the signal angular frequency, t = (k − 1)·ΔT is the discrete current time, k = 1, …, K is the sample number of the emitted Barker PCM burst, k̂ = 1, …, K̂ is the segment index of the Barker PCM burst, K̂ = Tb/Tk = 13 is the number of segments of the Barker PCM burst, Tb is the time width of the Barker PCM burst, Tk is the segment time width of the Barker PCM burst, ΔT = Tk/m is the sample time width in the Barker PCM segment, m is the number of samples of each PCM segment, K = Tb/ΔT is the number of samples of the Barker PCM burst, and ΔR = c·Tk/2 is the dimension of the range resolution cell. For each p = 1, …, N the phase sign parameter b(t − p·Tp − k̂·Tk) for t = p·Tp + k̂·Tk is defined by
k̂:                    0  1  2  3  4  5  6  7  8  9  10  11  12
b(t − p·Tp − k̂·Tk):    0  0  0  0  0  1  1  0  0  1  0   1   0
Applying the physical optics principle of Huygens-Fresnel, the deterministic components of the ISAR signal for every p-th PCM burst are derived by
S(k, p) = Σ_{i=1..I} Σ_{j=1..J} aij·exp{−j[ω·t̂ij + π·b(t̂ij)]},  (8.3.21)
where aij is the reflection coefficient of the scattering point, t̂ij = tij,min(p) + (k − 1)·ΔT − tij is the current fast time, k = 1, …, K + L is the sample index of the entire ISAR signal for each p-th burst, and Rij(p) is the module of the distance vector to the ij-th scattering point. For t = p·Tp + k̂·Tk + tij, the Barker phase code parameter b(t − p·Tp − k̂·Tk − tij) of the ISAR signal accepts the following values:
k̂:                         0  1  2  3  4  5  6  7  8  9  10  11  12
b(t − p·Tp − k̂·Tk − tij):   0  0  0  0  0  1  1  0  0  1  0   1   0
8.3.7 Barker ISAR Image Reconstruction

The image reconstruction procedure is accomplished by the following stages: signal registration with range alignment; phase demodulation, range compression and azimuth compression; image quality evaluation, autofocus phase compensation and final ISAR image reconstruction.

a. The signal registration along with the range alignment is performed by renumbering the range cells. The parameter k, denoting the range bins where the ISAR signal sample is placed for each emitted burst, accepts values k = 1, …, K + L.
b. The phase demodulation and range compression are accomplished by a correlation procedure over the ISAR signal with a reference function complex conjugated to the emitted Barker code phase modulated signal. The correlation procedure can be written as
Ŝ(q, p) = Σ_{k=1..K+L} S(k, p)·S*(k − q, p),  (8.3.22)

where

S*(k − q, p) = Σ_{k=1..K+L} exp{j[ω·(k − q)·ΔT + π·b((k − q)·ΔT − k̂·Tk)]},  (8.3.23)
where p = 1, …, N, q = 1, …, K + L.

c. Azimuth compression is accomplished over the range compressed ISAR data by applying the fast Fourier transform, i.e.,
I(q, p̂) = Σ_{p=1..N} Ŝ(q, p)·exp(−j·2π·p·p̂/N),  (8.3.24)

where q = 1, …, K + L, p̂ = 1, …, N.
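A minimal Python sketch of stages (b) and (c), using the 13-element Barker phase signs from the table above, a hypothetical number of samples per segment, and a random placeholder in place of a real ISAR record, might look as follows:

import numpy as np

barker13 = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0])
m = 4                                    # assumed samples per code segment
b_k = np.repeat(barker13, m)             # phase sign per fast-time sample
reference = np.exp(-1j * np.pi * b_k)    # emitted PCM waveform (baseband)

K, L, N = b_k.size, 32, 500
S = np.random.randn(K + L, N) + 1j * np.random.randn(K + L, N)  # placeholder

# Range compression, eq. (8.3.22): np.correlate conjugates its 2nd argument
S_rc = np.stack([np.correlate(S[:, p], reference, mode="full")
                 for p in range(N)], axis=1)

# Azimuth compression, eq. (8.3.24): forward DFT along the pulses p
image = np.fft.fft(S_rc, axis=1)
print(image.shape)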
8.3.8 Image Quality Criterion and Autofocusing

As already stated, in order to realize image autofocusing, the phase history of the first scattering point is measured, unwrapped and polynomially fitted (approximated), and a phase correction function is defined as the difference between the unwrapped phase history and its polynomial approximation. Instead of this heavy procedure, in the present section it is suggested to approximate the phase correction function Φ0(p)/A by the equation of a breaking line, defined by
Φ(p) = 0 for p ≤ p0, and Φ(p) = (p − p0)·tan β for p > p0,  (8.3.25)
where p0 = 1, …, N is the time instant of the inverse aperture synthesis at which the first scattering point changes its address, and tan β is the slope coefficient of the line of the phase correction function. The slant angle β of the line (in radians) accepts values in the interval from 0.01 to 1 with a step Δβ = 0.01. The quality of the ISAR image can be estimated automatically using an image quality criterion defined by a special cost function. The cost function C1(p0, β) is defined by the full number of pixels in the object space, M, whose intensities exceed the value 0.8·max[S(p̂, q)].
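A minimal Python sketch of the breaking-line correction (8.3.25) and the (p0, β) grid search, assuming an extract_image() routine that performs the correlation range compression and FFT azimuth compression of the previous section, might be:

import numpy as np

def breaking_line(N, p0, beta):
    """Phase correction function of eq. (8.3.25)."""
    p = np.arange(1, N + 1)
    return np.where(p <= p0, 0.0, (p - p0) * np.tan(beta))

def search(S, extract_image, N=500):
    """Grid search over (p0, beta) maximizing the cost C1(p0, beta)."""
    best = (None, None, -1)
    for p0 in range(1, N):
        for beta in np.arange(0.01, 0.101, 0.01):
            Phi = breaking_line(N, p0, beta)
            img = np.abs(extract_image(S * np.exp(1j * Phi)[None, :]))
            cost = np.sum(img > 0.8 * img.max())   # C1(p0, beta)
            if cost > best[2]:
                best = (p0, beta, cost)
    return best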
Numerical Experiment

To verify the properties of the developed 2-D model of the ISAR trajectory signal with Barker code phase modulation and to prove the correctness of the developed digital signal image reconstruction procedures, including range compression, azimuth compression, and the autofocus procedure, a numerical experiment is carried out. Assume the object detected in the 2-D coordinate system O′XY is moving rectilinearly in the coordinate system Oxy. The object trajectory parameters: the module of the vector velocity V = 600 m/s; the guiding angle of the vector velocity α = π; the coordinates of the mass centre at the moment p = N/2: x00(0) = 0 m, y00(0) = 5 × 10⁴ m. The ISAR parameters: wavelength λ = 3 × 10⁻² m; carrier frequency f = 10 GHz; burst repetition period for aperture synthesis Tp = 5 × 10⁻³ s; PCM burst segment index k̂ = 1, …, 13; number of PCM segments K̂ = 13; PCM sample time width ΔT = 3.3 × 10⁻⁹ s; PCM burst time width Tb = 42.9 × 10⁻⁹ s; range resolution ΔR = 0.5 m; number of PCM burst samples K = 13; number of emitted bursts during the inverse aperture synthesis N = 500. The object geometry is depicted in a 2-D regular rectangular grid, described in the coordinate system O′XY. The dimensions of the grid's cell are ΔX = ΔY = 0.5 m. The number of reference points on the axis X is I = 64, on the axis Y it is J = 64. Isotropic scattering points are placed at each node of the regular grid. The mathematical expectation of the normalized intensities of the scattering points placed on the object is aij = 0.1. The mathematical expectation of the normalized intensities of the scattering points placed out of the object is aij = 0.001. The computational results for the real and imaginary components of the ISAR signal with Barker phase code modulation in the form of 3-D isometric projections are presented in Figure 8.3.13 (a), (b). The Barker PCM ISAR signal (isometric projection) after correlation range compression, real part (a) and imaginary part (b), is depicted in Figure 8.3.14 (a) and (b).
Figure 8.3.13 ISAR signal with Barker's phase code modulation in 3-D isometric projection: real part (a), and imaginary part (b).
Figure 8.3.14 Barker's PCM ISAR signal (isometric projection) after correlation range compression: real part (a), and imaginary part (b).
The final aircraft F-18 image extracted by FFT azimuth compression, as a distribution of normalized intensities of the scattering points (isometric projection) and as a 2-D pseudo-color map (256 levels of grey color scale), is presented in Figure 8.3.15 (a) and (b), respectively. The ISAR image of the object is depicted in a grid of 64×64 pixels. In order to achieve imaging of high resolution, the special phase compensation technique and the image quality evaluation described in section 8.3.8 are applied. The phase correction function Φ(p) is calculated varying the parameters p0 and β. The parameter p0 accepts values from p0 = 1 to p0 = 499, and for each p0 the parameter β varies from 0.01 to 0.1 with a step Δβ = 0.01. A family of phase correction functions with parameters p0 = 100, p0 = 250, p0 = 400, and β = 0.01 ÷ 0.1, Δβ = 0.01, is shown in Figure 8.3.16 (a) and (b).
Figure 8.3.15 ISAR image (isometric projection) after FFT azimuth compression (a), and reconstructed blurred image (aircraft F-18) in a view of 2-D pseudo-color map (b).
Figure 8.3.16 Family of phase correction functions (a) Φ(p), and three cases of the image cost function (b) C1(p0,β): p0 = 100, p0 = 250, p0 = 400.
First, an autofocus procedure is applied. Second, correlation range compression of the phase corrected data is executed. Third, cross-range (azimuth) compression by Fourier transform is implemented to extract a final image. The quality of the extracted images is evaluated iteratively by computation of the image cost function C1(p0, β). In Figure 8.3.17 (a) a 3-D view of the surface described by the values of the image cost function C1(p0, β) is pictured. A 2-D pseudo-color map of the same cost function is shown in Figure 8.3.17 (b). It can be seen that the optimal parameters, where the cost function that estimates the quality of the image has a maximum, are p0 = 250 and β = 0.06.
Figure 8.3.17 3-D view of the surface described by the image cost function C1(p0,β) (a) and pseudo-color map of C1(p0,β) (b).
How the parameter β influences the image quality can be seen in Figures 8.3.18 - 8.3.19. The different images (Figures 8.3.18 - 8.3.19), extracted by step-varying the angle β in the interval from 0.01 to 0.1, exhibit an azimuth movement of the unfocused ISAR image from left to right until the image becomes focused. Based on the linearity of the correlation and the Fourier transform, the proposed cost function C1(p0,β) estimates the power distribution of the reconstructed ISAR image and concentrates it in the image pixels with maximum intensity, reducing the noise level.

Figure 8.3.18 Aircraft F-18 image: p0 = 250, β = 0.01 (a), β = 0.02 (b).

Figure 8.3.19 Aircraft F-18 image: p0 = 250, β = 0.05 - unfocused (a), β = 0.06 - focused (b).

As a conclusion, in the present section 2-D ISAR geometry and the rectilinear motion of the object are considered. A 2-D model of the ISAR signal with Barker's
phase code modulation is developed. For Barker phase code demodulation and range compression over the time record of ISAR data, a correlation procedure with a reference function equal to the Barker phase code modulated emitted waveform is performed. A fast Fourier transform for azimuth compression, which is the ultimate image reconstruction procedure over the range compressed ISAR data, is implemented. The operation of range alignment is accomplished by immediate renumbering of the ISAR signals in the range bins. To verify the ability and correctness of the proposed ISAR signal model and the developed algorithms, a numerical experiment is carried out.
8.4 3-D ISAR Signal Models and Image Reconstruction Algorithms

8.4.1 Stepped Frequency Modulated ISAR Signal Model

Stepped frequency modulated (SFM) bursts of waveforms are used to illuminate a moving object and to extract images with high range resolution. Consider a monostatic ISAR system. The ISAR system emits a series of SFM bursts, each of which is described by the expression

s(t) = Σ_{p=1..N} Σ_{m=1..M} rect[(t − tmp)/Tb]·exp(j·2π·fm·t),  (8.4.1)
where

rect[(t − tmp)/Tb] = 1 for 0 < (t − tmp)/Tb ≤ 1, and 0 otherwise;  (8.4.2)
fm = f0 + (m − 1)·Δf is the frequency of the pulse centred at the time tmp, defined by tmp = ((m − 1) + (p − 1)·M)·T; Δf is the frequency difference for each step in the pulse burst; Tb is the time duration of the burst; M is the number of pulses in each burst; p = 1, …, N is the index of the emitted burst; T is the segment time width in the burst. The ISAR signal reflected by the ijk-th scattering point of the object is described by
sijk(p, m) = aijk·rect[t̂ijk(p)/Tb]·exp[j·2π·f(m−r+1)·t̂ijk(p)],  (8.4.3)

where rect[t̂ijk(p)/Tb] = 1 if 0 ≤ t̂ijk(p)/Tb < 1, and 0 otherwise. The search procedure continues while

H(Φ̂q,r(p)) − H(Φ̂q,r+1(p)) > 0,  (8.4.16)

and stops if

H(Φ̂q,r(p)) − H(Φ̂q,r+1(p)) < 0.  (8.4.17)
In order to calculate a more accurate value of the phase correction line slope, the search procedure continues as follows. Let αqr be the angle of the phase correction line and Φ̂qr(p) the phase correction function yielding an image with minimal entropy. The search for the precise value of the angle of the phase correction line continues with the angle α′qr that accepts values as follows:
α′qr = αqr ± r′·π/1800,  (8.4.18)
where r′ = 1, …, 10. Then the exponential phase correction functions accept the values
Φ̂′qr(p) = 0 if p ≤ q, and Φ̂′qr(p) = (p − q)·tan α′qr otherwise.  (8.4.19)
After each determination of α′qr a normalized image and an entropy image function are calculated. The optimal value of α′qr corresponds to the minimal entropy image function.
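A minimal Python sketch of the entropy-driven slope refinement of equations (8.4.18)-(8.4.19), with an assumed power-normalized entropy definition (the text does not spell the normalization out) and an assumed extract_image() routine, might be:

import numpy as np

def image_entropy(img):
    """Entropy of the normalized image power (one common definition, assumed)."""
    P = np.abs(img) ** 2
    P = P / P.sum()
    P = P[P > 0]
    return -np.sum(P * np.log(P))

def refine_angle(S, extract_image, alpha_qr, q, N=128):
    """Refine the phase correction line slope in steps of pi/1800, eq. (8.4.18)."""
    p = np.arange(1, N + 1)
    best_a, best_H = alpha_qr, np.inf
    for r in range(-10, 11):            # r' = 1..10, both signs of eq. (8.4.18)
        a = alpha_qr + r * np.pi / 1800
        Phi = np.where(p <= q, 0.0, (p - q) * np.tan(a))   # eq. (8.4.19)
        H = image_entropy(extract_image(S * np.exp(1j * Phi)[None, :]))
        if H < best_H:
            best_a, best_H = a, H
    return best_a, best_H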
Numerical Experiments

To verify the properties of the developed 3-D model of the ISAR trajectory signal with stepped frequency modulation and to prove the correctness of the developed digital signal image reconstruction procedures, including range alignment, range compression and azimuth compression, a numerical experiment is carried out. It is assumed that the object is moving rectilinearly in a 3-D Cartesian coordinate system of observation Oxyz. The trajectory parameters are: the module of the vector velocity V = 300 m/s; α = 1.17π; β = 0.67π; γ = arccos√(1 − cos²α − cos²β). The initial coordinates of the mass centre: x00(0) = 0 m; y00(0) = 2 × 10⁴ m; z00(0) = 3 × 10³ m. The parameters of the ISAR emitted SFM bursts are as follows. The wavelength is λ = 3 × 10⁻² m. The time duration of the emitted SFM burst is T = 10⁻⁶ s. The number of pulses in an SFM emitted burst is M = 128. The carrier frequency is f = 10¹⁰ Hz. The time duration of the SFM pulse is ΔT = 0.781 × 10⁻⁸ s. The total bandwidth of the SFM signal is ΔF = 1.5 × 10⁸ Hz. The number of emitted bursts during the inverse aperture synthesis is N = 128. The frequency difference for each step in the burst is Δf = 1.172 × 10⁶ Hz. The burst repetition period is Tp = 15.6 × 10⁻³ s. In order to create the ISAR signal model, the whole geometry of the object needs to be enclosed in a 3-D regular grid, described in the coordinate system O′XYZ (Figure 8.4.1). The dimensions of the grid's cell are ΔX = ΔY = ΔZ = 0.5 m. The numbers of the grid's reference points on the axes O′X, O′Y and O′Z are I = 64, J = 64, and K = 5, respectively. The real and imaginary parts of the ISAR signal are presented in Figure 8.4.2. The real and imaginary parts of the ISAR signal after demodulation are presented in Figure 8.4.3. The isometric 3-D ISAR image described by the intensities of the scattering points is presented in Figure 8.4.4.
Figure 8.4.1 3-D geometry of the aircraft F-18.
Figure 8.4.2 Real and imaginary parts of the ISAR signal.
Figure 8.4.3 Real and imaginary parts of the ISAR signal after demodulation.
Figure 8.4.4 Isometric 3-D ISAR image.
The final ISAR image and the power normalized ISAR image (pseudo-color maps) are presented in Figure 8.4.5. Different stages of the ISAR image autofocusing for the parameter p = 58 are presented in Figures 8.4.6 - 8.4.7. The autofocusing procedure starts at q = 1 and ends at q = 127. The parameter r accepts values from 1 to 89. In each of the figures (Figures 8.4.7 - 8.4.8) a phase correction function, a focused image, an entropy function and the entropy difference are presented. A focusing displacement of the ISAR image in the frame can be noticed. The behaviour of the curve of the entropy function and the average entropy difference corresponds to the process of achieving a focused image. The bold line in the figures defines the optimal phase correction function, obtained for p = q = 58 and r = 46. The thin line defines the current phase correction function. The ISAR image with minimal entropy (H = 6.198) and zero entropy difference is of the best quality. A surface of the entropy function used as a cost function in the autofocusing procedure of the ISAR image is depicted in Figure 8.4.9. The minimal value of the entropy function is Hmin = 6.198 for p = 58 and r = 46. As a conclusion, in this section, based on the analytical geometrical approach, 3-D ISAR geometry is described and a three-dimensional discrete model of ISAR signals with stepped frequency modulation is created. The object, moving rectilinearly, is depicted in a 3-D regular grid of isotropic scattering points.
Figure 8.4.5 Final image (a) and power normalized image (b).

Figure 8.4.6 Autofocusing by p = q = 58 and r = 30.
The image reconstruction procedure includes both range and azimuth compression. A cross-correlation procedure over the stepped frequency modulated ISAR signal is described and applied to realize the range compression, and an FFT to realize the azimuth compression. In order to achieve an image of high quality, a range alignment and a phase correction entropy autofocusing technique are applied. To illustrate the computational capability of the 3-D ISAR signal model and the image reconstruction procedure, including the entropy minimization autofocusing technique, a numerical experiment is performed. The final reconstructed ISAR image with high resolution is obtained by applying a phase correction autofocusing function with parameters p = q = 58 and r = 46.
Figure 8.4.7 Autofocusing by p = q = 58 and r = 46.
8.4.3 Complementary Codes and Phase Code Modulated Pulse Waveforms

a. Complementary codes synthesis. Empirically obtained complementary code sequences with an ideal autocorrelation function with zero side lobes are suggested in [Bedzhev B. Y., Tasheva Zh. N., Mutkov V. A., 2003]. The synthesis of complementary codes of unlimited length requires initial first couples of codes with dimension n and second couples of codes with dimension r to be created. The first couple of codes, presented as row vector matrices, has the form
A = [A(1) A(2) ... A(n)]; B = [B(1) B(2) ... B(n)].  (8.4.20)
Figure 8.4.8 Autofocusing by p = q = 58 and r = 60.
The second couple of codes, presented as row vector matrices, has the form
C = [ C(1) C(2) ... C(r)]; D = [ D(1) D(2) ... D(r)]. (8.4.21)
To generate the elements of new codes with unlimited length, two algorithms can be applied.

Algorithm 1:
Anew = [C(1)·A C(2)·A ... C(r)·A D(1)·B D(2)·B ... D(r)·B];
Bnew = [D*(r)·A D*(r−1)·A ... D*(1)·A −C*(r)·B −C*(r−1)·B ... −C*(1)·B].  (8.4.22)
Figure 8.4.9 A surface of the entropy function used as a cost function in the autofocusing of the ISAR image (Hmin = 6.198 for p = 58 and r = 46).
Algorithm 2:
A′new = [C(1)·A D(1)·B C(2)·A D(2)·B ... C(r)·A D(r)·B];
B′new = [D*(r)·A −C*(r)·B D*(r−1)·A −C*(r−1)·B ... D*(1)·A −C*(1)·B],  (8.4.23)

where the asterisk (*) denotes a complex conjugate value.

b. Complementary phase code modulated pulse waveforms. The ISAR emits two consecutive series of complementary phase code modulated (CPCM) pulses; each p-th CPCM pulse is described by [Lazarov, Minchev, 2009]
S1(p) = rect(t/T)·exp{−j[ω·t + π·b1(t)]};  S2(p) = rect(t/T)·exp{−j[ω·t + π·b2(t)]},  (8.4.24)

rect(t/T) = 1 if 0 ≤ t/T < 1, and 0 otherwise,  (8.4.25)
where ω = 2π·c/λ is the angular frequency; t = (k − 1)·ΔT is the discrete current time; k = 1, …, K is the segment index of the emitted CPCM pulse waveform; K = T/ΔT is the full number of phase pulses in the emitted CPCM pulse waveform; ΔT is the time width of the CPCM segment; T is the time duration of the CPCM pulse waveform; ΔR = c·ΔT/2 is the dimension of the range cell. For each emitted pulse p = 1, …, N the phase sign parameters b1 and b2 are defined as follows:
k:      0  1  2  3  4  5  6  7  8  9
b1(k):  1  1  1  1  1  0  1  0  0  1

k:      0  1  2  3  4  5  6  7  8  9
b2(k):  1  1  0  0  1  1  1  0  1  0
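A short Python check of the complementary property of this pair, mapping the phase signs 0/1 to +1/−1 and summing the two aperiodic autocorrelations, can be sketched as follows; the sum should vanish at every non-zero lag:

import numpy as np

b1 = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
b2 = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 0])
s1 = (-1.0) ** b1           # map 0 -> +1, 1 -> -1
s2 = (-1.0) ** b2

acf = lambda s: np.correlate(s, s, mode="full")   # aperiodic autocorrelation
total = acf(s1) + acf(s2)
print(total.astype(int))    # zeros everywhere except the main peak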
8.4.4 ISAR Complementary Phase Code Modulated Signal Modeling

a. ISAR CPCM signal formation. The process of ISAR signal formation for the first of the two complementary codes is presented in Figure 8.4.10. The parameters tijk,min, tijk,1, tijk,2, tijk,3, tijk,4 are the time delays of the ISAR signal reflected from the nearest, first, second, third and fourth scattering points, respectively. Each phase complementary code consists of 7 phase pulses. The number of samples of the ISAR signal reflected from each scattering point is equal to the number of phase pulses. The parameter tijk denotes the current sample number at which the ISAR signal reflected from the ijk-th scattering point is detected. At the moment tijk,min the ISAR signal from the nearest scattering point, placed in the range bin kijk = 1, is detected. In the range bin k = 1 only the ISAR signal from the nearest scattering point is present. At the moments tijk,1 and tijk,2 the ISAR signals from the first and second scattering points, placed in the range bin kijk = 2, are detected.
Figure 8.4.10 ISAR signal formation with complementary phase code modulation (first code).
In the current bin k = 2 the three signals (from the nearest, first and second scattering points) interfere and take part in the sum. At the moments tijk,3 and tijk,4 the ISAR signals from the third and fourth scattering points, placed in the range bin kijk = 3, are detected. At the current sample index k = 3 five signals (from the nearest, first, second, third and fourth scattering points) interfere and take part in the sum. At the current sample index k = 8 the ISAR signal from the nearest scattering point goes out of the sum, i.e., does not take part in the sum anymore. At the current sample index k = 9 the ISAR signals from the first and second scattering points go out of the sum. For each p-th emitted CPCM pulse waveform the deterministic components of the ISAR signal, consisting of two complementary sequences and reflected by the object scattering points, have the form
S10(p, k) = Σ_{i=1..I} Σ_{j=1..J} Σ_{k=1..K} aijk·rect[(t − tijk(p))/T]·exp{−j[ω·(t − tijk(p)) + π·b1(t − tijk)]},  (8.4.26)

S20(p, k) = Σ_{i=1..I} Σ_{j=1..J} Σ_{k=1..K} aijk·rect[(t − tijk(p))/T]·exp{−j[ω·(t − tijk) + π·b2(t − tijk)]},  (8.4.27)
where

rect[(t − tijk(p))/T] = 1 if 0 ≤ (t − tijk(p))/T < 1, and 0 otherwise;

t = tijk,min(p) + (k − 1)·ΔT − tijk is the current fast time parameter, satisfying the inequality t = tijk,min(p) + (k − 1)·ΔT − tijk ≥ 0; aijk is the reflection coefficient (intensity) of the scattering point: aijk = 0 in case the ijk-th point is placed out of the object, and aijk ≠ 0 in case the ijk-th point is placed on the object; k = 1, …, K + L is the sample index of the ISAR signal; tijk(p) = 2·Rijk(p)/c is the time delay of the ISAR signal reflected by the ijk-th object point; L = [tijk,max(p) − tijk,min(p)]/ΔT is the relative time dimension of the object; tijk,min(p) = 2·Rijk,min(p)/c is the minimum time delay of the ISAR signal reflected from the object; tijk,max(p) = 2·Rijk,max(p)/c is the maximum time delay of the ISAR signal reflected from the object; Rijk(p) is the module of the distance vector to the ijk-th scattering point.
8.4.5 ISAR Image Reconstruction Procedure

The image reconstruction procedure is accomplished in the following stages: signal registration with range alignment; phase demodulation, range compression and azimuth compression; image quality evaluation, autofocusing phase compensation and final ISAR image reconstruction.

a. The signal registration with range alignment is performed by renumbering the range cells. It means that the parameter k, denoting the range bins where the ISAR signal is placed for each emitted pulse, accepts values k = 1, …, K + L.

b. The phase demodulation and range compression are accomplished by a correlation procedure over the ISAR signal with a reference function complex conjugated to the emitted complementary code phase modulated pulse waveform, i.e.,
Ŝ1(p, q) = Σ_{k=1..K+L} S1(p, k)·S1*(p, k − q);  Ŝ2(p, q) = Σ_{k=1..K+L} S2(p, k)·S2*(p, k − q),  (8.4.28)
where
S1*(p, k − q) = Σ_{k=1..K+L} exp{j[ω·(k − q)·ΔT + π·b1((k − q)·ΔT − k̂·Tk)]},  (8.4.29)

S2*(p, k − q) = Σ_{k=1..K+L} exp{j[ω·(k − q)·ΔT + π·b2((k − q)·ΔT − k̂·Tk)]},  (8.4.30)
where p = 1, …, N, q = 1, …, K + L.

c. Azimuth compression is accomplished over the range compressed ISAR data by applying Fourier transforms, i.e.,
I1(p̂, q) = Σ_{p=1..N} Ŝ1(p, q)·exp(−j·2π·p·p̂/N),  (8.4.31)

I2(p̂, q) = Σ_{p=1..N} Ŝ2(p, q)·exp(−j·2π·p·p̂/N),  (8.4.32)

where q = 1, …, K + L, p̂ = 1, …, N.
Numerical Experiment To verify the properties of the developed 2-D model of ISAR signal with complementary code phase modulation and to prove the correctness of the developed digital signal image reconstruction procedures including range compression and azimuth compression a numerical experiment is performed. It is assumed that the target is moving rectilinearly in a 2-D Cartesian coordinate system of observation, Oxy, and is detected in 2-D coordinate system O′XY. The target trajectory parameters are as follows: the vector velocity module V = 400 m/s; the guiding angle of the vector velocity, α = π; the angle between coordinate axes, φ = 0; the initial coordinates of the geometric center x00(0) = 0 m, y00(0) = 5 × 104 m. The ISAR emitted CPCM pulse parameters: the wavelength λ = 3 × 10−2 m; the carrying frequency f = 1010Hz; the time repetition period for signal registration (aperture synthesis) is Tp = 7.5 × 10−3s; the time duration of a segment of the CPCM pulse is Tk = 3.3 × 10−9s; the CPCM segment index kˆ = 1,10 ; the CPCM number of segments Kˆ = 10 ; the time width of the CPCM sample ΔT = 3.3 × 10−9s; the time duration of the emitted CPCM pulse is T = 33 × 10−9s; the dimension of the range resolution cell is ΔR = 0.5 m; the number of samples of PCM transmitted signal is K = 10; the CMPM sample number of the emitted PCM pulse k = 1,10 ; the number of transmitted pulses during inverse aperture synthesis is N = 500. The modeling of ISAR signal requires the geometry of the target to be enclosed in a 2-D regular rectangular grid, described in the coordinate system O′XY. The dimensions of the grid’s cell are ΔX = ΔY = 0.5m. The number of the reference points of the grid on the axes X and Y are I = 64 and J = 64, respectively. Isotropic scattering points are placed at each node of the regular grid. The mathematical expectation of the normalized intensities of the scattering points placed on the target is aij = 10−2. The mathematical expectation of the normalized intensities of the scattering points placed out of the target is aij = 10−5. The real and imaginary components of the ISAR signal, modulated by complementary phase codes A and B have been calculated and presented in a form of 3-D isometric projections. The real (a) and imaginary (b) component of the ISAR signal, modulated by the first code A of the phase code complementary pair in 3-D isometric projection is depicted in Figure 8.4.11. The real and imaginary component of the ISAR signal modulated by second code of the phase code complementary pair in 3-D isometric projection is depicted in Figure 8.4.12. The real and imaginary part of the ISAR signal, modulated by first code A and second code B of the phase code
314 Recognition and Perception of Images
Figure 8.4.11 Real (a) and imaginary (b) component of the ISAR signal modulated by first code A of the phase code complementary pair in 3-D isometric projection.

Figure 8.4.12 Real (a) and imaginary (b) component of the ISAR signal, modulated by second code B of the phase code complementary pair in 3-D isometric projection.

Figure 8.4.13 Real (a) and imaginary (b) part of the ISAR signal, modulated by first code A of the phase code complementary pair, after correlation range compression.
complementary pair, have been calculated by applying the correlation range compression procedure and presented in the form of 3-D isometric projections. The real and imaginary parts of the ISAR signal, modulated by the first code A of the phase code complementary pair, after correlation range compression, are presented in 3-D isometric projection in Figure 8.4.13. The real and imaginary parts of the ISAR signal, modulated by the second code B of the phase code complementary pair, after correlation range compression, are presented in 3-D isometric projection in Figure 8.4.14. The target image, as normalized intensities of the scattering points, has been calculated by applying the Fourier transform and a module procedure over the range compressed ISAR signal modulated by the first code A and the second code B of the phase code complementary pair, respectively. The image of the aircraft MiG-35 extracted from the CPCM ISAR signal modulated by the first code A of the phase code complementary pair, in the form of a 2-D pseudo-color map (256 levels of grey color scale), is depicted in Figure 8.4.15 (a). The image of the aircraft MiG-35 extracted from the CPCM ISAR signal modulated by the second code B of the phase code complementary pair, in the form of a 2-D pseudo-color map (256 levels of grey color scale), is depicted in Figure 8.4.15 (b). The interference of the two images presented in Figures 8.4.15 (a) and 8.4.15 (b) bears the final image of the aircraft MiG-35 depicted in Figure 8.4.16 (a). As can be seen, the ultimate image retrieved from the ISAR signal modulated by the phase code complementary pair possesses a higher resolution in comparison with the images obtained from the ISAR signal modulated by the first code A or the second code B alone. In order to estimate the quality of the ISAR image obtained from the ISAR signal modulated by the complementary phase codes, and to evaluate the capability of the complementary phase code modulation in comparison with
Figure 8.4.14 Real (a) and imaginary (b) parts of the ISAR signal, modulated by the second code B of the phase code complementary pair, after correlation range compression.
Figure 8.4.15 The image of the aircraft MiG-35, obtained from the ISAR signal modulated by the first code A (a) and the second code B (b) of the phase code complementary pair, in the form of a 2-D pseudo-color map (256 levels of grey scale).
Figure 8.4.16 Aircraft MiG-35 image (256 levels of grey scale) from the CPCM ISAR signal (a) and from the Barker PCM ISAR signal (b).
In order to estimate the quality of the ISAR image obtained from the ISAR signal with complementary phase code modulation, and to evaluate the capability of this modulation in comparison with another kind of phase code modulation, the final image retrieved from the Barker phase-code-modulated ISAR signal is calculated. A reconstructed image of the aircraft MiG-35 from the ISAR signal modulated by the Barker phase code is presented as a 2-D pseudo-color map (256 levels of grey scale) in Figure 8.4.16, b. It can be noticed that the final image extracted from the complementary phase-code-modulated ISAR signal is significantly better: the side lobe levels of the Barker image are higher than those of the image reconstructed from the ISAR signal with complementary phase code modulation.
8.4.6 Parametric ISAR Image Reconstruction
Based on the LFM ISAR signal model and the 2-D geometry, an original parametric ISAR image reconstruction algorithm, called recurrent Kalman filtration, is considered in the present section. The algorithm is based on the following vector measurement and state equations.
a. Vector Measurement Equation and State Equation
$$\xi(p,k) = S[p,k,a(p)] + n(p,k), \qquad a(p) = g[p,k,a(p-1)] + n_0(p,k) \qquad (8.4.33)$$
where ξ(p,k) is the [2(K + L); 1] measurement column vector; S[p,k,a(p)] is the deterministic process in the field of the vector arguments a(p) and yields a column vector with dimensions [2(K + L); 1]; g[p,k,a(p − 1)] is a column-vector function that describes the variation of the vector arguments at discrete time moments, with dimensions [I × J; 1]; n(p,k) and n0(p,k) are sequences of random vector values with zero expectation and covariance matrices ψ(p,k), with dimensions [2(K + L); 2(K + L)], and V(p), with dimensions [I × J; I × J], respectively. The vector a(p), with dimensions [I × J; 1], accounts for the vector of estimates of the intensities of the scattering points aij. In order to model the quadrature components of the ISAR trajectory signal, it is supposed that in the Cartesian coordinate space Oxy an object moves along a rectilinear trajectory at a constant speed V. The object is situated in a coordinate grid whose origin may coincide with the geometric center of the object. The shape of the object is described by the intensities (reflection coefficients) aij of scattering points distributed in accordance with its geometry.
b. Approximation Functions
In the general case, the functions S[p,k,a(p)] and g[p,k,a(p − 1)] in (8.4.33) are non-linear.
This circumstance would lead to ambiguity in the definition of the invariant parameters. One of the main purposes of the present work is to reveal the composition of S[p,k,a(p)] and g[p,k,a(p − 1)] under linear approximation and to develop an algorithm for quasi-linear Kalman filtration of the invariant vector parameters in the complex amplitude of the ISAR trajectory signal. The approximation function S[p,k,a(p)] is defined by the quadrature components of a complex signal reflected by the scattering points placed at the nodes of a uniform grid, i.e.,
$$S[p,k,a(p)] = S_c[p,k,a(p)] + jS_s[p,k,a(p)] \qquad (8.4.34)$$
The Taylor expansion, after ignoring the higher-order terms, results in the following linear equations:

$$S_c[p,k,a(p)] = S_c[p,k,\bar{a}(p)] + \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\partial S_c[p,k,\bar{a}(p)]}{\partial a_{ij}}\,(a_{ij}-\bar{a}_{ij}) \qquad (8.4.35)$$

$$S_s[p,k,a(p)] = S_s[p,k,\bar{a}(p)] + \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\partial S_s[p,k,\bar{a}(p)]}{\partial a_{ij}}\,(a_{ij}-\bar{a}_{ij}) \qquad (8.4.36)$$
where $\bar{a}(p) = [\bar{a}_{11} \ldots \bar{a}_{1J}, \bar{a}_{21} \ldots \bar{a}_{ij} \ldots \bar{a}_{IJ}]^T$ is the vector of estimates of the invariant geometrical parameters with dimensions [I × J; 1], the superscript [.]^T denotes matrix transpose, and the product I × J is the full number of estimates of the isotropic scattering points of the grid with intensities aij. The constant coefficients of the Taylor expansion are defined by

$$S_c[p,k,\bar{a}(p)] = \sum_{i=1}^{I}\sum_{j=1}^{J} \bar{a}_{ij}\cos[\omega(t-t_{ij})+b(t-t_{ij})^2], \qquad (8.4.37)$$

$$S_s[p,k,\bar{a}(p)] = \sum_{i=1}^{I}\sum_{j=1}^{J} \bar{a}_{ij}\sin[\omega(t-t_{ij})+b(t-t_{ij})^2]. \qquad (8.4.38)$$
The coefficients of the linear terms of the Taylor expansions (8.4.35) and (8.4.36) are defined by the expressions
$$\frac{\partial S_c[p,k,\bar{a}(p)]}{\partial a_{ij}} = \cos[\omega(t-t_{ij})+b(t-t_{ij})^2], \qquad (8.4.39)$$

$$\frac{\partial S_s[p,k,\bar{a}(p)]}{\partial a_{ij}} = \sin[\omega(t-t_{ij})+b(t-t_{ij})^2]. \qquad (8.4.40)$$
If the vector-estimated parameters are Gaussian and Markov, the state transition matrix function g[p,k,a(p − 1)], linking the vector estimates of invariant parameters in two consecutive moments in linear approximation, is given by the expression
$$g[p,k,a(p-1)] = g(p,k)\,\bar{a}(p-1) \qquad (8.4.41)$$
where $g(p,k) = \mathrm{diag}\!\left\{\exp\!\left(-\dfrac{NT_p}{\tau_{ij}}\right)\right\}$ is the diagonal matrix with dimensions [I × J; I × J] and τij is the correlation time of the parameter aij. If the observation time NTp is considerably less than the correlation time τij, then the state transition matrix g(p,k) becomes approximately an identity matrix, i.e., the estimated geometrical parameters are invariant within the ISAR observation time interval.
c. Recurrent Kalman Procedure
The modified recurrent Kalman procedure for quasi-linear estimation of the invariant parameters can be defined as follows
$$\bar{a}(p) = g(p,k)\,\bar{a}(p-1) + K(p,k)\{\xi(p,k) - S[p,k,\bar{a}(p-1)]\} \qquad (8.4.42)$$
where
$$\xi(p,k) = [\xi_c(p,k), \xi_s(p,k)]^T \qquad (8.4.43)$$
is the new measurement vector with dimensions [2(K + L); 1];
$$S[p,k,\bar{a}(p-1)] = \begin{bmatrix} S_c[p,k,\bar{a}(p-1)] \\ S_s[p,k,\bar{a}(p-1)] \end{bmatrix} \qquad (8.4.44)$$
is the measurement prediction vector with dimensions [2(K + L);1];
$$K(p,k) = R(p,k)\,H^T(p,k)\,\psi^{-1}(p,k) \qquad (8.4.45)$$
is the Kalman filter gain matrix with dimensions [I × J; 2(K + L)]; R(p,k) is the updated state error covariance matrix with dimensions [I × J; I × J], determined by

$$R^{-1}(p,k) = H^T(p,k)\,\psi^{-1}(p,k)\,H(p,k) + [g^T(p,k)\,R(p-1,k)\,g(p,k) + V^{-1}(p,k)]^{-1} \qquad (8.4.46)$$
where
$$H(p,k) = \begin{bmatrix}
\dfrac{\partial S_c[p,1,\bar{a}(p-1)]}{\partial a_{11}} & \cdots & \dfrac{\partial S_c[p,1,\bar{a}(p-1)]}{\partial a_{IJ}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial S_c[p,K,\bar{a}(p-1)]}{\partial a_{11}} & \cdots & \dfrac{\partial S_c[p,K,\bar{a}(p-1)]}{\partial a_{IJ}} \\
\dfrac{\partial S_s[p,1,\bar{a}(p-1)]}{\partial a_{11}} & \cdots & \dfrac{\partial S_s[p,1,\bar{a}(p-1)]}{\partial a_{IJ}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial S_s[p,K,\bar{a}(p-1)]}{\partial a_{11}} & \cdots & \dfrac{\partial S_s[p,K,\bar{a}(p-1)]}{\partial a_{IJ}}
\end{bmatrix} \qquad (8.4.47)$$
is the state-to-measurement transition matrix with dimensions [2(K + L);I × J].
The elements of the matrix H(p,k) for each p = 1, …, N can be generally described by the expressions:
$$h^{c}_{k,(i-1)J+j} = \frac{\partial S_c[p,k,\bar{a}(p-1)]}{\partial a_{ij}} = \cos[\omega(t-t_{ij})+b(t-t_{ij})^2], \qquad (8.4.48)$$

$$h^{s}_{k,(i-1)J+j} = \frac{\partial S_s[p,k,\bar{a}(p-1)]}{\partial a_{ij}} = \sin[\omega(t-t_{ij})+b(t-t_{ij})^2]. \qquad (8.4.49)$$
The matrix R(p − 1, k) is the predicted state error covariance matrix with dimensions [I × J; I × J]. At the beginning of the procedure, p = 1, the initial predicted state error covariance matrix R(0,k) is an identity matrix. The process noise covariance matrix V(p,k) and the measurement covariance matrix ψ(p,k) are diagonal with elements N0/(2Tp), where N0 is the spectral density of the Gaussian noise.
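A compact numpy sketch of one iteration of the procedure (8.4.42)–(8.4.46) is given below. It is only a sketch under simplifying assumptions: the measurement function S is replaced by a random linear model with Jacobian H, and all dimensions and noise settings are illustrative rather than taken from the ISAR signal model.

```python
# One recurrent Kalman step per (8.4.42)-(8.4.46), with a stubbed linear
# measurement model; all names and sizes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_par, n_meas = 16, 40                         # I*J parameters, 2(K+L) data
H = rng.standard_normal((n_meas, n_par))       # state-to-measurement matrix
psi_inv = np.eye(n_meas)                       # measurement noise psi^{-1}
V_inv = np.eye(n_par)                          # process noise V^{-1}
g = np.eye(n_par)                              # state transition ~ identity
a_prev = np.zeros(n_par)                       # previous estimate a(p-1)
R_prev = np.eye(n_par)                         # predicted covariance R(p-1)
a_true = rng.uniform(0.0, 1.0, n_par)
xi = H @ a_true + 0.01 * rng.standard_normal(n_meas)   # measurement xi(p,k)

# Updated state error covariance, eq. (8.4.46) as printed
R = np.linalg.inv(H.T @ psi_inv @ H + np.linalg.inv(g.T @ R_prev @ g + V_inv))
K = R @ H.T @ psi_inv                                  # gain, eq. (8.4.45)
a_new = g @ a_prev + K @ (xi - H @ (g @ a_prev))       # eq. (8.4.42), linear S
print(np.abs(a_new - a_true).max())                    # small residual error
```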
Numerical experiment
To substantiate the properties of the developed ISAR signal model and to verify the correctness of the developed Kalman image reconstruction procedure, a numerical experiment is carried out. It is assumed that the object is moving rectilinearly in the 2-D observation Cartesian coordinate system Oxy and is detected in the 2-D coordinate system O′XY. The trajectory parameters of the object are: the module of the vector velocity V = 600 m/s; the guiding angle of the vector velocity α = π; the angle between coordinate axes φ = 0; the initial coordinates of the object geometric center x00(0) = 0 m, y00(0) = 5 × 10^4 m. The parameters of the emitted LFM waveforms are: the wavelength λ = 3 × 10^−2 m; the pulse repetition period Tp = 2.5 × 10^−2 s; the time duration of the emitted LFM pulse waveform T = 10^−6 s; the number of LFM samples of the signal K = 32; the carrier frequency f = 10 GHz; the LFM sample time width ΔT = 3.125 × 10^−8 s; the LFM waveform bandwidth ΔF = 3 × 10^8 Hz; the LFM rate b = 9.4 × 10^14; the number of emitted pulses during inverse aperture synthesis N = 100. The object geometry is depicted in a 2-D regular rectangular grid in the coordinate system O′XY (Figure 8.4.17); the dimensions of the grid's cell are ΔX = ΔY = 0.5 m; the number of the grid's reference points is I = 20 on the axis X and J = 20 on the axis Y. Scattering points are placed at each node of the grid. The intensities of the scattering points placed on the object are aij = 10^−2; the intensities of the scattering points placed outside the object are aij = 10^−3.
Figure 8.4.17 Discrete structure of the object depicted in the space of the regular grid.
The arguments of the approximation functions are the intensities of scattering points with amplitudes less than aij = 10^−3. The results of the numerical experiment for different stages of the recurrent Kalman procedure are presented in Figure 8.4.18. As can be seen, the quality of the ISAR images improves with each iteration step, which proves the correctness of the ISAR signal model and the image reconstruction capability of the Kalman method. The suggested Kalman algorithm is distinguished by high computation speed and by steady, quick convergence; therefore, it could be successfully exploited for extracting geometrical parameters from LFM ISAR data. In conclusion, vector measurement and vector state equations are defined and used to create an ISAR imaging algorithm. Based on the 2-D ISAR geometry and LFM signal modeling, the approximation functions of a recurrent Kalman image reconstruction algorithm are defined. The image reconstruction is realized by quasi-linear estimation of geometrical parameters – the intensities of the object's scattering points. Simulation experiments illustrate the capability of the recurrent method for ISAR imaging. The recurrent Kalman procedure demonstrates high effectiveness in object image reconstruction using a reduced number of ISAR measurements.
Figure 8.4.18 Reconstructed ISAR image: by step p = 1 (a), by step p = 2 (b), by step p = 40 (c), by step p = 100 (d).
8.5 Conclusions
In the present study, 3-D and 2-D ISAR geometry and the kinematic and geometric transformations are analytically described. It is illustrated how the 3-D ISAR scenario of the aperture synthesis is transformed into the 2-D geometry of ISAR signal registration, with decomposition of the object's rectilinear movement into translational displacement of the object's geometric center and the object's rotation around it. Analytical expressions of wideband waveforms with linear frequency, stepped frequency, and phase code modulation applied in 3-D and 2-D ISAR signal modeling are provided. Nonparametric instruments such as Fourier transforms and correlation algorithms are developed, applied for ISAR image reconstruction, and illustrated by numerical experiments. Special attention is paid to phase correction techniques realized by autofocusing algorithms. The autofocusing function is constructed by tracking the nearest (first) scattering point from the object space; the defocusing mechanism is explained by the change of the address of the first scattering point during aperture synthesis. An emphasis is made on the parametric Kalman image reconstruction. The basic components of the measurement and state vector equations are derived and used in the image reconstruction algorithm, illustrated by numerical results.
Acknowledgment This work was supported by the National Science Fund of Bulgaria under contract FNI KP-06-N37/1, entitled “Ergonomic research on work-related health problems by innovative computer models with focus on the prevention of Musculoskeletal Disorders”.
9 Remote Sensing Imagery Spatial Resolution Enhancement
Sergey A. Stankevich*, Iryna O. Piestova and Mykola S. Lubskyi
Scientific Centre for Aerospace Research of the Earth, NAS of Ukraine, Kiev, Ukraine
Abstract
Comprehensive estimation of the condition of Earth geosystems by remote sensing data requires synergetic interpretation of multisensor imagery obtained in different spectral bands with different spatial resolution. The useful information of multispectral imagery, as distinct from Shannon's capacity, is determined not by the radiometric but mainly by the spatial resolution. At present, many methods for spatial resolution enhancement of remotely sensed imagery are used to improve visual data analysis in segmentation, recognition, interpretation, and measurements. The distortion of image radiometry is the primary disadvantage of such methods. Therefore, it is very important to provide physical spatial resolution enhancement of multispectral imagery with radiometry preservation. In this chapter, we will focus on the general informativeness of remote sensing imagery and the associated equivalent spatial resolution of multiband images. Next, a novel technique for multispectral imagery resolution enhancement based on the engagement of a reference spectra database will be proposed. The radiometric signals that are reallocated within image subpixels according to topological rules to enhance the overall physical resolution of the image will be described. Smart reallocation of land cover class patterns according to the relationships and types of the nearest surrounding subpixels is a key point in this approach. Infrared remote sensing imagery has special features, so longwave infrared remotely sensed imagery resolution enhancement using superresolution of image pairs of the same scene in the frequency domain is developed. Partitioning of the frequency spectrum, connected with different physical internals of the land surface, is used to enhance spatial resolution with additional information. Lastly, the issues and open challenges of evaluating the actual spatial resolution of remote sensing images will be discussed.
*Corresponding author: [email protected]
Keywords: Remote sensing imagery, multiband image, reflectance spectra, spatial resolution, informativeness, spatial-frequency analysis, subpixel
9.1 Introduction
Earth observation data obtained using airborne and spaceborne imaging systems are increasingly used in a variety of economic, natural resource, and special remote sensing applications [Lillesand et al., 2015]. The ensured spatial resolution of aerospace imagery is the main measure of quality in almost all remote sensing applications, and increasing spatial resolution is the most stable trend in the hi-tech development of imaging systems for Earth observation. The main purpose of remote sensing imaging systems is to ensure reliable detection and recognition of objects on the land or sea surface, which in most cases have small-size details and low contrast [Wei et al., 2018]. For that reason, the actual resolution of remote sensing imagery is so important. In the modern interpretation, the concept of spatial resolution is essentially statistical, as its value is estimated by statistical methods with multiple measurements in test digital aerospace images [Descloux et al., 2019]. Practical experience evidences the direct connection between the spatial resolution of the imaging system and the probability of detecting and recognizing objects in the acquired images [Kononov, 1991]. However, a clearly expressed feature of aerospace imagery now is its multiband nature. The joint analysis of several co-registered band images significantly improves the efficiency of remote sensing and can be used to detect objects that cannot be detected in any separate band image [Arablouei, 2018]. The issue of the spatial resolution of multiband imagery is rather complicated and should be considered in the context of the usefulness of the information it contains.
9.2 Multiband Aerospace Imagery Informativeness
Currently, the primary approach to estimating the informativeness of aerospace imagery in remote sensing applications relies heavily on the objects' detection probability. However, in the theory of information, informativeness usually means the volume or fraction of useful information in its entirety.
The usefulness of the information contained in the remote sensing image is determined by its value for a particular application. Therefore, informativeness should be measured as the information amount in appropriate metric units. At the same time, the relationship between quantitative information estimates and probabilistic ones is clear [Kononov, 1991]. An adequate evaluation of informativeness allows obtaining the desired result by minimal means – with a significant reduction in the amount of information analyzed without loss of quality. Also, considering the informativeness can improve the information value of the result. Optimization of informativeness can be aimed at selecting more informative spectral bands or at synthesizing new secondary images – both linear and non-linear. Aerospace imagery informativeness relative to a particular remote sensing application is determined by the amount of information useful for correctly distinguishing the objects and backgrounds specific for this application. There is no fundamental difference between objects and backgrounds, but objects should be distinguished from each other and from backgrounds, while backgrounds should be distinguished from objects only. Information for detecting objects and backgrounds is contained in the spectral, energy, and spatial distributions of their optical signals. The amount of information determined by spectral distributions can be expressed by the information divergence D. Both the energy and spatial components can be determined using the equivalence principle [Stankevich & Gerda, 2020] – a return to the primary goal of image analysis, which is object detection. To do this, equivalent values in terms of object detection probability should be found by statistical methods: the equivalent signal-to-noise ratio ψ (energy component) and the equivalent spatial resolution r (spatial component) [Stankevich et al., 2008]. The complete expression for the information amount C, taking into account the equivalence principle, is [Stankevich, 2006b]:
$$C = \frac{D \log_2 (1 + \psi)}{4 r^2} \qquad (9.1)$$
The equivalent signal-to-noise ratio in a multispectral image can be evaluated through the Bhattacharyya statistical metric [Goudail et al., 2004]. As for the equivalent spatial resolution, it can be found by spatial-frequency analysis of the equivalent optical transfer function (OTF) of a multispectral image [Stankevich, 2006a].
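As a numeric illustration of equation (9.1) with assumed input values, the short snippet below shows how the information amount grows with the divergence D and the equivalent signal-to-noise ratio ψ, and decreases quadratically with a coarser equivalent resolution r.

```python
# Numeric illustration (assumed values) of the information amount (9.1).
import math

def information_amount(D, psi, r):
    """Equation (9.1): C = D * log2(1 + psi) / (4 * r**2)."""
    return D * math.log2(1.0 + psi) / (4.0 * r * r)

print(information_amount(D=2.0, psi=10.0, r=15.0))   # 15 m equivalent pixel
print(information_amount(D=2.0, psi=10.0, r=30.0))   # four times less at 30 m
```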
9.3 Equivalent Spatial Resolution of Multiband Aerospace Imagery
The concept of equivalent spatial resolution was introduced in [Stankevich, 2006a] using a probabilistic approach to object detection in the image. The mechanism for determining equivalent spatial resolution is based on the analogy with the well-known spatial-frequency analysis of a one-dimensional image [Bracewell, 2003]. Spatial-frequency methods provide powerful and convenient tools for the analysis and evaluation of remote sensing imaging systems [Boreman, 2001]. Such methods have been widely used in remote sensing practice for decades. The path of the optical signal (from the source of radiation, through the medium between the object and the imaging system, to the image) is thought of as composed of separate serial optical links. The system converts the optical radiation distribution inside the object plane E0(x) into the distribution inside the image plane E(x). The transmitting properties of a link are characterized by the signal at its output when a standard-type signal is input. The most widespread is the impulse response h(x), which is the system's output due to a single delta function δ(x), for which
$$\int_{-\infty}^{+\infty} \delta(x)\, dx \equiv 1 \qquad (9.2)$$
In imaging system analysis, h(x) is called the point spread function (PSF); it describes the distribution of the optical signal inside the image of the ideal point formed by the system. The total optical signal inside the whole output image is determined by the convolution of the input image with the system's impulse response:
$$E(x) = \int_{-\infty}^{+\infty} E_0(x-\xi)\, h(\xi)\, d\xi = E_0(x) \otimes h(x) \qquad (9.3)$$
If the optical signal E(x) is described by its spatial-frequency spectrum E(ν), where ν is the spatial frequency,

$$E(\nu) = \int_{-\infty}^{+\infty} E(x) \exp(-2\pi i \nu x)\, dx \qquad (9.4)$$
then the transfer properties of the imaging system are characterized by its OTF. The OTF H(ν) is defined as the ratio of the complex spatial-frequency spectrum E(ν) of the signal at the output of the system to the complex spatial-frequency spectrum E0(ν) of the input signal:
$$H(\nu) = \frac{E(\nu)}{E_0(\nu)} \qquad (9.5)$$
Since the distribution of the optical signal inside the image is governed by the convolution (9.3), there is a clear interrelation between the OTF H(ν) and the PSF h(x): they are obtained from each other by the direct and inverse Fourier transforms:
$$H(\nu) = \int_{-\infty}^{+\infty} h(x) \exp(-2\pi i \nu x)\, dx, \qquad h(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} H(\nu) \exp(2\pi i \nu x)\, d\nu \qquad (9.6)$$
From (9.5) it follows that the spatial-frequency spectrum (9.4) of the image is the product of the spatial-frequency spectrum E0(ν) of the input optical signal and the OTF H(ν) of the system. When several optical links are connected sequentially, the total OTF is determined by the equation
$$H(\nu) = \prod_{j=1}^{m} H_j(\nu) \qquad (9.7)$$
which significantly facilitates the analysis of the entire imaging system. However, classical spatial-frequency methods have serious drawbacks, which are especially evident when commissioning the latest airborne and space-borne remote sensing systems. First, they are difficult to adapt for the analysis of multispectral and hyperspectral, that is, multidimensional imaging systems, which are the vast majority now in remote sensing. The application of the multidimensional Fourier transform does not solve the problem because it does not reduce the system’s evaluation metric down to the generally accepted universal one, but simply translates the problem of multidimensionality into the frequency domain.
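The PSF–OTF relationship (9.6) and the cascade rule (9.7) are easy to reproduce numerically. The sketch below (the Gaussian PSFs stand in for real optical links and are an assumption of this example, not the chapter's data) computes each link's OTF by a discrete Fourier transform and multiplies them into the total MTF.

```python
# PSF -> OTF via FFT per (9.6), then cascading two links per (9.7).
import numpy as np

x = np.linspace(-8.0, 8.0, 257)
dx = x[1] - x[0]

def gaussian_psf(sigma):
    h = np.exp(-x**2 / (2.0 * sigma**2))
    return h / (h.sum() * dx)          # unit area, consistent with (9.2)

# OTF of each link = Fourier transform of its (centered) PSF
otf1 = np.fft.fft(np.fft.ifftshift(gaussian_psf(0.6))) * dx
otf2 = np.fft.fft(np.fft.ifftshift(gaussian_psf(1.1))) * dx
otf_total = otf1 * otf2                # sequential links multiply, eq. (9.7)
mtf = np.abs(otf_total)                # modulation transfer function
print(mtf[:5])                         # unity at zero frequency, then decay
```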
The spatial resolution was such a universal metric for a long time, but its involvement in the analysis of multidimensional imaging systems encounters certain difficulties. Many empirical methods for multispectral image fusion are used in practice [Pandit & Bhiwani, 2015]. Calculation of separate OTFs in individual spectral bands is inefficient because it does not take into account the intercorrelations of the images between the bands. Secondly, to determine the spatial resolution of the imaging system, in addition to its OTF, a noise model is required, which is expressed by the threshold modulation [Stankevich, 2005]. The problem of input image multidimensionality reappears here. A common feature of the listed complexities, in addition to multidimensionality, is the statistical nature of remote sensing image acquisition. It is possible to overcome the above-mentioned difficulties by going back to the primary goal of image analysis, namely, object detection. It should be noted that the blur of an ideal point in an actual imaging system is closely associated with optical aberrations, deviations in the registration of photons by the semiconductor sensor, and the transfer of photoelectrons by the electronic bus. All of these are quantum-mechanical, and therefore fundamentally probabilistic, effects. In this context, the PSF should be considered as the distribution of quanta within the image point, whose density quickly decreases from the center to the edges. When the initial number of photons is fixed and sufficiently large, the PSF acquires the meaning of the spatial distribution of the probability density f(x) of the optical signal in the image produced by the imaging system:
$$f(x) = \mathrm{const} \cdot h(x) \qquad (9.8)$$

where $\mathrm{const} = 1 \Big/ \int_{-\infty}^{+\infty} h(x)\, dx$ is the normalization multiplier,
which is equal to 1 according to (9.2) if signal energy loss and gain are neglected. From the above follows the purely probabilistic nature of target detection in remote sensing images, which transcends the scope of classical spatial-frequency analysis. This fact erodes the concept of spatial resolution and forces researchers to additionally involve other probabilistic models of target detection in the image [Althouse & Chang, 1995]; [Stankevich, 2004]; [Liwen et al., 2016]; [Li, 2018]. Statistical methods of digital image processing for spatial resolution evaluation are very promising. At the same time, it is desirable not to lose the valuable developments obtained in the spatial-frequency analysis
of imaging systems. A probabilistic transform (PT) is engaged for one-dimensional panchromatic images [Popov & Stankevich, 2005]. PT calculates the probability of correct separation of the lower and upper semi-planes of the edge spread function (ESF) for each pixel, whereas the ESF is the system response to an ideal jump of the signal between image segments along a chosen direction. The probabilistic transform implicitly takes into account the effect of the introduced noise on the image quality. One-dimensional PT can be extended easily to the multidimensional image case by calculating the probability over multidimensional signal distributions [Stankevich & Sholonik, 2007]. Let a multidimensional ESF E(x) be determined along some direction in a multispectral image using PT or in a different way [Stankevich & Shklyar, 2005], as shown in Figure 9.1. If the probability distribution of any one segment – lower Plow or upper Phigh – is calculated along this direction, then the one-dimensional probability spread function (Figure 9.2) is obtained. The spatial derivative of this function (Figure 9.3) is the probability density distribution (9.8) in the image. Returning to the physical meaning of equation (9.8), with the known assumptions the obtained function f(x) can be considered as the equivalent PSF of the multispectral image. Thus, the probabilistic approach allows reducing the multidimensional PSF of the imaging system down to an equivalent one-dimensional one without loss of descriptiveness and clarity of the estimates obtained.
Figure 9.1 Multidimensional ESF of multispectral image.
Figure 9.2 Probability spread function of multispectral image.
Figure 9.3 Probability density distribution of a multispectral image.
Further application of the Fourier transform (9.6) to the equivalent PSF allows determining the equivalent OTF of the imaging system and then applying all known methods of spatial-frequency analysis – modulation transfer function (MTF) calculation, spatial resolution evaluation, imaging systems' synthesis, inverse filtering, fusion of band images of different resolution, etc.
In addition, the spatial resolution of the imaging system can be estimated directly from the equivalent PSF by referring to the classic definition of resolution: it is the minimum distance between two points in the image that are still perceived as distinct. According to Rayleigh, this distance is determined between the signal maximum of the PSF and its first diffraction minimum. But the actual PSFs of imaging systems in general always differ from the diffractional one due to many factors. In view of this, probability remains a universal criterion of resolving image points. If two identical PSFs are displaced relative to each other by a distance r, the error probability ε can be found as
$$\varepsilon(r) = \int_{0}^{r} f(x-r) \cap f(x)\, dx \qquad (9.9)$$
which is illustrated in Figure 9.4. The described technique for estimating equivalent spatial resolution is a topical issue due to the ever wider application of multi- and hyperspectral aerospace imagery. The equivalent OTF concept makes the evaluation of the transfer characteristics of multidimensional imaging systems with conventional spatial-frequency analysis more complete and theoretically reasonable.
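A minimal numerical sketch of the overlap criterion (9.9) is given below; the Gaussian equivalent PSF, the 0.5 error threshold, and the integration of the overlap over the whole axis (rather than 0..r) are simplifying assumptions of this illustration.

```python
# Two copies of an equivalent PSF are shifted apart until the overlap of the
# two probability densities (the error probability) falls below a threshold;
# that shift is taken as the equivalent spatial resolution.
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2.0); f /= f.sum() * dx       # unit-area equivalent PSF

def error_probability(r):
    """Overlap area of f(x - r) and f(x), the intersection in (9.9)."""
    shifted = np.interp(x - r, x, f, left=0.0, right=0.0)
    return np.minimum(f, shifted).sum() * dx

for r in np.arange(0.0, 5.0, 0.5):
    if error_probability(r) < 0.5:
        print(f"equivalent resolution ~ {r:.1f}")  # first r below threshold
        break
```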
Figure 9.4 Determining the equivalent spatial resolution by the equivalent PSF of the imaging system.
9.4 Multispectral Imagery Resolution Enhancement Based on Spectral Signatures' Identification
Physical spatial resolution for panchromatic imaging, or equivalent spatial resolution for multispectral imaging, is the most important specification of remote sensing capabilities. The existing multispectral remote sensing systems acquire aerospace imagery in several spectral bandsets with constant spatial resolution within a bandset and integer-fold ratios of spatial resolution between different bandsets. Examples of such systems are Landsat-8 OLI, Sentinel-2 MSI, and others [Fu et al., 2020]. The resolution specifications of some modern remote sensing systems are listed in Table 9.1. This phenomenon is caused by fundamental modifications in the imager design in the transition from one spectral bandset to another, for example, from visible and near infrared (VNIR) to shortwave infrared (SWIR), mid-wave infrared (MWIR), and especially to longwave infrared (LWIR) or thermal infrared (TIR), due to changes in the quantum properties of the registered radiation and of the photoelectronic sensor. The issue of spatial resolution enhancement of aerospace imagery, or of synthesizing a new high-resolution image by joint processing of low-resolution images of the same scene, has been discussed numerous times [Ehlers, 2008]; [Thomas et al., 2008]; [Li et al., 2009]; [Fonseca et al., 2011]. The theoretical justification for such a possibility is the amplification of the signal-to-noise ratio by joint linear processing of signals with uncorrelated noise [Kay, 1993]. A variety of resampling algorithms or variants of classic pan-sharpening are used for the spatial resolution adjustment in different spectral bands of multispectral aerospace imagery. In pan-sharpening, the low-resolution multispectral aerospace image is superimposed on the high-resolution panchromatic one of the same scene by a pre-defined sequence of formal arithmetic and logic operations. The disadvantages of pan-sharpening are its heuristic nature, the performance of the basic operations in the low spatial resolution domain, and the neglect of statistical and radiometric properties of multispectral aerospace imagery as well as of the physical characteristics of scene objects [Alparone et al., 2004]; [Wang et al., 2005]; [Xu, Zhang & Li, 2014]. It is possible to eliminate these limitations by involving additional information about objects inside the remote sensing scene.
Table 9.1 The spatial resolution of multispectral imaging systems.

| Satellite system | Country | Imager type | Spatial resolution, m (in spectral bands) |
|---|---|---|---|
| Superview-1/4 | China | PMS-3 | 0.5 m panchromatic and 2 m multispectral (4 bands) |
| Gaofen-2 | China | PMC-2 | 1 m panchromatic and 4 m multispectral (4 bands) |
| Gaofen-4 | China | GFI | 50 m VNIR (4 bands) and 400 m MWIR (1 band) |
| Canopus-V | Russia | MSS-P | 2.1 m panchromatic and 10.5 m multispectral (3 bands) |
| Resource-P | Russia | Geotone-L | 1 m panchromatic and 4 m multispectral (6 bands) |
| Sich-2M | Ukraine | MSU-8M | 2.4 m panchromatic and 7.8 m multispectral (4 bands) |
| Resourcesat-2 | India | LISS-IV | 5.8 m VNIR (3 bands) and 23.5 m SWIR (4 bands) |
| SPOT-6/7 | France | NAOMI | 1.5 m panchromatic and 6 m multispectral (4 bands) |
| Sentinel-2 | EU | MSI | 10 m multispectral (4 bands), 20 m multispectral (6 bands), 60 m multispectral (3 bands) |
| Landsat-8 | USA | OLI | 15 m panchromatic and 30 m multispectral (8 bands) |
| EOS | USA | ASTER | 15 m VNIR (3 bands), 30 m SWIR (6 bands) and 90 m TIR (5 bands) |
If there is an object with a known spectral reflectance ρ(λ) within the scene, it is possible to determine its reflectance value ρi in any i-th spectral band Δλi:
$$\rho_i = \frac{1}{|\Delta\lambda_i|} \int_{\Delta\lambda_i} \rho(\lambda)\, d\lambda \qquad (9.10)$$
According to equation (9.10), it is possible to get an object's spectral signature – the set of its reflectance values in all spectral bands. Spectral signatures are usually individual for almost all remote sensing objects, so it is possible to identify arbitrary objects in a multispectral image. A necessary condition for such identification is information about the object's spectral reflectance, for example in the form of a spectral library (SpecLib) of typical objects [Tominaga, 2002]; [Boori et al., 2018]. Many well-known methods can be used for identification – based on cross-correlation, using spectral diagrams, determining spectral similarity by the shape of reflectance curves, by spectral contrasts, based on the Hamming, Euclidean or Mahalanobis distance, using a variety of information metrics, etc. [Landgrebe, 2003]; [Lu & Weng, 2007]; [Salah, 2017]; [Scherer, 2020]. Once identification is finished, it becomes possible to predict spectral signatures for the subpixels of low-resolution spectral bands that correspond to pixels of high-resolution spectral bands. In this way the complex nonlinear relationship between the optical reflectance of the same object in different spectral bands is put to work, as shown in Figure 9.5: ρx = SpecID(ρ1, ρ2, ρ3); here SpecID() is the procedure for identifying the analyzed spectral signature in SpecLib. For the subpixels of the high-resolution output image, the average of the subpixel values must equal the input pixel value ρi – the constraint of radiometric non-distortion:
$$\rho_i = \frac{1}{k} \sum_{j=1}^{k} \rho_{ij} \qquad (9.11)$$
where k is the number of pixels in the output high-resolution image corresponding to one pixel in the input low-resolution image. The constraint (9.11) necessitates radiometrically proportional weighting of the subpixels within a low-resolution pixel. This constraint should be imposed over all spectral bands for which spatial resolution is enhanced.
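A hedged sketch of the weighting step implied by (9.11) follows: the subpixel reflectances predicted from the identified SpecLib signatures are rescaled so that their mean reproduces the original low-resolution pixel value. The names and the 2×2 zoom factor (k = 4) are illustrative assumptions.

```python
# Radiometric non-distortion per (9.11): scale k predicted subpixel values
# so their mean equals the original low-resolution pixel value.
import numpy as np

def weight_subpixels(predicted, pixel_value):
    """Proportionally rescale predicted subpixel reflectances."""
    return predicted * (pixel_value / predicted.mean())

rho_pixel = 0.24                                   # low-resolution pixel
rho_subpix = np.array([0.30, 0.28, 0.18, 0.20])    # SpecID-based predictions
rho_out = weight_subpixels(rho_subpix, rho_pixel)
print(rho_out, rho_out.mean())                     # mean is 0.24 again
```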
Figure 9.5 Determining the spectral reflectance value of the x band using the identified spectrum in SpecLib.
The general idea of correct spatial resolution enhancement of certain spectral bands of a multispectral aerospace image, with the involvement of additional information on the objects' spectral reflectance, is explained by the flowchart in Figure 9.6. Each low-resolution pixel is matched to a certain number k of high-resolution pixels in a multispectral image. Radiometric calibration, atmospheric correction, and conversion into land surface reflectance are performed over the input multispectral image. Also, the SpecLib spectra are recalculated into spectral signatures of the imaging system according to equation (9.10). The resampling of low-resolution bands is performed for corresponding (sub)pixels of high- and low-resolution bands. Next, the spectral signatures are identified in each high-resolution pixel with low-resolution subpixels by one of the known methods. The correction of the radiometric values of low-resolution subpixels is performed by weighting under the constraint (9.11). After correction, the formal spatial resolution of the low-resolution bands becomes equal to the spatial resolution of the high-resolution bands. The described technique engages additional information about the spectral reflectance of scene objects into a multispectral image [Zhukov et al., 2014]. Thus, the overall information content of the enhanced-resolution multispectral image is increased. The performance of the described technique is demonstrated by the spatial resolution enhancement of the six SWIR bands of an ASTER multispectral image of the EOS satellite system from 30 to 15 m. Figure 9.7 plots, at a unified scale, the original color-synthesized ASTER image of the VNIR and SWIR bands, as well as the enhanced-resolution color-synthesized SWIR image.
Figure 9.6 Multispectral imagery resolution enhancement dataflow diagram: the input mixed-resolution multispectral image passes through low-resolution bands resampling, band signatures calculation, band signatures identification, and low-resolution subpixels weighting to the output enhanced-resolution multispectral image; the SpecLib and the imaging system specifications feed the band signatures calculation.
Figure 9.7 ASTER multispectral satellite image (Odessa, Ukraine, 24.09.2007, 9 spectral bands) spatial resolution enhancement based on spectral signatures' identification: (a) – false color original VNIR image (15 m resolution), (b) – false color original SWIR image (30 m resolution), (c) – false color enhanced SWIR image (up to 15 m resolution).
A spectral library with 14 selected reference spectra of vegetation, soil, water surfaces, and man-made covers was used for scene object identification. After processing, some improvement in the details of the Figure 9.7 c image is recognizable in comparison with the Figure 9.7 b one. In particular, the peak signal-to-noise ratio (PSNR) increased from 20.397 to 21.202, and the contrast texture parameter of the grey level co-occurrence matrix (GLCM) increased from 674.0 up to 697.333, which indirectly confirms the spatial resolution enhancement [Aghav & Narkhede, 2017].
9.5 Multispectral Imagery Resolution Enhancement Using Subpixels Values Reallocation According to Land Cover Classes' Topology
To confirm the relevance of the research topic, the following three methods of formal resolution conversion were chosen for comparison [Piestova, 2018]. The first one is a simple division of the pixel into subpixels in accordance with the nearest neighbor rule, whereby the values of the original pixel are stored in the subpixels [Rukundo & Cao, 2012]. The second method calculates the values of the subpixels by a selected type of interpolation based on a certain number of adjacent pixels of the original image [Keys, 1981]. And the third is the superresolution method for multiresolution images with band-independent multispectral pixel geometry (Sen2Res) [Brodu, 2017]. The nearest neighbor method, compared to the reference image in Figure 9.8 a, preserves the radiometric values of subpixels in a pixel but does not change the information content. The bicubic interpolation method, considered in the example of Figure 9.8 b, slightly smoothes the image, which gives
Figure 9.8 The results of the satellite imagery band downsampling by a factor of two: (a) – nearest neighbor resampling, (b) – bicubic interpolation, (c) – Sen2Res method.
the feeling of a more detailed image. Nevertheless, this procedure does not preserve the radiometry within the original pixel: the average value of the subpixels in a pixel differs from the original value of the input image. The Sen2Res method (Figure 9.8 c) performs the task quite successfully: the radiometry in the pixel is preserved, and spatial enhancement of the image occurs. But Sen2Res explicitly encodes geometric details from the available high-resolution bands and saves the spectral content of each low-resolution band regardless of the geometry of the land surface types in that band. It is proposed to use a method [Piestova et al., 2017] based on image segmentation by spectral signatures [Shapiro & Stockman, 2001]; [Lauren & Lee, 2003], engagement of a reference spectra database, and decision-making optimization for subpixels values reallocation. Both the shape of the spectral signatures and the spatial topology relationships for each type of land cover are taken into account. Therefore, the process of determining the types of the Earth's surface according to their spectral characteristics in the studied area is necessary. Fuzzy logic methods are widely used to solve these problems [Pedrycz, 2005]. Fuzzy clustering of images [Hoppner et al., 2000] is based on the use of reference spectra [Zaitseva et al., 2018]. They are collected from publicly accessible spectral libraries in accordance with the physical and geographical characteristics of the region under study [Stankevich & Shklyar, 2006], or can be formed by the expert method from the data of the original image [Piestova et al., 2017]. The spectral signatures of the reference spectra undergo the convolution procedure according to equation (9.10) and the methodology described above. For proper spatial redistribution of subpixels, as well as for decision-making, it is necessary to analyze the topological properties of classes and obtain the class proximity matrix. Unlike Boolean logic, which considers a binary truth level, multiple-valued logic (MVL) is a type of logic in which the truth level can be m-valued or infinite [Miller & Thornton, 2007]. MVL was first developed to solve the paradox of the truth of future claims, but eventually MVL has found its way into such fields as data mining [Wang & Ling, 2012], reliability engineering [Zaitseva et al., 2012]; [Zaitseva & Levashenko, 2017], artificial logic [Moraga et al., 2003] and electrical engineering [Yanushkevich et al., 2005]. In MVL, a mathematical approach is needed to express relationships between input logical values and the result of certain phenomena. The m-valued logic function fm(x1, x2, ..., xn) can be used, which has the following form:
$$f_m : M^n \rightarrow M$$
where n is the number of multiple-valued variables and M = {0, 1, …, m − 1} is the set of defined levels of truth [Miller & Thornton, 2007]. We chose to use MVL for reclassification of surface types depending on their spatial distribution, which is applied in our remote sensing data analysis because this approach allows sufficient and effective reclassification. A significant step of the data analysis is pattern finding, because it is used to decide whether a selected subpixel has to change its class, stay in its actual class, or needs an expert opinion for class changing, according to the number of found patterns. This step is carried out on a matrix that consists of 9 subpixels, with the analyzed subpixel in the middle (Figure 9.9). Five significant types of the relative location of subpixels of one class are defined; their appearance is illustrated in Figure 9.10. These five types represent specific shapes of subpixel groups: all columns and rows – T5 (even those that do not pass through the middle subpixel); diagonals – T2; and compact groups of three subpixels with a specific arrangement: T1 – a right angle at the edges, T3 – a right angle in the middle of the outer columns and rows of the matrix, T4 – a right angle in the middle of the matrix.
Figure 9.9 Schematic representation of the analyzed subpixels on pixel grids of two images with spatial resolutions m and n, having a multiplicity of two.
Figure 9.10 Basic types of subpixels relative location.
The number of occurrences of such types can be defined as input variables for a 3-valued logic function that produces the deciding result for the central subpixel of the matrix about the class change. According to this function, the central subpixel is changed if the function value is 2 and is not changed if the function value is 0; if the value of the function is 1, changing the class of the central subpixel needs additional analysis. The structure functions Tj for class changing of the central subpixel for each basic type, according to the series-parallel systems (Figure 9.11), are presented as:
T1(x) = x5 ∧ ((x1 ∧ x9) ∨ (x3 ∧ x7))    (9.12)

T2(x) = (x1 ∧ x4 ∧ x7) ∨ (x2 ∧ x5 ∧ x8) ∨ (x3 ∧ x6 ∧ x9) ∨ (x1 ∧ x2 ∧ x3) ∨ (x4 ∧ x5 ∧ x6) ∨ (x7 ∧ x8 ∧ x9)    (9.13)

T3(x) = (x2 ∧ ((x1 ∧ x4) ∨ (x3 ∧ x6))) ∨ (x8 ∧ ((x4 ∧ x7) ∨ (x6 ∧ x9)))    (9.14)

T4(x) = x5 ∧ ((x2 ∨ x8) ∧ (x4 ∨ x6))    (9.15)

T5(x) = x5 ∧ ((x1 ∧ (x2 ∨ x4)) ∨ (x3 ∧ (x2 ∨ x6)) ∨ (x7 ∧ (x4 ∨ x8)) ∨ (x9 ∧ (x6 ∨ x8)))    (9.16)

Figure 9.11 The series-parallel system examples.
where xi denotes the presence of the class under study in the i-th subpixel. The values of the types of subpixels' relative location agree with the possible number of occurrences of each type in the considered block. The pixel redistribution procedure takes into account the following topological properties of the analyzed segment:
- Compactness – subpixels of one class are localized in compact inextricable groups.
- Linearity – subpixels of one class are located linearly: horizontally, vertically, or diagonally, and the orientation of these lines can be determined.
- Texture – subpixels of one or several classes are arranged in a checkerboard pattern or close to it, or several different classes fall into the window and no compact or linear structures are defined.
The analyzed pixel is central in the matrix, which consists of the subpixels of a 3×3 sliding window. A sequential check of the classification in the scanning window makes it possible to make the following decisions: 0 – the class of the surrounding subpixels has no effect; 1 – there is a possibility of changing the class, which must be checked using the class similarity matrix; 2 – the central subpixel class will change to the class of the surrounding subpixels [Stankevich et al., 2019]. The diagram in Figure 9.12 describes the logical function used to make the decision on substitution of the class of the central subpixel; the presence of the various types of dependent spatial distribution of subpixels of the same class is checked in the scanning window. Test processing of a Sentinel-2 multispectral satellite image (27.08.2016, the territory of Kyiv, Ukraine) was carried out according to the described method. Four reference spectral bands with a spatial resolution of 10 m were involved, and six spectral bands with a spatial resolution of 20 m underwent the resolution enhancement procedure. The reference spectra were converted into spectral signatures with ten values. Logical clustering was carried out on the main types of land cover in the studied region. Figure 9.13 illustrates the resolution improvement technique, which indicates approximately 16% spatial resolution enhancement. Spatial topology consideration is especially efficient for certain land cover types and is capable of providing significant superiority in resolution enhancement over the average one.
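The structure functions (9.12)–(9.16) are straightforward to evaluate on a 3×3 window; the sketch below mirrors the printed equations directly (the binary class mask and the example window are assumptions of this illustration, not the authors' software).

```python
# Evaluate the structure functions (9.12)-(9.16) on a 3x3 window whose cells
# x1..x9 (row by row) mark membership in the class under test.
import numpy as np

def structure_functions(w):
    x1, x2, x3, x4, x5, x6, x7, x8, x9 = w.flatten().astype(bool)
    t1 = x5 and ((x1 and x9) or (x3 and x7))
    t2 = ((x1 and x4 and x7) or (x2 and x5 and x8) or (x3 and x6 and x9) or
          (x1 and x2 and x3) or (x4 and x5 and x6) or (x7 and x8 and x9))
    t3 = ((x2 and ((x1 and x4) or (x3 and x6))) or
          (x8 and ((x4 and x7) or (x6 and x9))))
    t4 = x5 and ((x2 or x8) and (x4 or x6))
    t5 = x5 and ((x1 and (x2 or x4)) or (x3 and (x2 or x6)) or
                 (x7 and (x4 or x8)) or (x9 and (x6 or x8)))
    return t1, t2, t3, t4, t5

window = np.array([[1, 1, 1],
                   [0, 1, 0],
                   [0, 0, 1]])
print(structure_functions(window))   # which patterns the class forms here
```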
Figure 9.12 Selection of subpixel reallocating rule.
Figure 9.13 Sentinel-2 MSI fragment with false color combination of band 5 (705 nm), band 6 (740 nm), and band 11 (1610 nm): (a) – 20 m resolution input, (b) – after resolution enhancement.
9.6 Remote Sensing Longwave Infrared Data Spatial Resolution Enhancement
The radiance in the wavelength range 3–14 μm corresponds to heat emission; thus this part of the electromagnetic spectrum is called LWIR or
TIR [Kuenzer & Dech, 2013]. Thermal emission is caused by the excitation of substance particles during collisions in the process of thermal motion or by accelerated charge motion (oscillations of the crystal structure ions, thermal motion of free electrons, etc.). It occurs at all temperatures and is inherent in all bodies. A peculiarity of thermal radiance is its continuous spectrum. The intensity of the radiation and its spectral composition depend on the body temperature and its emissivity, which are common surface features determined using remote sensing systems [Becker & Li, 1995]. For determining the temperature of an object's surface, the inverse Planck law for a "grey body" is applied, expressed through the LWIR radiance [Tang & Li, 2014]:
$$T = \frac{c_2}{\lambda \ln\!\left(\dfrac{\varepsilon(\lambda)\, c_1}{\lambda^5 L_s} + 1\right)} \qquad (9.17)$$
where Ls is the spectral radiance of the land surface in the LWIR spectral band, ε(λ) is the spectral emissivity, c1 = 2hc² = 1.191·10⁻¹⁶ W·m² and c2 = 1.439·10⁻² m·K are the first and second Planck law constants, h = 6.626·10⁻³⁴ J·s is the Planck constant, c = 2.998·10⁸ m/s is the speed of light in vacuum, and λ is the radiance wavelength. Earth surface radiance data are provided by different remote sensing means that can be divided into four types: space-borne remote sensing systems (ASTER, TM, ETM+, MODIS, TIRS satellite sensors); airborne systems (TABI-1800, TASI-600); compact cameras mounted on unmanned aerial vehicles (the FLIR systems line); and hand-held infrared systems (pyrometers, thermal imagers). Depending on the scale, all of these systems are widely applied to estimate surface temperature and/or emissivity. Space-borne and airborne remote sensing data are most commonly represented as georeferenced multispectral (multiband) or panchromatic (one-band) imagery in GeoTIFF format [Qu et al., 2007]. Currently operating space-borne remote sensing systems providing renewed data include the following: TIRS (Landsat-8 satellite), AVHRR (NOAA satellites), MODIS (TERRA), and the ECOSTRESS sensor onboard the International Space Station. TM, ETM+ (the currently non-functioning Landsat-4, 5, 7), ASTER, and BIRD satellite data, and more, can be acquired as an archive.
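A direct numeric check of (9.17) can be written as a small helper (an assumed standalone function, not the authors' software); with the constants above, a radiance of about 9.5 × 10⁶ W·m⁻²·sr⁻¹·m⁻¹ at 10.9 μm and ε = 0.96 yields a temperature near 302 K.

```python
# Land surface temperature from at-surface spectral radiance, eq. (9.17).
import math

C1 = 1.191e-16   # 2*h*c**2, W*m^2
C2 = 1.439e-2    # second Planck constant, m*K

def surface_temperature(L_s, wavelength, emissivity):
    """Inverse Planck law for a grey body, eq. (9.17); SI units throughout."""
    return C2 / (wavelength * math.log(emissivity * C1 /
                                       (wavelength**5 * L_s) + 1.0))

# Example at 10.9 um; L_s in W / (m^2 * sr * m)
print(surface_temperature(L_s=9.5e6, wavelength=10.9e-6, emissivity=0.96))
```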
Considering satellite remote sensing data, the most commonly used are those obtained by the MODIS sensor onboard the EOS series remote sensing satellites, commissioned in 2000-2002. This sensor provides 16-band VNIR, SWIR, and TIR data. Within the 3-14 μm wavelength range, MODIS acquires data with a spatial resolution of 1 km and a revisit period of 8 days. The products related to infrared data processing are MOD11 (land surface temperature and emissivity) and MOD14 (thermal anomalies and fires). Also, based on MODIS data, surface water temperatures are mapped using shortwave (3.9-4 μm) and longwave (11-12 μm) infrared data. MODIS space imagery is also used in the Fire Information for Resource Management System (FIRMS), which combines MOD14 with data of the VIIRS visible and near-infrared sensor onboard the Suomi National Polar-orbiting Partnership (Suomi-NPP) satellite; retaining a revisit period of 12 hours, it provides information on active fires with a spatial resolution of 375 m [Schroeder, 2015]. Infrared remote sensing has become useful in many scientific and applied kinds of research. Such systems are actively used to monitor the Earth's surface heat radiance and to detect forest fires, in particular the EOS, BIRD, NOAA, and Himawari-8 satellites [Na et al., 2018]; [Calle et al., 2005]. Landsat series satellites are utilized for long-term volcano activity monitoring [Andres & Rose, 1995]; [Mia et al., 2017] and peat fires monitoring [Elvidge et al., 2015]. Satellite remote sensing data has some application restrictions: the inability to gather surface data through clouds, a revisit frequency that is proportional to the scanning bandwidth, atmospheric spectral transparency windows (Figure 9.14), and the atmosphere gas composition, which brings the necessity of atmospheric correction of the input radiance data.
Figure 9.14 Thermal infrared atmospheric windows (3-5 μm – mid-wave infrared range, 8-14 μm – longwave infrared range).
For the most efficient and precise land surface temperature (LST) estimation, it is necessary to consider the influence of the atmospheric gas composition on the radiance, since the atmosphere contains a large number of gas and aerosol fractions capable of absorbing, scattering, and reflecting electromagnetic radiation depending on its wavelength. The generalized atmospheric model used to correct the atmosphere's influence on radiance requires the determination of three parameters [Sobrino et al., 2004]:
- the longwave spectral upwelling irradiance: the rate at which radiant energy, at wavelengths corresponding to the sensor's spectral response, is emitted upwards into a radiation field and transferred across a surface area (real or imaginary) in a hemisphere of directions;
- the broadband downwelling irradiance: the total diffuse and direct radiant energy, at wavelengths corresponding to the sensor's spectral response, that is emitted downwards;
- the broadband atmospheric spectral transmittance at wavelengths corresponding to the sensor's spectral response.
All these parameters enter the equation of radiation transfer in the atmosphere:
Lλ = τ(λ)·[ε(λ)·Ls + (1 − ε(λ))·L↓λ] + L↑λ  (9.18)
where Lλ is the spectral radiance at the upper boundary of the atmosphere in the spectral band λ, L↑λ is the spectral upwelling irradiance in the spectral band λ, L↓λ is the longwave broadband downwelling irradiance in the spectral band λ, τ(λ) is the atmospheric spectral transmittance in the spectral band λ, and ε(λ) is the spectral emissivity in the spectral band λ. According to equation (9.18), the at-surface spectral radiance is expressed as:
Ls = (Lλ − L↑λ) / (ε(λ)·τ(λ)) − ((1 − ε(λ)) / ε(λ))·L↓λ  (9.19)
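A minimal sketch of applying equation (9.19) and then inverting Planck's law to obtain the surface temperature is given below. The atmospheric parameters τ(λ), L↑λ and L↓λ are assumed to be supplied externally (e.g., by an atmospheric model), and all radiances are assumed to be in per-meter spectral units consistent with the constants above:

```python
import numpy as np

C1 = 1.191e-16  # W*m^2
C2 = 1.439e-2   # m*K

def at_surface_radiance(l_toa, tau, l_up, l_down, eps):
    """At-surface radiance from at-sensor radiance, eq. (9.19)."""
    return (l_toa - l_up) / (eps * tau) - (1.0 - eps) / eps * l_down

def radiance_to_temperature(l_s, wavelength_m):
    """Surface temperature (K) by analytical inversion of Planck's law."""
    return C2 / (wavelength_m * np.log(C1 / (wavelength_m ** 5 * l_s) + 1.0))
```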
Typically the resulting temperature/emissivity images suffer from a lack of detail because of the lower spatial resolution relative to remote sensing data in the VNIR spectral bands. This complexity arises from physical restrictions caused by the decreasing energy of quanta with increasing wavelength. Free-access LWIR radiance, emissivity, and LST data are provided with a spatial resolution of about 1 km (EOS, NOAA satellites, etc.). For detailed mapping of land surface thermodynamic features this data is unsuitable because of the lack of resolution, and it can be used at
regional and global scales for the purposes described above. Over the whole history of remote sensing, several satellite systems have provided LWIR radiance data with a spatial resolution better than 100 m: ASTER with a five-band LWIR sensor of 90 m spatial resolution, and the Landsat series satellites' LWIR sensors that have provided 1-2 bands of data with 60-100 m spatial resolution starting from 1984. In June 2013, after 29 years of continuous operation, the Landsat-5 satellite was decommissioned; thus Landsat-8 with the two-band TIRS sensor of 100 m spatial resolution, interpolated down to 30 m, remains for this time the only satellite that provides free-access LWIR imagery of intermediate spatial resolution, appropriate for relatively detailed LST mapping. VNIR radiance, in contrast, has much higher quantum energy, which allows higher imager sensitivity and higher spatial resolution. The current Sentinel-2 satellite operated by the European Space Agency (ESA) also provides free-access data with a spatial resolution of up to 10 m. A comparison between a 10 m visible blue band Sentinel-2 image and 100 m Landsat-8 TIRS data (10 µm longwave infrared band) is given in Figure 9.15. For research of territories with high land surface type heterogeneity, like urbanized areas, the Landsat-8 data require spatial resolution enhancement to account for the influence of terrain details that are below the spatial resolution. There are many techniques of imagery spatial resolution enhancement. The newest approaches combine two main data processing strategies: the first is fusing the coarse spatial resolution LST with its fine-resolution affecting factors using an empirical relationship extracted by statistical algorithms, while the second is to enhance the spatial resolution of the retrieving elements of LST (e.g., thermal radiance, atmospheric profiles). Such a framework enhances the spatial resolution of the retrieving elements first by superresolution mapping (SRM) and superresolution reconstruction (SRR) and then derives the LST based on the resolution-enhanced elements [Feng & Li, 2020]. Another approach considers obtaining a high spatial resolution image from two or more images of lower spatial resolution that are shifted relative to each other by less than one pixel, using subpixel co-registration. Images with subpixel shift can be formed simultaneously (e.g., using image pairs from different bands) or by capturing images in different periods [Lyalko et al., 2014]; [Stankevich et al., 2016]. This approach can also be combined with emissivity data of high spatial resolution included in the temperature estimation equation (9.17). The main idea of emissivity data application for spatial resolution enhancement consists in the possibility of emissivity
Figure 9.15 Raw satellite images of the central part of Kyiv city (Ukraine): 10 m Sentinel-2 blue band (a) and 100 m Landsat-8 TIRS longwave infrared band (b).
extraction from VNIR data, which typically has a higher spatial resolution, as mentioned earlier [Valor & Caselles, 1996]; [Jiménez-Muñoz et al., 2006]. VNIR data is appropriate for the emissivity mapping task using the normalized difference vegetation index (NDVI) [Kaspersen et al., 2015]. This technique allows separating the emissivity of sparse vegetation and bare soil from the emissivity of other covers and surfaces like sands, water, artificial surfaces, and others. NDVI thresholds perform the separation of vegetation from open soil and other non-vegetation surfaces. The NDVI value between the thresholds NDVI0 for bare soil and NDVI1 for the most sparse vegetation cover corresponds to the projective vegetation cover F within the image pixel and is represented as:
F ≅ ((NDVI − NDVI0) / (NDVI1 − NDVI0))²  (9.20)
Spectral emissivity is then expressed by the relation:
ε(λ) = ε1(λ)·F + ε0(λ)·(1 – F) + Δε(λ)
(9.21)
where ε1(λ) and ε0(λ) are the emissivities of vegetation and open surface, respectively, and Δε(λ) is a correction caused by terrain roughness. For common LWIR satellite data the following values are used: ε1(λ) ≈ 0.985, Δε(λ) ≈ 0.005. The value of ε0(λ) depends on the type of vegetation-free land cover [Jiménez-Muñoz et al., 2006]. The NDVI-emissivity relationship for other surfaces and covers is estimated on the basis of the regression dependence between artificial surface spectra taken from the ASTER Spectral Library, reorganized into the ECOSTRESS Spectral Library in 2018 (http://speclib.jpl.nasa.gov), and NDVI values beyond the vegetation thresholds, considered in equation (9.21). A quasi-optimal spline approximation of the NDVI-emissivity dependence yields the emissivity distribution for non-vegetated areas (Figure 9.16). This approach to emissivity estimation allows obtaining sufficiently detailed data on the thermodynamic features of the landscapes presented by satellite data and allows improving the informativeness of the resulting LST distribution relative to the raw satellite LWIR data (Figure 9.17).
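A sketch of the threshold method of equations (9.20)-(9.21) might look as follows; the threshold values NDVI0 and NDVI1 and the bare-soil emissivity are illustrative placeholders, since the actual values depend on the land cover and spectral band [Jiménez-Muñoz et al., 2006]:

```python
import numpy as np

def emissivity_from_ndvi(ndvi, ndvi0=0.2, ndvi1=0.5,
                         eps_veg=0.985, eps_soil=0.97, d_eps=0.005):
    """Pixel-wise emissivity by the NDVI threshold method."""
    # Projective vegetation cover, eq. (9.20), clipped to [0, 1]
    f = np.clip((ndvi - ndvi0) / (ndvi1 - ndvi0), 0.0, 1.0) ** 2
    # Mixed-pixel emissivity, eq. (9.21)
    eps = eps_veg * f + eps_soil * (1.0 - f) + d_eps
    # Outside the vegetation thresholds a regression against library spectra
    # (e.g., the ECOSTRESS Spectral Library) would be applied instead.
    return eps
```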
Figure 9.16 NDVI-emissivity relationship approximation [Zaitseva et al., 2019].
Figure 9.17 Visual comparison between the resolution of raw Landsat-8 TIRS longwave-infrared data (a) and enhanced LST (b).
Further spatial resolution enhancement can be reached by joint processing of an image pair of the same territory captured simultaneously. The idea of this approach is to extract the particular details from both images that appear due to the subpixel shift between them (the images are shifted relative to each other by less than one pixel) and to merge them into one image of enhanced spatial resolution. The technique requires the transition from the pixel grid of the low spatial resolution images to a joint subpixel grid of enhanced spatial resolution inside the initial images' field of view; the values of the enhanced-resolution pixels are then estimated [Stankevich et al., 2016]. The subpixel shift between the low-resolution images must be either known (in systems designed for subpixel image registration) or estimated. Reference segments with high-contrast edges within the image are required for subpixel shift estimation. Under the linear affine transformation assumption, the shift estimation can be performed using the discrete Fourier transform (DFT) in the frequency domain [Stone et al., 2001], which allows deriving the displacement value and its projections on the horizontal and vertical axes (Figure 9.18). Further subpixel processing includes the following operations [Lyalko et al., 2014]: deriving the joint noise matrix, represented as the difference between the low spatial resolution input images; merging the low-resolution input images into a resampled interlaced image on a high-resolution grid; estimating the subpixel shift; applying the inverse matrix operator to restore the rest of the pixels; and eliminating irregularities and noise using iterative image reconstruction. The restoration of enhanced spatial resolution data envisages the estimation of the subpixel values x_ij, i = 1…nx+p−1, j = 1…ny+q−1, from the known pixel values y_ij of the low-resolution images shifted by fractions 1/p and 1/q in the two directions. The pixel values are expressed through the subpixel shift equation system:
y_ij = ∑_{k=i..i+p−1} ∑_{l=j..j+q−1} x_kl,  i = 1, …, nx,  j = 1, …, ny  (9.22)
Figure 9.18 Spatial resolution enhancement flowchart.
This is a system of nx×ny equations with (nx+p−1)×(ny+q−1) unknowns. A solution of the system (9.22) can be estimated by the recurrent equation:
x_ij = y_(i−p+1)(j−q+1) − ∑_{k=i−p+1..i−1} ∑_{l=j−q+1..j−1} x_kl − ∑_{k=i−p+1..i−1} x_kj − ∑_{l=j−q+1..j−1} x_il,
i = p, …, nx+p−1,  j = q, …, ny+q−1  (9.23)
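Equation (9.23) translates into a simple double loop: each low-resolution pixel equals the sum of its p×q subpixel block, so the last subpixel of a block follows from the block sum minus the already-restored subpixels. A sketch is given below; the boundary seeding shown is an illustrative assumption, since in practice the first p−1 rows and q−1 columns are restored with the inverse matrix operator mentioned above, and noise accumulated along the recursion is suppressed by the iterative reconstruction step:

```python
import numpy as np

def restore_subpixels(y, p, q):
    """Recurrent subpixel restoration per eq. (9.23).
    y: (nx, ny) low-resolution image, each pixel being the sum of a
    p x q subpixel block; returns an (nx+p-1, ny+q-1) array."""
    nx, ny = y.shape
    # Seed with a nearest-neighbour upsample of the mean subpixel value;
    # only the first p-1 rows and q-1 columns of this seed are kept.
    x = np.kron(y, np.ones((p, q)))[: nx + p - 1, : ny + q - 1] / (p * q)
    for i in range(p - 1, nx + p - 1):
        for j in range(q - 1, ny + q - 1):
            bi, bj = i - p + 1, j - q + 1  # low-res pixel whose block ends at (i, j)
            x[i, j] = 0.0                  # exclude x_ij itself from the block sum
            x[i, j] = y[bi, bj] - x[bi:i + 1, bj:j + 1].sum()
    return x
```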
The system of equations (9.22) can be expressed in the matrix form:
T1 × X × T2 = Y
(9.24)
where X = {xij} is the matrix of the unknowns with (nx+p−1)×(ny+q−1) dimensions, Y = {yij} is the matrix of observations with nx×ny dimensions, T1 is the p-diagonal matrix of size nx×(nx+p−1) with value 1 on the diagonals, and T2 is the q-diagonal matrix of size (ny+q−1)×ny with value 1 on the diagonals. A generalized matrix representation of the enhanced resolution image can be expressed as:
X = R × Y + ϕ
(9.25)
where X is the image of enhanced spatial resolution expanded as a vector, Y are the low spatial resolution images expanded as vectors, R is the matrix resampling operator, and ϕ is the error vector. A linear estimate of the enhanced spatial resolution image X can be expressed as:
X = cov X × Rᵀ × (R × cov X × Rᵀ + cov ϕ)⁻¹ × (Y − Ym) + Xm  (9.26)
where cov X and cov ϕ are a priori covariance matrices that represent the image of enhanced spatial resolution and the error matrix, and Xm and Ym are the expected values of the pixels of the enhanced image and of the input images, respectively [Lyalko et al., 2014]. Emissivity data obtained using Landsat-8 VNIR imagery are appropriate data for subpixel processing. However, this technique can also be adopted for processing temperature image pairs, despite the significant temperature differences that arise when the data are acquired with a large time difference. For this purpose the temperature images require specific processing in the frequency domain that permits extracting particular frequency components from the images, which can be utilized instead of emissivity images for subpixel restoration. This approach is realized using the DFT, as is the subpixel shift estimation. The DFT transfers data from the spatial domain into the frequency domain. The amplitude spectrum, as the main output, gives detailed information about the frequency components contained in the input temperature image and their direction across the image (Figure 9.19).
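The linear estimation (9.26) maps directly onto a few dense matrix operations; a minimal sketch, assuming the operands are provided as explicit NumPy arrays with the shapes defined above:

```python
import numpy as np

def linear_estimate(y, r, cov_x, cov_e, x_mean, y_mean):
    """Wiener-type linear estimate of the enhanced image, eq. (9.26)."""
    gain = cov_x @ r.T @ np.linalg.inv(r @ cov_x @ r.T + cov_e)
    return gain @ (y - y_mean) + x_mean
```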
Figure 9.19 LST image in the spatial domain (a) and its Fourier amplitude spectrum (b).
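The frequency-domain shift estimation mentioned above [Stone et al., 2001] is commonly implemented as phase correlation: the normalized cross-power spectrum of two shifted images yields a correlation peak at the displacement. The sketch below refines the integer peak to subpixel precision by parabolic interpolation, an illustrative simplification of the cited method (the sign convention depends on which image is taken as the reference):

```python
import numpy as np

def _parabolic_offset(c_minus, c_center, c_plus):
    """Subpixel offset of a parabola vertex fitted through three samples."""
    denom = c_minus - 2.0 * c_center + c_plus
    return 0.0 if denom == 0.0 else 0.5 * (c_minus - c_plus) / denom

def phase_correlation_shift(img_a, img_b):
    """Estimate the (dy, dx) shift between two images of equal shape."""
    h, w = img_a.shape
    cross = np.fft.fft2(img_a) * np.conj(np.fft.fft2(img_b))
    cross /= np.abs(cross) + 1e-12          # normalized cross-power spectrum
    corr = np.fft.ifft2(cross).real
    py, px = np.unravel_index(np.argmax(corr), corr.shape)
    dy = py + _parabolic_offset(corr[(py - 1) % h, px], corr[py, px],
                                corr[(py + 1) % h, px])
    dx = px + _parabolic_offset(corr[py, (px - 1) % w], corr[py, px],
                                corr[py, (px + 1) % w])
    # Wrap to signed shifts (the correlation is circular)
    if dy > h / 2: dy -= h
    if dx > w / 2: dx -= w
    return dy, dx
```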
The DFT is the equivalent of the continuous Fourier transform for signals known only at discrete instants on a rectangular grid (as in the case of 2-D spatial domain data such as digital imagery). The discrete sampling of the DFT output corresponds to the initial data dimensions. The two-dimensional DFT of image data with L×N resolution can be expressed as [Briggs & Henson, 1995]:
F(u, v) = ∑_{i=0..L−1} ∑_{j=0..N−1} f(i, j)·e^(−i2π(ui/L + vj/N))  (9.27)
where f(i, j) is the image of L×N size in the spatial domain and the exponential term is the basis function corresponding to each point of F(u, v) in the frequency domain. The value of each point of F(u, v) is obtained by multiplying the spatial image with the corresponding basis function and summing the result. LST imagery, as well as all digital images in the frequency domain, consists of different spatial frequency components. Low-frequency components are concentrated in the central part of the amplitude spectrum and correspond to extensive homogeneous fields, like lakes, agricultural fields, forests, rivers, etc. An increase of the spatial frequency, approaching the borders of the amplitude spectrum, corresponds to a decrease of the distance between surfaces with different temperatures. The highest frequencies correspond to the edges between the different surfaces represented by the low-frequency components. In equation (9.27), e^(−i2π(ui/L + vj/N)) is the kernel of the DFT, commonly expressed as e^(−i2πωx), where ω is the frequency of the x component. Using the Euler formula this kernel may be expressed as [Briggs & Henson, 1995]:
e±i2πωx = cos(2πωx) ± i sin(2πωx)
(9.28)
The Fourier transform kernel contains complex values that can be separated into real R(u, v) and imaginary I(u, v) matrices; the amplitude spectrum can then be expressed as follows:
A(u, v) = √(R²(u, v) + I²(u, v))  (9.29)
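Equations (9.27)-(9.29) correspond directly to a standard FFT call; a minimal sketch:

```python
import numpy as np

def amplitude_spectrum(img):
    """Amplitude spectrum A(u, v) per eq. (9.29), zero frequency centered."""
    f = np.fft.fftshift(np.fft.fft2(img))
    return np.sqrt(f.real ** 2 + f.imag ** 2)  # equivalently np.abs(f)
```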
LST images in the frequency domain can be subdivided into two specific frequency components, extracted according to the following logic (Figure 9.20): the low-frequency component corresponds to the inertial and smooth heat radiance field, whereas, in contrast, the high-frequency component corresponds to the sharp differences at the edges between different surfaces. Under the hypothesis that strong edges between land surface objects remain unchanged over a long time period, image pairs of high-frequency component distributions become appropriate data for the superresolution technique described above instead of emissivity data, owing to their steadiness despite the time difference.
Figure 9.20 Extraction of low- and high-frequency components of LST data.
Figure 9.21 General flowchart of frequency component merging for LST spatial resolution enhancement.
Figure 9.22 The visual difference between low-resolution LST images (a) and enhanced LST images after high-frequency component subpixel superresolution (b).
Separating the real and imaginary parts of the DFT amplitude spectrum allows representing and processing it like regular imagery data. After the estimation of the high-frequency component distribution, it is merged with the low-frequency component of low spatial resolution into a single LST image of high spatial resolution (Figure 9.21). The inverse fast Fourier transform (IFFT) performs the reverse transition from the frequency domain into the spatial one. In practice, the frequency separation technique applied to already processed LST images allows using LST data without matching the season and time period of acquisition, owing to the described features of the high-frequency components, which makes it more flexible and efficient relative to emissivity-based resolution enhancement. The described approaches combine into a rather efficient two-stage technique for LST imagery resolution enhancement that requires no additional hardware or modifications of the sensing systems. This technique can be adopted for data acquired by any remote sensing systems: satellite sensors, airborne cameras, etc. Figure 9.22 shows the visual difference of LST image fragments before and after the subpixel superresolution processing. The results of spatial resolution enhancement will be quite useful in the analysis of long-term urbanized area development and expansion, including its effect on public health and the environment (ecosystems, natural landscapes and water bodies), and in other remote sensing applications that require detailed thermal monitoring of the Earth's surface.
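The separation and merging logic of Figures 9.20-9.21 can be sketched with a circular low-pass mask in the centered frequency domain; the cutoff radius is an illustrative parameter, and the superresolution of the high-frequency component itself is the subpixel processing described earlier:

```python
import numpy as np

def split_frequency_components(img, cutoff=0.1):
    """Split an LST image into low- and high-frequency components."""
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(h)),
                         np.fft.fftshift(np.fft.fftfreq(w)), indexing="ij")
    mask = np.hypot(fy, fx) <= cutoff          # circular low-pass mask
    spec = np.fft.fftshift(np.fft.fft2(img))
    low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return low, img - low

# After enhancing the high-frequency component by subpixel superresolution,
# the enhanced LST is simply low + high_enhanced (cf. Figure 9.21).
```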
9.7 Issues of Objective Evaluation of Remote Sensing Imagery Actual Spatial Resolution
All results concerning spatial resolution enhancement of remote sensing imagery need to be evaluated and verified. Commonly the actual spatial resolution of remote sensing systems is determined either by land surface resolving-power test targets or by the edge profile curves of digital images [Anikeeva & Kadnichanskiy, 2017]. These scientifically approved approaches allow determining the actual spatial resolution, i.e., the size of the smallest detail that remains distinguishable on the image [Chen et al., 2018]. However, many studies that consider digital imagery processing apply incorrect approaches, e.g., requiring a reference image [Dadras Javan et al., 2019]. Besides, a variety of statistical metrics for digital images are often included [Jagalingam & Hegde, 2015] that are not related to spatial
resolution and in the ideal case describe noise. For example, metrics like the structural similarity index (SSIM) [Zhou et al., 2004] will indicate a decrease of image quality due to the appearance of new small details, which is justified for evaluating image compression algorithms but directly contradicts the sense of spatial resolution enhancement. In our opinion, the classic spatial-frequency analysis remains the most appropriate tool for assessing the actual spatial resolution of remote sensing images, and the MTF remains the main criterion [Eremeyev et al., 2009]. The theoretical MTF of digital optoelectronic systems in the spatial frequency domain is limited by the threshold determined by the sampling element size. As the spatial frequency increases, the modulation decreases, and hence the signal-to-noise ratio. According to information theory, this relation is the universal criterion for potential resolution assessment [Becker & Haala, 2005]. It is very important to combine the methods of classical deterministic spatial-frequency analysis with advanced statistical techniques of digital imagery evaluation. The latter are able to independently determine digital imagery spatial resolution by restoring its PSF based on auto-covariance properties [Grabski, 2008]. However, the development of reliable techniques for evaluating the spatial resolution of remote sensing data is a rather complicated independent problem that goes beyond the scope of this study. In the meantime, it is necessary to apply visual and statistical estimates.
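For illustration, the link between an edge profile and the MTF can be sketched in one dimension: differentiating the edge spread function gives the line spread function, whose normalized Fourier amplitude is the MTF (practical slanted-edge procedures add oversampling and noise handling on top of this):

```python
import numpy as np

def mtf_from_edge(esf):
    """MTF estimate from a sampled 1-D edge spread function."""
    lsf = np.gradient(esf)             # line spread function
    lsf = lsf * np.hanning(lsf.size)   # taper to reduce spectral leakage
    mtf = np.abs(np.fft.rfft(lsf))
    return mtf / mtf[0]                # normalize so that MTF(0) = 1
```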
9.8 Conclusion
Thus, on the basis of the general informativeness concept of multispectral remote sensing imagery, a toolset of techniques for spatial resolution enhancement has been developed. All developed techniques share a common trait, namely, the imputation of some additional information into a multispectral image, taking into account the physical characteristics of the land surface objects and classes. For multispectral imagery, the spatial resolution equalization in different spectral bands is performed involving both the objects' spectral reflectance and the topological properties of the land cover types inside the scene. As for infrared imagery, the proposed longwave-infrared resolution enhancing technique includes two-step processing: high-resolution emissivity extraction and frequency-domain processing of temperature image pairs for subpixel superresolution. This approach allows the
spatial resolution of the resulting land surface temperature data to be greatly increased. This approach is especially relevant for Landsat-legacy infrared imagery processing for detailed long-term land surface mapping and change detection. All presented techniques will be useful in various remote sensing applications as an indispensable step in the preliminary processing of multispectral imagery with inhomogeneous spatial resolution bands. Further research may be focused on improvement of the logic and algorithms, as well as on engaging additional physically based land surface features depending on the type of remote sensing imagery and study area. In addition, the development and implementation of methods for objective quantitative evaluation of remote sensing imagery spatial resolution and informativeness are very important.
References
Aghav A.S., Narkhede N.S. Application-oriented approach to texture feature extraction using grey level co-occurrence matrix (GLCM). International Research Journal of Engineering and Technology, 4, 5, P.3498-3503, 2017.
Alparone L., Baronti S., Garzelli A., Nencini F. A global quality measurement of pan-sharpened multispectral imagery. IEEE Geoscience and Remote Sensing Letters, 1, 4, P.313-317, 2004. DOI: 10.1109/LGRS.2004.836784
Althouse M.L.G., Chang C.-I. Target detection in multispectral images using the spectral co-occurrence matrix and entropy thresholding. Optical Engineering, 34, 7, P.2135-2148, 1995. DOI: 10.1117/12.206579
Andres R.J., Rose W.L. Description of thermal anomalies on 2 active Guatemalan volcanoes using Landsat Thematic Mapper image. Photogrammetric Engineering & Remote Sensing, 61, 6, P.775-778, 1995.
Anikeeva I.A., Kadnichanskiy S.A. Evaluation of the actual resolution of digital aerial and satellite imagery using an edge profile curve (in Russian). Geodesy and Cartography, 78, 6, P.25-36, 2017. DOI: 10.22389/0016-7126-2017-9246-25-36
Arablouei R. Fusing multiple multiband images. Journal of Imaging, 4(10), 118, 2018. DOI: 10.3390/jimaging4100118
Becker F., Li Z.-L. Surface temperature and emissivity at various scales: definition, measurement and related problems. Remote Sensing Reviews, 12, 3-4, P.225-253, 1995. DOI: 10.1080/02757259509532286
Becker S., Haala N. Determination and improvement of spatial resolution for digital aerial images. ISPRS Archives, XXXVI-1/W3, P.49-54, 2005.
Boori M.S., Paringer R., Choudhary K., Kupriyanov A. Comparison of hyperspectral and multispectral imagery to building a spectral library and land cover classification performance. Computer Optics, 42, 6, P.1035-1045, 2018. DOI: 10.18287/2412-6179-2018-42-6-1035-1045
362 Recognition and Perception of Images classification performance. Computer Optics, 42, 6, P.1035-1045, 2018. DOI: 10.18287/2412-6179-2018-42-6-1035-1045 Boreman G.D. Modulation Transfer Function in Optical and Electro-Optical Systems, 110 p, Bellingham: SPIE Press, 2001. DOI 10.1117/3.419857 Bracewell R. Fourier Analysis and Imaging, 701 p, New York: Kluwer Academic / Plenum Publishers, 2003. DOI: 10.1007/978-1-4419-8963-5 Briggs W.L., Henson V.E. The DFT: An Owner’s Manual for the Discrete Fourier Transform. Society for Industrial and Applied Mathematics, Philadelphia, 1995. Brodu N. Super-resolving multiresolution images with band-independent geometry of multispectral pixels. IEEE Transactions on Geoscience and Remote Sensing, 55, 8, P.4610-4617, 2017. DOI: 10.1109/TGRS.2017.2694881 Calle A., Romo A. et al., Analysis of forest fire parameters using BIRD, MODIS & MSG- SEVIRI sensors, in: New Strategies for European Remote Sensing, M. Oluic (Ed.), P. 109-116, Millpress, Rotterdam, 2005. Chen Y., Ming D., Zhao L., Lv B., Zhou K., Qing Y. Review on high spatial resolution remote sensing image segmentation evaluation. Photogrammetric Engineering and Remote Sensing, 84, 10, P.629-646, 2018. DOI: 10.14358/ PERS.84.10.629 Dadras Javan F., Samadzadegan F., Mehravar S., Toosi A. A review on spatial quality assessment methods for evaluation of pan-sharpened satellite imagery. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII (4/W18), P.255-261, 2019. DOI: 10.5194/ isprs-archives-XLII-4-W18-255-2019 Descloux A., Grußmayer K.S., Radenovic A. Parameter-free image resolution estimation based on decorrelation analysis. Nature Methods, 16 (9), P.918-924, 2019. DOI: 10.1038/s41592-019-0515-7 Ehlers M. Multi-image fusion in remote sensing: spatial enhancement vs. spectral characteristics preservation, in: Proceedings of the 4th International Symposium on Visual Computing (ISVC 2008), G. Bebis, R. Boyle, B. P., D. Koracin, P. Remagnino, F. Porikli, J. Peters, J. Klosowski, L. Arns, Y.K. Chun, T.-M. Rhyne, L. Monroe (Eds.), P. 75-84, Berlin: Springer, 2008. DOI: 10.1007/978-3-540-89646-3_8 Elvidge C.D., Zhizhin M., Hsu F.-C. et al., Long-wave infrared identification of smoldering peat fires in Indonesia with nighttime Landsat data. Environmental Research Letters, 10, 6, P. 1-12, 2015. DOI: 10.1088/1748-9326/10/6/065002 Eremeyev V.V., Knyazkov P.A., Moskvitin A.E. Evaluation of the resolution of aerospace imagery using statistical analysis (in Russian). Digital Signal Processing, 10, 3, P.27-30, 2009. Feng X., Li J. Evaluation of the Spatial Pattern of the Resolution-Enhanced Thermal Data for Urban Area. Hindawi Journal of Sensors, Article ID 3427321, 15 p., 2020. DOI: 10.1155/2020/3427321 Fonseca L., Namikawa L., Castejon E., Carvalho L., Pinho C., Pagamisse A. Image fusion for remote sensing applications, in: Image Fusion and Its Applications, Y. Zheng (Ed.), P.153-178, London: IntechOpen, 2011. DOI: 10.5772/22899
Fu W., Ma J., Chen P., Chen F. Remote sensing satellites for Digital Earth, in: Manual of Digital Earth, H. Guo, M.F. Goodchild, A. Annoni (Eds.), P.55-124, Singapore: Springer Nature, 2020. DOI: 10.1007/978-981-32-9915-3_3
Goudail F., Réfrégier P., Delyon G. Bhattacharyya distance as a contrast parameter for statistical processing of noisy optical images. Journal of the Optical Society of America, 21, 7, P.1231-1240, 2004. DOI: 10.1364/JOSAA.21.001231
Grabski V. Simultaneous pixel detection probabilities and spatial resolution estimation of pixelized detectors by means of correlation measurements. Nuclear Instruments and Methods in Physics, 586, 2, P.314-326, 2008. DOI: 10.1016/j.nima.2007.11.053
Hoppner F., Klawonn F., Kruse R., Runkler T. Fuzzy Cluster Analysis, New York, Wiley, 2000.
Jagalingam P., Hegde A.V. A review of quality metrics for fused image. Proceedings of International Conference on Water Resources, Coastal and Ocean Engineering (ICWRCOE 2015), Mangalore: Elsevier, 4, P.133-142, 2015. DOI: 10.1016/j.aqpro.2015.02.019
Jiménez-Muñoz J.C., Sobrino J.A., Gillespie A. et al., Improved land surface emissivities over agricultural areas using ASTER NDVI. Remote Sensing of Environment, 103, 4, P.474-487, 2006. DOI: 10.1016/j.rse.2006.04.012
Kaspersen P.S., Fensholt R., Drews M. Using Landsat vegetation indices to estimate impervious surface fractions for European cities. Remote Sensing, 7, 6, P.8224-8249, 2015. DOI: 10.3390/rs70608224
Kay S.M. Fundamentals of Statistical Signal Processing: Estimation Theory, 596 p, Englewood Cliffs: Prentice-Hall, 1993.
Keys R. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29, 6, P.1153-1160, 1981.
Kononov V.I. Relation of information and probabilistic estimates of imaging systems (in Russian). Optical and Mechanical Industry, 58, 11, P.13-18, 1991.
Kuenzer C., Dech S. (Eds.), Thermal Infrared Remote Sensing. Sensors, Methods, Applications, Springer, 2013. DOI: 10.1007/978-94-007-6639-6
Landgrebe D.A. Signal Theory Methods in Multispectral Remote Sensing, 528 p, Hoboken: Wiley-Interscience, 2003.
Lauren B., Lee L. Perceptual information processing system, Paravue Inc. U.S. Patent Application 10/618,543, filed July 11, 2003.
Li F., Jia X., Fraser D. Superresolution reconstruction of multispectral data for improved image classification. IEEE Geoscience and Remote Sensing Letters, 6, 4, P.689-693, 2009. DOI: 10.1109/LGRS.2009.2023604
Li H. Detection probability estimation method and calculation model of photoelectric detection target in complex background. Optoelectronics and Advanced Materials – Rapid Communications, 12, 1-2, P.18-24, 2018.
Lillesand T.M., Kiefer R.W., Chipman J.W. Remote Sensing and Image Interpretation, 768 p, Hoboken: John Wiley & Sons, 2015.
Liwen Z., Ming S., Na L., Qipeng Z., Xiangzheng C. Research on acquisition probability of the visible light reconnaissance equipment to ground targets. International Journal of Engineering, 29, 4, P.500-504, 2016. DOI: 10.5829/idosi.ije.2016.29.04a.08
Lu D., Weng Q. A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28, 5, P.823-870, 2007. DOI: 10.1080/01431160600746456
Lyalko V.I., Popov M.A., Stankevich S.A., Shklyar S.V., Podorvan V.N., Likholit N.I., Tyagur V.M., Dobrovolska C.V. Subpixel processing of images from the frame infrared camera for future Ukrainian remote sensing satellite system. Proceedings of the 10th International Conference on Digital Technologies, Žilina, Slovakia, P.221-224, 2014. DOI: 10.1109/DT.2014.6868717
Mia M.B., Fujimitsu Y., Nishijima J. Thermal Activity Monitoring of an Active Volcano Using Landsat 8/OLI-TIRS Sensor Images: A Case Study at the Aso Volcanic Area in Southwest Japan. Geosciences, 7, 4, P.118-134, 2017. DOI: 10.3390/geosciences7040118
Miller M.D., Thornton M.A. Multiple Valued Logic: Concepts and Representations. Synthesis Lectures on Digital Circuits and Systems, Morgan and Claypool Publishers, 2007. DOI: 10.2200/S00065ED1V01Y200709DCS012
Moraga C., Trillas E., Guadarrama S. Multiple-Valued Logic and Artificial Intelligence Fundamentals of Fuzzy Control Revisited. Artificial Intelligence Review, 20, 3-4, P.169-197, 2003. DOI: 10.1023/B:AIRE.0000006610.94970.1d
Na L., Zhang J., Bao Y., Bao Y., Na R., Tong S., Si A. Himawari-8 Satellite Based Dynamic Monitoring of Grassland Fire in China-Mongolia Border Regions. Sensors, 18, 1, P.276-290, 2018. DOI: 10.3390/s18010276
Pandit V.R., Bhiwani R.J. Image fusion in remote sensing. International Journal of Computer Applications, 120, 10, P.22-32, 2015. DOI: 10.5120/21263-3846
Pedrycz W. Knowledge Based Clustering. From Data to Information Granules, Hoboken, Wiley, 2005. DOI: 10.1002/0471708607. Ch 13.
Piestova I. Multispectral Remote Sensing Imagery Superresolution: After Resampling. Proceedings of the International Scientific Conference Modern Challenges in Telecommunications, P.336-338, 2018.
Piestova I., Stankevich S., Kostolny J. Multispectral imagery superresolution with logical reallocation of spectra. Proceedings of the International Conference on Information and Digital Technologies (IDT 2017), IEEE, Žilina, P.322-326, 2017. DOI: 10.1109/DT.2017.8024316
Popov M.A., Stankevich S.A. About restoration of the scanning images received onboard a Sich-1M space vehicle by inverse filtering method, Proceedings of the 31st International Symposium on Remote Sensing of Environment, ISPRS, Saint Petersburg, P.488-490, 2005.
Qu J.J., Gao W., Kafatos M., Murphy R.E., Salomonson V.V. (Eds.), Earth Science Satellite Remote Sensing. Vol.2: Data, Computational Processing, and Tools, Springer-Verlag Berlin Heidelberg, 2007. DOI: 10.1007/978-3-540-37294-3
Rukundo O., Cao H. Nearest neighbor value interpolation. International Journal of Advanced Computer Science and Applications, 3, 4, P.25-30, 2012. DOI: 10.14569/IJACSA.2012.030405
Salah M. A survey of modern classification techniques in remote sensing for improved image classification. Journal of Geomatics, 11, 1, P.118-138, 2017.
Scherer R. Computer Vision Methods for Fast Image Classification and Retrieval, 140 p., Cham: Springer Nature, 2020. DOI: 10.1007/978-3-030-12195-2
Schroeder W. (Ed.), Suomi National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite S-NPP/VIIRS. 375 m Active Fire Detection Algorithm. User's Guide. Department of Geographical Sciences, University of Maryland, 2015. DOI: 10.5067/FIRMS/VIIRS/VNP14IMGT.NRT.001
Shapiro L.G., Stockman G.C. Computer Vision, Englewood, Prentice-Hall, 2001.
Sobrino J.A., Jiménez-Muñoz J.C., Paolini L. Land surface temperature retrieval from LANDSAT TM 5. Remote Sensing of Environment, 90, 4, P.434-440, 2004. DOI: 10.1016/j.rse.2004.02.003
Stankevich S.A. Clarification of the known empirical formula for estimation of probability of the correct object interpretation on the aerospace image (in Ukrainian). Proceedings of the Ukrainian Air Force Scientific Center, 7, P.242-246, 2004.
Stankevich S.A., Gerda M.I. Small-size target's automatic detection in multispectral image using equivalence principle. Central European Researchers Journal, 6, 1, 2020.
Stankevich S.A., Ponomarenko S.A., Sobchuk A.M., Sholonik O.V. Estimation of the equivalent spatial resolution of multi-band aerospace imagery (in Ukrainian). Proceedings of the State Aviation Research Institute, 4, 11, P.93-98, 2008.
Stankevich S.A. Probabilistic-frequency evaluation of equivalent spatial resolution for multispectral aerospace images (in Ukrainian). Space Science & Technology, 12, 2-3, P.79-82, 2006a. DOI: 10.15407/knit2006.02.079
Stankevich S.A. Quantitative analysis of informativeness of hyperspectral aerospace imagery in solving thematic tasks of Earth remote sensing (in Ukrainian). Reports of NAS of Ukraine, 10, P.136-139, 2006b.
Stankevich S.A., Shklyar S.V. Advanced algorithm for determination of edge spread function in digital aerospace image (in Russian). Scientific Notes of Taurida National V.I. Vernadsky University, 18(57), 2, P.97-102, 2005.
Stankevich S.A., Shklyar S.V. Land-cover classification on hyperspectral aerospace images by spectral endmembers unmixing. Journal of Automation and Information Sciences, 38, 12, P.31-41, 2006. DOI: 10.1615/JAutomatInfScien.v38.i12.40
Stankevich S.A., Shklyar S.V., Podorvan V.N., Lubskyi M.S. Thermal infrared imagery informativity enhancement using sub-pixel co-registration. Proceedings of the International Conference on Information and Digital Technologies (IDT), Rzeszów, Poland, P.245-248, 2016. DOI: 10.1109/DT.2016.7557181
Stankevich S.A., Sholonik O.V. Adaptive multidimensional probabilistic transform for multispectral digital aerospace imagery (in Russian). Proceedings of the First Scientific Conference Earth and Space Sciences for Society, CASRE NAS of Ukraine, Kiev, P.11-14, 2007.
Stankevich S.A. Statistical approach to determination of threshold modulation of digital aerospace imagery (in Ukrainian). Space Science & Technology, 11, 3-4, P.81-84, 2005. DOI: 10.15407/knit2005.03.081
Stankevich S., Zaitseva E., Piestova I., Rusnak P., Rabcan J. Satellite imagery spectral bands subpixel equalization based on ground classes' topology. Proceedings of the International Conference on Information and Digital Technologies (IDT 2019), IEEE, Žilina, P.442-445, 2019. DOI: 10.1109/DT.2019.8813338
Stone H.S., Orchard M.T., Chang E.-C., Martucci S.A. A fast direct Fourier-based algorithm for subpixel registration of images. IEEE Transactions on Geoscience and Remote Sensing, 39, 10, P.2235-2243, 2001. DOI: 10.1109/36.957286
Tang H., Li Z.-L. Quantitative Remote Sensing in Thermal Infrared: Theory and Applications, Springer-Verlag, Berlin, 2014. DOI: 10.1007/978-3-642-42027-6
Thomas C., Ranchin T., Wald L., Chanussot J. Synthesis of multispectral images to high spatial resolution: A critical review of fusion methods based on remote sensing physics. IEEE Transactions on Geoscience and Remote Sensing, 46, 5, P.1301-1312, 2008. DOI: 10.1109/TGRS.2007.912448
Tominaga S. Object recognition using a multi-spectral imaging system. Machine Graphics & Vision International Journal, 11, 2/3, P.221-240, 2002.
Valor E., Caselles V. Mapping land surface emissivity from NDVI: Application to European, African, and South American areas. Remote Sensing of Environment, 57, 3, P.167-184, 1996. DOI: 10.1016/0034-4257(96)00039-9
Wang X., Ling J. Multiple valued logic approach for matching patient records in multiple databases. Journal of Biomedical Informatics, 45, 2, P.224-230, 2012. DOI: 10.1016/j.jbi.2011.10.009
Wang Z., Ziou D., Armenakis C., Li D., Li Q. A comparative analysis of image fusion methods. IEEE Transactions on Geoscience and Remote Sensing, 43, 6, P.1391-1402, 2005. DOI: 10.1109/TGRS.2005.846874
Wei M., Xing F., You Z. A real-time detection and positioning method for small and weak targets using a 1D morphology-based approach in 2D images. Light: Science & Applications, 7, 18006, 2018. DOI: 10.1038/lsa.2018.6
Xu Q., Zhang Y., Li B. Recent advances in pansharpening and key problems in applications. International Journal of Image and Data Fusion, 5, 3, P.175-195, 2014. DOI: 10.1080/19479832.2014.889227
Yanushkevich S.N., Shmerko V.P., Lyshevski S.E. Logic Design of NanoICSs, CRC Press, Boca Raton, FL, 2005.
Zaitseva E., Levashenko V., Kostolny J. Multi-state system importance analysis based on direct partial logic derivative. Proc. Int. Conf. on Quality, Reliability, Risk, Maintenance, and Safety Engineering, Chengdu, China, P.1514-1519, 2012. DOI: 10.1109/ICQR2MSE.2012.6246513
Zaitseva E., Levashenko V. Reliability analysis of multi-state system with application of multiple-valued logic. International Journal of Quality and Reliability Management, 34, 6, P.862-878, 2017. DOI: 10.1108/IJQRM-06-2016-0081
Zaitseva E., Lubskyi M., Rabčan J. Application of fuzzy filtering for thermal infrared satellite data resolution enhancement. Central European Researchers Journal, 5, 1, P.73-80, 2019.
Zaitseva E., Piestova I., Rabcan J., Rusnak P. Multiple-Valued and Fuzzy Logics Application to Remote Sensing Data Analysis. 26th Telecommunications Forum (TELFOR), Belgrade, Serbia, P.1-4, 2018. DOI: 10.1109/telfor.2018.8612109
Zhou W., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 4, P.600-612, 2004. DOI: 10.1109/TIP.2003.819861
Zhukov B.S., Popov M.A., Stankevich S.A. Multispectral multiresolution image synthesis using library object spectra for regularization (in Russian). Current Problems in Remote Sensing of the Earth from Space, 11, 2, P.50-67, 2014.
10 The Theoretical and Technological Peculiarities of Aerospace Imagery Processing and Interpretation By Means of Artificial Neural Networks
Oleg G. Gvozdev
Moscow State University of Geodesy and Cartography, Department of Information and Measurement Systems, Leading Specialist of the Center for Industry Monitoring Systems and Information Security; Researcher, Research Institute “Aerospace”, Moscow, Russia
Abstract
The chapter analyzes the peculiarities of aerospace imagery (ASI) as well as their influence on the processes of ASI processing and interpretation. The first section envisages the peculiarities of aerospace imagery and the ways of its digital representation. The problems which are to be solved to ensure the successful application of artificial neural networks (ANN) to ASI are pointed out. The second section studies the peculiarities of ASI processing operations. Mainly, it analyzes the process of preparation of the data which are necessary for the interpretation carried out by means of ANN. The third section is dedicated to the methods and ways to build and train the ANN used for ASI interpretation. A comprehensive framework for the building of ANN topologies that allows creating point solutions considering ASI peculiarities is offered in it. The well-known and, in particular, classical methods of image processing which are applied to ASI are collected and envisaged in this section. Keywords: Aerospace imagery, satellite imagery, image interpretation, artificial neural networks, machine learning, semantic segmentation, artificial neural networks topology, application programming interface ergonomics
Email: [email protected]
Terms and Abbreviations
ASI, AeroSpace Imagery, Satellite Imagery is satellite and endoatmospheric aircraft imagery data.
Augmentation is a process of increasing dataset representativeness by the multiple use of every sample of the original set with controlled random distortions applied.
Instance segmentation is a common problem of machine learning application to images. It consists in uniting image points into sets that correspond to specific object instances and determining the classes of these objects.
Georeferencing is a process of mapping between the intrinsic frame of imagery reference and the coordinate frame of the imaged locality.
ANN, Artificial Neural Network is a mathematical model and its software or hardware implementation based on the principles of functioning of biological neural networks.
Image channel/spectral structure is an ordered set of channels that builds up the imagery and includes the measuring and contiguous data. For every channel, a transfer function is determined that describes the connection between the committed electromagnetic radiation and the channel values.
Complexing (as applied to ASI) is a processing operation that supposes the involvement of data obtained at different times and by various means, using diverse representations, and others.
Off-nadir pointing is pointing that deviates from the nadir direction, i.e., from the direction co-directional with the vector of gravitational force at the given point.
ANN model is a machine learning model built around one or several artificial neural networks.
ANN topology is a structure of elementary base blocks that forms the certain structure of an artificial neural network.
ANN method is a method of machine learning based on the application of one or several ANN models to the data.
Instance feature extraction is a common problem of machine learning application that suggests the determination of qualitative and quantitative parameters of the extracted instances.
Orthorectification is a process of correction of geometric distortions caused by the camera station, local topography, optical distortions and other factors, resulting in an image with a uniform scale over the entire area (spatial resolution).
Pansharpening is a process of increasing the spatial resolution of multichannel imagery by means of involving single-channel imagery with a larger spatial resolution.
API, Application Programming Interface is a formalized way (functions, methods, classes, protocols, etc.) to ensure the interaction of computer programs or their components.
Spatial resolution is the correlation of the distance between the centers of two adjacent pixels of a terrain image and the relevant ground distance.
Radiometric resolution, radiometric resolution capability is the range of brightness levels distinguished on the image.
Image labeling is a process of image interpretation and recording of the interpretation results in a machine-readable format for later use as data necessary to train or test models powered by machine learning.
Semantic segmentation is a common problem of machine learning application to imagery, which suggests correlating every image pixel with a class relating to its semantics.
Semantic comparison is a common problem of machine learning application to imagery, which suggests the extraction of significant variations for pairs or series of fused images.
Framework is a platform for building specialized solutions by means of exploiting the extension slots available in it.
10.1 Introduction
One of the determining factors in the successful operations of public institutions, social services and commercial enterprises nowadays is the immediate acquisition of knowledge about the environment. Aerospace imagery interpretation is a rather new channel for obtaining and actualizing such knowledge. The extensive use of multi-rotor aircraft has made a particular contribution to the availability of this knowledge-acquiring method. Moreover, the data of such space programs as Landsat [Landsat Science] [Short, NASA 2011] and MODIS [MODIS Components] [MODIS Data] are available upon the terms of public ownership. Among other data (on a commercial basis), the data of the WorldView-3 and WorldView-4 satellites [DigitalGlobe Satellite imagery] [GeoPortal Directory WorldView-3] [Apollo Mapping WV-3], whose panchromatic channel features a spatial resolution of 31-34 cm/pixel (1.24 m/pixel for 8 multispectral channels), are worthy of special attention.
These and other systems of aerospace survey allow acquiring hundreds and thousands of megapixels a day. Thus, the issue concerning the automation of their processing and interpretation is extremely important. One of the most active research areas today is the automation of imagery processing and interpretation as well as the extraction of knowledge from them. The predominant approach in this area consists in the application of machine learning methods and, mainly, of artificial neural networks. The specifics of aerospace imagery, as well as the peculiarities of its separate classes and the tasks which are solved on it, considerably restrict the direct applicability of general-purpose solutions and often require the development of completely special methods and approaches. The concerned area of expertise is replete with numerous works dedicated to the solving of certain applied problems or to the introduction of new methods. However, the general problematics of aerospace imagery processing and interpretation by means of contemporary ANN methods, as well as the generalization and analysis of the applicability of existing experience and the forming of a comprehensive approach to the building of point solutions, are, in the author's opinion, insufficiently examined.
This work is an attempt to close the gap. It is necessary to consider that this work is not, and makes no claim to be, a textbook or manual for big data processing, machine learning or ANN technologies. It is strictly focused on the peculiarities of aerospace imagery. It is supposed that the reader is familiar with the general principles of big data processing organization, methods of machine learning, and the building and application of ANN models. The first section is dedicated to the survey of the peculiarities of aerospace imagery and the setting of problems whose solution is necessary to ensure the successful application of machine learning methods and, in particular, of artificial neural networks to them. The second section analyses the problems of building the processing pipeline necessary for aerospace imagery processing and points out some procedures specific to it. The third section envisages the methods and strategies to build and train ANN models. The comprehensive framework for building neural network topologies for aerospace imagery processing and interpretation is provided here. Moreover, the authors offer a number of original solutions in addition to the analysis of the applicability of well-known methods.
10.2 Peculiarities of Aerospace Imagery, Ways of its Digital Representation and Tasks Solved on It
The objects of the methods used for processing and interpretation of images obtained through the registration of electromagnetic emission are rectangular rasters with square pixel elements on a uniform grid, featuring one or several spatially associated channels, each of which corresponds to a part of the electromagnetic spectrum, to the value of a mathematical function computed on its basis, or to additional arbitrary information (including imagery interpretation results). Although an arbitrary grid structure and arbitrary shapes of the raster and its pixel elements are theoretically possible, in the practical tasks of digital imagery processing the described variant is the predominant standard representation. Unless otherwise specified, this particular special case is understood to be a raster within the scope of this work. Separate technological processes, which are heavily weighted towards initial image acquisition and its conversion into the standard representation, often use specific raster types. A case in point is the Bayer filter, its modifications and analogs, as well as the corresponding technological process of postprocessing [Bayer, 1976]. A similar situation also holds for aerospace imagery. The processes of initial image processing are not considered within the scope of this work, but attention is given to the artifacts they introduce. According to the suitability for automated and automatic interpretation as well as the controllability of image acquisition conditions, it is possible to distinguish four classes in the set of all images (classification 1):
1. Synthetic images. The imagery of this class is obtained by means of artistic and mathematical methods or their combination. Absolute control of the properties of the resultant image by its creator is peculiar to it, and its interpretation is generally not required due to the availability of the source data. This imagery may be of interest within the scope of this research to deal with the deficiency of training data for ANN models.
2. Images generated for artistic goals. The imaging conditions may be both controlled and natural. This class does not suppose automated interpretation. However, it has found a use for the automation of categorization and
catalogization, search and description in recent years. This class is included for the comprehensive classification and is not regarded as the subject of detailed consideration within the scope of this research.
3. Images acquired to capture the state of physical world objects (e.g., video observations and photojournalism data). The imaging conditions are generally uncontrolled or controlled within very certain limits. This class is an object of automatic analysis commonly used for purposes similar to (2) or for the purpose of extracting knowledge, the specific list of which depends on the imagery semantics.
4. Images acquired for research objectives. They are differentiated by the high level of controllability and reproducibility of conditions, the considerably limited subject area of objects, and the fact that they are created with due regard to the further technological operations of processing. Biomedical data acquired through X-ray photography or by means of tomography, microscopy, ultrasound investigation, etc., as well as data of active sonars, infrared cameras and other nondestructive inspection aids, may be referred to them.
Compatibility with human perception may be the other criterion used for the classification of images (classification 2):
1. Images which are similar to ones available for direct perception by man: optical spectrum; “from eye level” view angle; a wide subject area of objects corresponding to the scale of objects which are available for direct human perception.
2. Images of class (1) differing by the camera angle.
3. Images which are not comparable with the ones common for humans:
3.1. Considerably distinct at the level of captured spectral range or its representation.
3.2. Highly different at the level of subject area of objects (small and large scale objects).
The classifications offered by the authors allow the consideration of ASI within the context of the whole set of images.
The fact that ASI is the product of a complex technological process and serves as initial data for many other scientific, research and production processes (including the tasks of automatic interpretation envisaged within the scope of this work) associates it with class (4) of classification (1). ASI is properly standardized within each class (definition of objective, filming scheme, altitude, etc.). At the same time, ASI belonging to different classes may considerably differ. The subject area of objects on ASI is not limited, and the spectrum of current tasks is rather wide. This is peculiar to class (3) of the same classification. According to classification (2), ASI may be placed in class (2) while reflecting objects customary for man (roads, buildings) in the optical spectrum but from an unusual angle (off-nadir pointing). However, ASI can also combine the properties of classes (3.1) and (3.2), outlining objects unusual for man (e.g., shows of oil and gas within open water areas, or glacial drift) in an unusual spectrum (e.g., the radio-frequency range). Representative ASI examples are shown in Figures 10.1 and 10.2. Thus, it is obvious that ASI is extremely diverse. More importantly, it represents a view that is distinctly different from ordinary human visual experience [Campbell, 2002], thus making the imagery perception not only limited but also deceptive in some cases. In light of this, it may be concluded that the naive adaptation of methods and technologies designed for other, more studied tasks cannot guarantee a positive result for ASI. The analysis of technical peculiarities hereinafter shows that practically all of them require essential rework to consider ASI peculiarities, without which they will be fully invalid.
10.2.1 Peculiarities of the Technological Aerospace Imaging Process
The peculiarities of certain technological processes of ASI imaging are determined by the definition of objectives, the optical system, the structure of the primary measuring transducers, the height of exposure (orbit) and many other factors which, as mentioned above, are not the subject matter of this work. However, the objective of this work requires analyzing and considering the stages of ASI imaging which are common for all technological processes. The evident ASI peculiarity is the off-nadir pointing survey with a deviation not exceeding 30 degrees. Such an aspect angle limits the applicability of models pretrained on other data to ASI.
Figure 10.1 Examples of images acquired by means of aerospace methods. Sources: [Geoserve, Radar Data] [NASA, 2010] [Digital Globe, 2014].
Contrary to the per-frame shooting used in traditional photography, the “push broom” scheme, also known as a scanning line, which suggests obtaining one row of pixels at a time, is widely used in the course of ASI imaging [Earth Observing-1]. Due to the camera movement and rotation, it is necessary to follow a special procedure for the correction of distortions caused by the angular difference of survey between the pixel lines. An example of ASI before and after elimination of the push broom effect is shown in Figure 10.2.
Figure 10.2 Example of image obtained by means of push broom method before and after elimination of distortions. Source: [Zhang, Hu et al., 2015].
geometrical distortions as well as convert the projections (orthorectification) and georeferencing. This conversions cause the accumulation of differences between the geometry of square raster and obtained important image area (Figure 10.3). This results in the occurrence of ASI pixels with no data, which are necessary for putting the image in the processible rectangular form. Raw Image
Figure 10.3 Image before and after orthotransformation.
In the course of development of a pipeline for ASI interpretation, it is necessary to minimize the data loss due to downsampling, quantization errors, etc., and to ensure the proper positioning of objects on the images.
10.2.2 Aerospace Imagery Defects
ASI is fraught with all the problems of surveying in uncontrolled conditions. Such problems include overexposure, underexposure, flares (Figure 10.4), motion blur and many others. Besides the factors which are usual for photographers, ASI is greatly influenced by atmospheric effects (up to the complete unavailability of target objects). The peculiarities of acquiring different image channels may make their pixel-wise correspondence impossible, which results in the occurrence of regular halos around the objects on those images (Figure 10.5). Moreover, fast-moving objects (cars) may be found at different points of the image on the various channels (Figure 10.6). Depending on the availability of images of the type required for the conducted works, images with these defects may be rejected. However, the practically significant cases when it is also necessary to interpret the defective images are not infrequent.
10.2.3 Aerospace Imagery Channel/Spectral Structure
The authors offer the following definition of the image channel/spectral structure.
Figure 10.4 Example of an incorrectly exposed image. Source image on the left, synthesized image on the right.
Figure 10.5 Example of the specific distortions in the case of channel misalignment. Source image on the left, synthesized image on the right.
Figure 10.6 Distortions characteristic of a moving object. Synthesized image.
The image channel/spectral structure is an ordered set of channels, each described by its sequential number (starting from zero) in the raster, or by the number of a single-channel raster in the set, as well as by the transfer function describing the influence of the frequency components of electromagnetic emission on the quantity value in the channel. This definition does not cover the possible occurrence of arbitrary channels in the image; however, it is sufficient for the consideration of this problem. The most common objects of automatic imagery
interpretation are represented by classes 2 and 3 of classification 1 and classes 1 and 2 of classification 2. These possess three channels responsible for broad bands of the optical spectrum: red, blue and green. Rarer cases are represented by medical data (X-ray pictures, ultrasound and tomography findings, etc.), which are assumed to possess one channel. Finally, there are cases of multi- and hyperspectral survey used, for example, in painting restoration [Deborah, 2013]. It is the last variant, i.e., the use of numerous channels, that is peculiar to ASI. The channel/spectral structure of satellite imaging equipment is designed in accordance with the goals of the particular program and the existing restrictions. In aerosurveying, the channel/spectral structure is determined by the chosen equipment which, in its turn, also depends on the goals and restrictions. The optical properties of the channels of one image may differ: the optical axes may be offset or angled from one another, and the properties of the measuring transducers used for the various channels may also differ. This ASI property, too, limits the applicability of pretrained models. A special case is the need to involve images with different channel/spectral structures in the process of learning and interpretation. Hyperspectral images, with their distinctive regular channel/spectral structure embracing many (30-500) consecutive channels with the same frequency step and narrow bandwidth, represent another case worth attention.
10.2.4 Aerospace Imagery Spatial Resolution
The distance between the centers of two adjacent pixels on the ground is one of the agreed-upon definitions of spatial resolution [Lee, Sull, 2019] [Propeller Aero, 2018]. The spatial resolution of advanced satellite images lies within the range of 0.3-30 meters/pixel [Landsat Science] [DigitalGlobe Satellite imagery] [Apollo Mapping Worldview-3 Satellite imagery]. Aerosurveying makes it possible to obtain images of subcentimeter spatial resolution. The image postprocessing pipeline and the position of the imaging equipment relative to the ground surface also affect the spatial resolution. It should be noted that numerous imaging systems feature a combined structure in which channels with different spatial resolutions are involved in the scheme [DigitalGlobe Satellite imagery]. The case when images with a broad range of spatial resolutions must be processed within a single model is of the same special interest as the channel/spectral structure.
10.2.5 Radiometric Resolution of Aerospace Imagery
Radiometric resolution may be defined as the capability of the imaging equipment to differentiate a certain number of radiation intensity levels [Campbell, 2002]. For digital equipment this value is usually expressed as the pixel bit depth; the same term is used for photo and video equipment working in the optical range. The typical values and examples of their application are shown in Table 10.1. In practice, to ensure easy processing by current computing equipment, measurements are represented as unsigned integers of 1, 2 or 4 bytes, or as single-precision floating-point numbers within the range 0-1 or from -1 to 1 [Van der Walt, Schönberger et al., 2014]. The latter is the most relevant for ANN models [Paszke, Gross, Massa, 2019]. Depending on the technological process, the actual range of image values may be substantially narrower than the available range (Figure 10.7); because of this, the SNR (signal/noise ratio) indicator is of great practical importance. At the same time, the RGB color model (8 bit/channel/pixel) is generally used for data visualization. These peculiarities must also be considered when designing ASI processing pipelines.

Table 10.1 Typical values of channel bit depth and usage examples.

Bit | Values | Usage examples
8   | 256    | Ordinary personal computers
9   | 512    |
10  | 1024   | HDMI 1.3
11  | 2048   | WorldView-3/4
12  | 4096   | LandSat-8
13  | 8192   |
14  | 16384  | Canon EOS 5D Mark IV
15  | 32768  |
16  | 65536  | Professional digital image processing software
Figure 10.7 Typical histogram of an image obtained by aerospace methods. Top to bottom: red, blue and green channels.
10.2.6 Aerospace Imagery Data Volumes
The relative "infinity" determined by the geometry of the Earth's surface is an important aspect of aerospace imagery processing that distinguishes it from images which may be directly perceived by humans. The characteristics of images and the estimated dataset volumes under optimal/standard encoding, as well as after pansharpening, for such satellites as Landsat [Landsat Science], GeoEye and WorldView-3/4 [DigitalGlobe Satellite imagery] are presented in Tables 10.2 and 10.3. Calculations of the volume of images covering some cities of the world are presented in Table 10.4. The data in these tables give a sense of the data volumes of aerospace survey: ASI processing systems must be able to store and process datasets of tens of gigabytes (if the data were subject only to preprocessing) and hundreds of gigabytes for pansharpening results. It should be noted that lossless data compression methods are ineffective both for ASI and for any other images of natural origin.
Table 10.2 Characteristics of images formed by the Landsat, GeoEye, WorldView-3/4 satellites.

          | Panchromatic band               | Multispectral bands
Satellite | m/px | bit/px | px/km2          | Count | m/px | bit/px | px/km2
Landsat   | 15   | 12     | 4444.44         | 8     | 30   | 12     | 1111.11
GeoEye-1  | 0.41 | 11     | 5948839.98      | 4     | 1.65 | 11     | 367309.46
WV-3/4    | 0.31 | 11     | 10405827.26     | 8     | 1.24 | 11     | 650364.20
Table 10.3 Estimated dataset volumes under optimal/standard encoding, as well as after pansharpening, for the Landsat, GeoEye and WorldView-3/4 satellites.

Data size per km2, MB:
Satellite | Optimal encoding (source bit/px) | Standard encoding (16 bit/px) | Pansharpened
Landsat   | 0.0191                           | 0.0254                        | 0.0763
GeoEye-1  | 9.7273                           | 14.1489                       | 56.7326
WV-3/4    | 20.4678                          | 29.7713                       | 178.6279
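The figures of Table 10.3 can be reproduced directly from the sensor parameters of Table 10.2. Below is a minimal sketch of this calculation (Python); it assumes that pansharpening stores all bands at the panchromatic resolution in 16-bit encoding, and it expresses sizes in binary megabytes (2^20 bytes), which matches the tabulated values.

# Reproduces the per-km2 estimates of Table 10.3 from Table 10.2.
SENSORS = {
    #           (pan px/km2,  bands, band px/km2, bit/px)
    "Landsat":  (4444.44,     8,     1111.11,     12),
    "GeoEye-1": (5948839.98,  4,     367309.46,   11),
    "WV-3/4":   (10405827.26, 8,     650364.20,   11),
}
MIB = 2 ** 20  # bytes per binary megabyte

for name, (pan_px, n_bands, band_px, bits) in SENSORS.items():
    total_px = pan_px + n_bands * band_px
    optimal = total_px * bits / 8 / MIB        # tight packing, source bit depth
    standard = total_px * 16 / 8 / MIB         # 16 bit/px storage
    # Pansharpening upsamples every band to the panchromatic resolution.
    pansharp = pan_px * (1 + n_bands) * 16 / 8 / MIB
    print(f"{name}: optimal={optimal:.4f} standard={standard:.4f} "
          f"pansharpened={pansharp:.4f} MB/km^2")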
Lossy compression methods, in turn, are in most cases poorly adapted to the complex channel/spectral structure of ASI. Moreover, they perceptibly lower the image quality, which is totally unacceptable for small object recognition and thus limits their applicability. Besides, an uncompressed representation permits software optimizations based on direct projection of files into virtual memory (memory mapping). Monitoring tasks, which require continuous surveying of the same objects and comparative analysis of the observation results, multiply the data volumes in proportion to the survey frequency; the frequency, in its turn, follows from the speed of the observed processes. The technological processes of ASI processing and interpretation generally also require storing numerous intermediate processing results, which further multiplies the data volumes that must be stored.
Table 10.4 Estimated dataset volumes for some big cities of the world for the Landsat, GeoEye and WorldView-3/4 satellites. Raw – volume of initial data under the standard encoding. P/S – volume of the pansharpening result under the standard encoding.

                                           Dataset size, GB
                                        Landsat      GeoEye          WorldView-3/4
City          | Country       | Area, km2 | Raw  | P/S  | Raw   | P/S    | Raw   | P/S
New York City | USA           | 1213.37   | 0.03 | 0.09 | 16.77 | 67.22  | 35.28 | 211.66
Moscow        | Russia        | 2511.00   | 0.06 | 0.19 | 34.70 | 139.12 | 73.00 | 438.02
Tokyo         | Japan         | 2193.00   | 0.05 | 0.16 | 30.30 | 121.50 | 63.76 | 382.55
Mexico City   | Mexico        | 1485.00   | 0.04 | 0.11 | 20.52 | 82.27  | 43.17 | 259.05
London        | Great Britain | 1572.00   | 0.04 | 0.12 | 21.72 | 87.09  | 45.70 | 274.22
10.2.7 Aerospace Imagery Labeling
Machine learning methods based on the principle of supervised training presuppose the availability of a large volume of training data. A value of 1,000 examples per class may serve as an initial estimate of the necessary number of training samples [Warden, 2017]. More precise estimation methods based on the number of variable object properties [Van Smeden, Moons et al., 2019] [Harrell, Lee et al., 1984] [Mitsa, 2019] consistently yield high values. A considerable number of variable factors (distance to the object, lighting, color, etc.) occur even in ordinary photography, and broad spatial coverage is peculiar to ASI: it may span different climatic zones, various types of vegetation, and diverse natural and man-made objects. It is therefore necessary to pass from the number of samples per class to the number of samples per environment condition cluster per class. Köppen's classification of climatic zones [Arnfield, 2020] alone includes 30 zones; this gives a sense of the volume of training data needed to build actual models for the larger part of the land mass. As mentioned earlier, ordinary visual experience and intuitive perception are not only of limited use for ASI (especially outside the optical range) but may prove deceptive. Consequently, manual data labeling requires specialists trained to recognize ASI. Furthermore, the interpretation of ASI, like that of all remotely sensed data, may require validation by ground observations [Schowengerdt, 2007] [Natural Resources Canada, 2019]. Building training samples requires not only visual identification and localization of objects or phenomena by an operator, but also the representation of these data in a machine-readable format; this is labour-intensive, long-term work. Depending on the nature of the recognizable entities or phenomena, the labeling may take raster or vector form: the first assigns one or more classes to each ASI pixel, while the second describes the structure of objects using the primitive components of a vector plot: points, polylines, polygons and others. Together, these factors determine the complexity and labour intensity, and thus the high price, of training data acquisition, particularly when continual replenishment is required.
10.2.8 Limited Availability of Aerospace Imagery
The labeling problem for available images has been pointed out above; however, the data-generating process itself (both from space and from aircraft) is fraught with problems. The first problem is an economic one. For example, ASI obtained from the WorldView-2/3 satellites with a spatial resolution of 0.5 m/pixel is available on a commercial basis [LandInfo]. The approximate costs of acquiring full datasets for some cities of the world (calculated by the authors on the basis of public data [DigitalGlobe Satellite imagery]) are shown in Table 10.5. Besides, obtaining ASI entails numerous legal matters and conflicts of interest, including issues of social and national security, commercial secrets, etc.; their detailed consideration transcends the scope of this work. The second aspect of the problem is the relative rarity of the entities and phenomena of interest to be found on ASI. Air and railway accidents captured on ASI in detail may be cited as an example.
Table 10.5 Estimated costs of datasets for some large cities of the world (WorldView-3/4).

                                            WorldView-3/4, USD
Country       | City          | Area, km2 | Older than 90 days | New
USA           | New York City | 1213.37   | 23054.03           | 35187.73
Russia        | Moscow        | 2511.00   | 47709.00           | 72819.00
Japan         | Tokyo         | 2193.00   | 41667.00           | 63597.00
Mexico        | Mexico City   | 1485.00   | 28215.00           | 43065.00
Great Britain | London        | 1572.00   | 29868.00           | 45588.00

10.2.9 Semantic Features of Aerospace Imagery
Apart from problems of a syntactic nature, which concern ASI representation and composition, ASI also possesses semantic features whose consideration is necessary for the selection and development of interpretation methods.
Class imbalance is a common issue in machine learning. In ASI it manifests itself not only in the ratio of sample counts per class, but also in the ratio of the areas of objects belonging to different classes, and even to one class. Thus, the areas occupied by blocks of flats of various types may differ by a factor of 20, and the total area of blocks of flats is substantially smaller than the area occupied by undeveloped grounds, i.e., forests, fenlands, deserts, etc. Moreover, the saturation of the territory with objects of interest may vary considerably: obviously, there may be more objects of interest within city boundaries than on undeveloped grounds. In recognition tasks, the minimal bounding boxes of objects are usually of interest; in the advanced case, pixel-wise labeling of objects belonging to specific instances is required. At relatively low spatial resolution, ASI may require the identification and localization not only of area objects but also of point and linear objects (e.g., roads and power line poles). Objects on ASI often form complex structures that require correct interpretation (e.g., multi-level junctions (Figure 10.8) or buildings of complex structure (Figure 10.9)). More generally, in some cases certain objects can be interpreted only with consideration of a large spatiotemporal context; moreover, additional features relating to the acquisition parameters, the state of the atmosphere, etc., may be required. A demonstrative example is the classification of backyard territories, which can be distinguished from parklands only by context.
10.2.10 The Tasks Solved by Means of Aerospace Imagery
According to the authors' experience, the whole set of ASI interpretation tasks carried out by means of ANN may be narrowed down to a number of classic tasks:
–– Semantic segmentation – assigning to every pixel the class determined by its semantics (building, road, car, etc.).
–– Instance segmentation – uniting image pixels into sets that correspond to separate instances, as well as determining the classes of these objects.
–– Instance feature extraction – determining the quantitative and qualitative parameters of the extracted instances (automobile body type, number of storeys in a residential building, etc.).
–– Semantic comparison – detecting significant changes in pairs or sequences of compared images. For example, automobiles moving on a road may be considered a non-significant change, but their occurrence on a protected area a significant one.
Moreover, a number of auxiliary processing tasks can be noted, among which the super-resolution task is the most significant in practical terms.

Figure 10.8 Example of aerospace imagery of a multilevel junction: MKAD – Leningrad motorway, Moscow.
10.2.11 Conclusion
Generalizing publicly available and personal experience, the authors point out the following peculiarities of aerospace imagery that collectively make its processing and interpretation different from those of other classes of images:
Figure 10.9 Example of buildings of complex structure. Khodynka Field, Moscow.
1. Off-nadir shooting angle with a deviation of up to ±30 degrees.
2. Large data volumes, both in general and within one image (continuous area, continuous data flow); potentially high speed of data accumulation.
3. Fixation of different regions of the emission spectrum.
4. Uncontrolled imaging conditions: influence of atmospheric effects, clouds, under- and overexposure, flares.
5. A complicated and highly variable preprocessing pipeline that leaves artifacts.
6. Surveying either in a single frame or in scanning (scanning line, push broom) modes.
7. Channel misalignment.
8. Raster zones containing no data (void).
9. Varying information richness; existence of "open spaces".
10. Fast-moving object effects; blooming; time parallax.
11. Differing channel/spectral structures.
12. Different spatial resolution of channels.
13. Various radiometric resolutions and ways of digital representation of measurements; difference between the actual range of values and the range admissible in the encoding format.
14. The necessity to combine (jointly use) images with various channel/spectral structures, different spatial resolutions and other characteristics.
15. The complexity and labour intensity of labeling; the limited applicability of everyday visual experience in the course of labeling.
16. Quantitative class imbalance.
17. Areal class imbalance.
18. The complex spatial structure of recognized objects.
19. The non-locality of recognized instances and phenomena (dependence on a large spatial and temporal context).
20. The presence of point and linear objects alongside areal ones.
21. Limited accessibility of images.
10.3 Aerospace Imagery Preprocessing
The performance of machine learning models rests on the correct implementation of numerous procedures of data pre- and postprocessing, the preparation of training datasets, and the provision of their storage and of effective access to them. ANN models for image processing take in rasters on a uniform grid with the number of channels set at the topology building stage. Typically, these models represent measurements as single-precision floating-point numbers. Nowadays, efficiency sufficient for applied tasks is ensured by GPGPU technologies, in particular CUDA, which permits the use of specialized graphic processors to speed up calculations. This gives rise to the following limiting factor: the volume of RAM on board the GPGPU. At the time of writing, the RAM of graphic processors was 24 GB for the consumer segment (nVidia Titan RTX [NVIDIA TITAN RTX]) and 32 GB for the professional segment (nVidia Tesla V100 [NVIDIA TESLA V100]). Due to the peculiarities of the stochastic gradient descent algorithm and the memory implementation of contemporary graphic processors, ANN model training requires storing not only the initial data and tunable parameters, but also gradients, propagated errors and momentum.
This leads to rapid growth of the memory footprint as both the ANN model size and the processed image resolution grow. These issues are considered in depth in the following works: [Deep Learning for Computer Vision, 2016] [Hanlon, 2017] [Bulatov, 2018] [Kasper Fredenslund]. In practice, the size of the segment processed at a time depends on the complexity of the ANN model topology, the volume of available RAM and the programming library used (PyTorch, TensorFlow, etc.), falling within the range of 512×512 to 2048×2048 (in the case of three color channels (RGB)). This is orders of magnitude less than the practically meaningful ASI size. Thus, the high-priority ASI preprocessing task required for further processing by ANN is decomposition into fragments suitable for the graphic processor, together with the subsequent assembly of the results. This produces another problem, known as boundary effects (edge effects, border effects): null values near the image edges introduce distortions into the results of any window methods, including those used in ANN models. Below we consider methods and techniques permitting compensation of these and other limitations indicated above. The authors assume that the processed ASI has passed all stages of preprocessing and standardization specific to the technology of its acquisition.
10.3.1 Technological Stack of Aerospace Imagery Processing
Currently, there is a great variety of software for digital signal and image processing. The authors endeavored to use only open-source software and programming libraries, paying close attention to their technological and mathematical correctness, performance, usability and interoperability when making the choice. The authors formed the following technological stack for working with ANN and ASI:
–– CPython 3.6+ [Python] – underlying programming language.
–– NumPy [NumPy] – library for processing multidimensional numeric arrays.
–– SciPy [SciPy] – library of commonly used mathematical operations on numeric arrays.
–– scikit-image [Scikit-image] – processing of general-purpose rasters.
–– RasterIO [Rasterio] – processing of raster representations of spatial data.
–– matplotlib [Matplotlib] – scientific visualization.
–– PySpectral [Spectral Python] – work with multispectral and hyperspectral images.
–– colorcet [Colorcet] – color spaces for visualization.
–– ray [Moritz, Nishihara, Wang et al., 2018] [Ray Project] – library for parallelizing calculations.
–– numba [Numba] – library for speeding up computationally intensive Python code.
–– pyproj [Pyproj] – library for projection conversion.
–– PyTorch [PyTorch] – library for processing multidimensional numeric arrays and for building and training ANN models with the use of graphic accelerators (GPGPU, CUDA).
–– cairo [Cairo] – library for vector graphics rasterization.
In rare cases, such programming languages as Cython, C, C++ and Rust are used for implementing computationally intensive algorithms, with the cffi library used to integrate such procedures into Python programs. This list is the authors' subjective choice; it by no means diminishes the achievements and qualities of other solutions and is presented exclusively for reference. The methods, techniques and algorithms described in this work may equally be implemented on the basis of other software. However, when implementing them with other basic technologies it is necessary to consider the differences in their architecture and implementation.
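To illustrate how these components interoperate, a minimal sketch of loading a raster with RasterIO and handing it to PyTorch follows; the file name "scene.tiff" is hypothetical, and the sketch omits all ASI-specific preprocessing described later.

# Minimal interoperability sketch: rasterio -> numpy -> torch.
import numpy as np
import rasterio
import torch

with rasterio.open("scene.tiff") as src:          # hypothetical file name
    bands = src.read()            # (channels, height, width) numpy array
    transform = src.transform     # affine georeferencing transform

# Convert to single-precision floats, the representation ANN models expect.
tensor = torch.from_numpy(bands.astype(np.float32))
batch = tensor.unsqueeze(0)       # add the batch dimension: (1, C, H, W)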
10.3.2 Structuring and Accessing Aerospace Datasets
The practical building of machine learning models is impossible without standardizing datasets and providing access to them. ASI is characterized not only by numerous representation formats for single measurements, rasters and channel sets, but also by numerous ways of arranging collections of aerospace datasets. Moreover, work with ASI often requires combining heterogeneous data, such as images made in different spectral bands, data on the
state of the atmosphere (wind, cloudage, etc.), vector cartographic materials, and vector and raster labeling. These data may be provided by various suppliers, and some are obtained specially (by order or on one's own). The enumerated factors create conditions in which research cannot be carried out without solving the problem of data access unification. The obvious solution is to develop a custom data storage scheme designed for the particular task. For ASI this approach may require an unreasonably large volume of data movement and conversion, additionally complicated by constant dataset replenishment. Another "ingenuous" solution is to adapt the preprocessing procedures to the specific data types and representations. For ASI this approach requires implementing, testing, debugging and supporting numerous trivial procedures that differ only slightly from each other, and the appearance of new data sources, or the substitution of those already involved, results in considerable additional labour costs. For these reasons the authors recommend attending to the building of abstractions between the ASI processing procedures and the means ensuring its storage and cataloguing at the early stages of work. As a flexible and at the same time simple-to-implement solution, the authors offer the conception of "index files" describing the correspondence between the identification and cataloguing system accepted in the project and the particular elements of the dataset. The authors selected the YAML serialization format [YAML] as a compromise between human and machine readability; however, it is equally possible to use JSON, XML and others. The data model applied by the authors includes the following hierarchy of entities:
–– dataset – the set of data,
–– item – an element of the dataset,
–– channel – a channel of a dataset element.
Every object type is provided with an identifier that is unique within its hierarchy level, and with a metadata set specific to the project tasks. The following listing introduces an example of a dataset definition in the notation used by the authors. File paths are stated relative to the location of the index file:
dataset:
  id: SOME DATASET
  meta:
    source: some satellite
  items:
    - id: 2019-03-01-00x00
      meta:
        time: 2019-03-01 00:00:00
      channels:
        - id: red
          meta:
            type: image
            wavelength: 680nm
            width: 4nm
          source:
            driver: filesystem
            format: tiff
            path: ./data/src/2019-03-01-00x00/band-680.tiff
        - id: blue
          meta:
            type: image
            wavelength: 380nm
            width: 4nm
          source:
            driver: filesystem
            format: tiff
            path: ./data/src/2019-03-01-00x00/band-380.tiff
          resample:
            factor: 2
            method: bicubic
        - id: labels
          meta:
            type: labeling
          source:
            driver: filesystem
            format: tiff
            path: ./data/labeling/2019-03-01/00x00.tiff

As seen from the provided listing, it is possible to indicate data conversion parameters permitting the unification of channels, in addition to their actual location. As such conversions are potentially time-consuming, the authors recommend caching their results. This approach is supported by a small set of programming libraries; it is not standardized, but rather created to ensure easy adaptation to the realities of new projects.
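A sketch of consuming such an index file is given below, assuming PyYAML is installed; the loader name, the resolution logic and the file name "dataset.yaml" are hypothetical illustrations of the convention stated above, not the authors' published code.

# Hypothetical loader for the index files described above (PyYAML).
import os
import yaml

def load_index(index_path):
    """Parse an index file and resolve channel paths relative to it."""
    base_dir = os.path.dirname(os.path.abspath(index_path))
    with open(index_path) as f:
        index = yaml.safe_load(f)
    for item in index["dataset"]["items"]:
        for channel in item["channels"]:
            source = channel["source"]
            if source.get("driver") == "filesystem":
                # Paths are stated relative to the index file location.
                source["path"] = os.path.normpath(
                    os.path.join(base_dir, source["path"]))
    return index

index = load_index("dataset.yaml")
for item in index["dataset"]["items"]:
    print(item["id"], [ch["id"] for ch in item["channels"]])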
10.3.3 Standardization of Measurements Representation
On current general-purpose computing equipment, images are predominantly represented by 1 byte (8-bit unsigned integer) per channel, with 3 (RGB) or 4 (RGBA) channels. Two bytes (16-bit unsigned integer) per channel, as well as single- and double-precision floating-point numbers (4 and 8 bytes) within the range 0-1 or from -1 to 1, are used far more rarely. The basic contemporary libraries for implementing ANN models are oriented toward single-precision floating-point numbers, which removes any variability in this regard [PyTorch] [TensorFlow]. This representation is more than sufficient for most conversion algorithms, which allows the accumulation of resampling errors to be avoided. Obviously, the smallest variant that does not cause the accumulation of additional errors is optimal for storage. The problem, however, is that the actual range of values peculiar to ASI differs from that of the chosen encoding format. Note that the value '0' is typically reserved for image areas not occupied by data and should not be used for other purposes; representations based on floating-point numbers may use either '0' or the specially reserved 'NaN' value for this purpose. Both cases require accurate and thorough accounting at all processing stages. The obvious, and generally invalid, approach to normalizing the valuable part of an image is to use its minimal and maximal values; this approach fails because of outliers on the image resulting, for example, from under- and overexposure. A separate issue is the nonlinearities arising at all stages, from measurement and intermediate conversions to displaying images on screen or printing. This problem is considered in numerous works on remote sensing and on the synthesis, storage and transmission of images [Earth Observing-1] [Humboldt State Geospatial online, 2019] [Poynton, 2003] [GammaFQA] [Sachs, 2003] [McKesson, Jason, 2013].
Ideally, the reference data are properly linearized. In this case, converting the range of values is sufficient to prepare them for further processing by an ANN model, while visualization additionally requires gamma correction. These conversions may be recorded in the form of the following dependence:
\[
D_{c,i,j} =
\begin{cases}
0, & S_{c,i,j} = 0 \\
e, & 0 < S_{c,i,j} \le Q_{min} \\
M, & S_{c,i,j} \ge Q_{max} \\
\left( \dfrac{S_{c,i,j} - Q_{min}}{Q_{max} - Q_{min}} \right)^{g} (M - e) + e, & Q_{min} < S_{c,i,j} < Q_{max}
\end{cases}
\]
where D is the array of result pixels, S is the array of source image pixels, e is the minimal value representable in the destination format, M is the maximal value representable in the destination format, g is the gamma correction coefficient, and Qmin and Qmax are the values of the percentiles qmin and qmax calculated over all values of S (excluding 0). The standard parameters of the formula for a destination format of 8-bit unsigned integers are:
\[
q_{min} = 2, \quad q_{max} = 98, \quad e = 1, \quad M = 255, \quad g = \frac{1}{2.2} \approx 0.4545
\]
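A NumPy sketch of this conversion for an 8-bit destination is given below. Two assumptions are made explicit: the denominator is reconstructed as Qmax − Qmin (as in the formula above), and percentiles are computed over nonzero pixels only, with zero preserved as the no-data marker.

# Percentile stretch with gamma correction, per the dependence above.
# Assumes a single-channel float array in which 0 marks "no data".
import numpy as np

def normalize_band(s, q_min=2, q_max=98, e=1, m=255, g=1 / 2.2):
    valid = s != 0
    lo, hi = np.percentile(s[valid], [q_min, q_max])
    t = np.clip((s - lo) / (hi - lo), 0.0, 1.0)   # branch clamps to e and M
    d = np.rint((t ** g) * (m - e) + e).astype(np.uint8)
    d[~valid] = 0                                 # keep the no-data marker
    return d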
An example of image histograms before and after processing is shown in Figure 10.10. The histogram of the image converted by the offered method demonstrates the effects introduced by round-off errors. The authors observed no degradation of ANN model performance when analyzing data with such defects, nor any considerable impact on visualization.
Figure 10.10 Histogram of an image before (left) and after (right) processing by the offered method. Top to bottom: red, blue and green channels.
If it is necessary to eliminate these effects, different methods are required. ANN models proved to be tolerant of correctly or incorrectly linearized data as long as the distortions are homogeneous within the dataset. If the dataset is formed from different sources subject to various distortions, their influence may be compensated at the data augmentation stage of ANN training.
10.3.4 Handling of Random Channel/Spectral Image Structure
The main rule of multichannel image processing is to preserve per-pixel alignment. Moreover, it is necessary to account for the possible existence of different types of channels: apart from channels carrying measurement data, the image may include channels with labeling and other service and support data. The authors maintain these invariants during ASI processing by means of a specialized programming library built around chained calls, the pattern underlying the jQuery API [jQuery] and the SQLAlchemy Query API [SQLAlchemy]. An example of use is shown below, where the CALLABLE_A functor is applied to all channels, CALLABLE_B only to channels carrying image data, and CALLABLE_C to channels carrying labeling data:
(
    Channels()
    .from_files([
        "red.tiff",
        "green.tiff",
        "blue.tiff",
        "labels.tiff",
    ])
    .map(CALLABLE_A)
    .map_normal(CALLABLE_B)
    .map_label(CALLABLE_C)
    .map_normal_only(CALLABLE_B)
    .map_label_only(CALLABLE_C)
)
At the same time, the library checks that the paired map_normal and map_label calls are used together, thereby excluding possible errors; the functions that do not perform such a check, i.e., map_normal_only and map_label_only, are provided for functions handling channels of a single type. The most significant difference between the procedures applied to labeling channels and those applied to channels with measurement data is the use of resampling methods that do not interpolate values and therefore preserve the numeric label encoding (e.g., the nearest neighbor method).
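The library itself is not published as a standard component. A minimal sketch of the chained-call idea is shown below; the internals (a filename-based type heuristic, omission of the pair-checked map_normal/map_label variants) are hypothetical simplifications, not the authors' implementation.

# Hypothetical sketch of the chained-call channel container.
class Channels:
    def __init__(self):
        self.channels = []          # each entry: [array, is_label] pair

    def from_files(self, paths):
        import rasterio
        for path in paths:
            with rasterio.open(path) as src:
                is_label = "label" in path    # naive type detection
                self.channels.append([src.read(1), is_label])
        return self                 # returning self enables chaining

    def map(self, fn):
        for entry in self.channels:
            entry[0] = fn(entry[0])
        return self

    def map_normal_only(self, fn):
        for entry in self.channels:
            if not entry[1]:
                entry[0] = fn(entry[0])
        return self

    def map_label_only(self, fn):
        for entry in self.channels:
            if entry[1]:
                entry[0] = fn(entry[0])
        return self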
10.3.5 Ensuring the Image Sizes Necessary for Processing
Current ANN topologies make active use of downsampling operations. For reasons of performance and differentiability, these operations implement repeated downsampling by a factor of 2. Thus, ANN models are operable only on rasters each side of which is divisible by 2^N, where N is the number of downsampling operations in the ANN topology. For large images processed fragment by fragment, the fragment dimension can be chosen directly as divisible by 2^N. For small images, however, it is best to pad the image along its edges according to the following formulae, applied to every raster dimension:
\[
F = 2^{N} \left\lceil \frac{m}{2^{N}} \right\rceil, \quad
P = F - m, \quad
P_{begin} = \left\lfloor \frac{P}{2} \right\rfloor, \quad
P_{end} = P - P_{begin}
\]
where N is the number of downsampling operations in the ANN topology, m is the raster size along the current dimension, P is the total amount of padding, Pbegin is the amount of padding at the beginning of the current dimension, and Pend is the amount at its end. Obviously, the processing result must undergo the symmetrical reverse conversion. Modern processing tools offer numerous methods of compensating for the missing data [Numpy.pad], among which one can point out the use of constants, of statistical functions of the boundary pixel values, and of different variants of extending the raster content. In the authors' view, for ASI one must take into account that the lack of data in a raster is natural and inevitable in this application environment and is usually encoded with 0 or NaN values; the artificial padding of a raster to the required size is a special case of this problem and must be treated in a similar way. To handle it, the ANN model must be trained to correctly process the boundaries between the data and the regions with no data. The correctness of processing here is determined by the peculiarities of the application environment and is generally interpreted in one of two ways:
1. to predict, as reliably as possible, the occurrence and properties of objects in the no-data region,
2. to consider the region with no data as a separate canvas (background).
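A NumPy sketch of these padding formulae follows, assuming (per the discussion above) that 0 encodes the missing data; the helper name is hypothetical.

# Pad a (C, H, W) raster so H and W are divisible by 2**n_down.
import numpy as np

def pad_for_ann(raster, n_down):
    f = 2 ** n_down
    pads = [(0, 0)]                   # no padding in the channel dimension
    for m in raster.shape[1:]:
        total = f * int(np.ceil(m / f)) - m     # F - m
        begin = total // 2
        pads.append((begin, total - begin))
    # 0 is the natural "no data" code in this application environment.
    return np.pad(raster, pads, mode="constant", constant_values=0), pads

padded, pads = pad_for_ann(np.ones((3, 300, 517), np.float32), n_down=5)
print(padded.shape)    # (3, 320, 544): both sides divisible by 2**5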
10.3.6 Tile-Based Image Processing
The conception of tile-based image processing is not new; it is considered in most works on digital signal and image processing, as well as in works on ANN methods [Ronneberger, Fischer, Brox, 2015] [Tile-based image processing] [Ahmed, Khan, 2013] [Wassenberg, 2011] [Wittenbrink, Somani, 1993]. Because of their large volumes, processing numerous ASI at once is impossible. The straightforward solution is tile-based image processing: decomposing the image into many small fragments, processing the separate fragments with the ANN model, and assembling the result. The apparent pattern usable for this procedure is a "checker pattern" (Figure 10.11). However, cases where the data requiring processing occupy not the whole image but only a part of it are peculiar to ASI, and in such cases this approach is irrational. For preparing training samples the authors worked out a special procedure, analyzed in the next section. The tile-based processing procedures used purely in the course of image interpretation are considered below.
Figure 10.11 Example of fragment placement by means of a simple algorithm (checker pattern). Legend: raster boundaries, void areas on the raster, payload areas on the raster, generated fragments.
Generally, the procedure of tile-based processing for a specified decomposition is as follows:
1. Get the source image S.
2. Create the raster D for accumulating the result.
3. Create the raster A for counting the number of fragments influencing each pixel of raster D.
4. For every fragment e:
4.1. Compute the converted fragment s = f(S[e]).
4.2. Create a raster a of the same dimensions as s, filled with values of 1.0.
4.3. Zero the edge regions of s and a to a width b to minimize edge artifacts.
4.4. Add the values of s to D[e].
4.5. Add the values of a to A[e].
5. Average the contribution of every element in the intersection regions by dividing the values accumulated in D by the values accumulated in A.
Obviously, if the available computational power is sufficient, step 4 may be performed concurrently for different fragments (a code sketch of this accumulate-and-average assembly is given at the end of this section). The variable part of this algorithm is the procedure specifying the tile decomposition. A practical algorithm of such tile-based processing for ASI must:
1. Ensure full coverage of all valuable pixels of the raster.
2. Generate fragments whose extraction and application do not require resampling or other conversions of their data.
3. Consider the existence of image areas without valuable data (filled with 0 or NaN).
4. Ensure the minimization of boundary effects by providing a two-level overlap:
–– excluded from the result,
–– averaged in the aggregation.
5. Minimize the reprocessing of the same image areas.
When these conditions are fulfilled, a tile decomposition may be estimated with the help of the goal function:
\[
f(m, d) = \frac{|m|}{|d|},
\]
where |m| is the number of processed pixels, including the ghost (overlap) pixels, and |d| is the number of valuable pixels on the image. Thus, this function may be interpreted as a measure of the excessive processing entailed by the selected fragmentation. The execution of such work in the distributed environment of a computational cluster requires special attention to the locality of data allocation in memory. The predominant approach to this problem is the use of space-filling curves [Drozdowski, 2009] [Frachtenberg, Schwiegelshohn, 2010], in particular Hilbert [Hilbert, 1891] and Peano curves [Peano, 1890]. For elementary cases the authors use a simpler parametric procedure (Figure 10.12). Its parameters are:
–– fragment size divisible by 2 (in pixels),
–– minimal overlap between the fragments (in pixels).
Operational procedure:
1. Determine the main axis of the data area on the raster.
2. Find the secondary axis (perpendicular to the main one at its center).
3. Define the minimal number of fragments along the main axis, considering the minimal intersection and the indentation along the edges of the data in the raster.
Figure 10.12 Example of fragment placement with the use of the offered algorithm. Legend: raster boundaries, void areas on the raster, payload areas on the raster, generated fragments, axes.
4. Determine the minimal number of fragments along the secondary axis, considering the minimal intersection and the indentation along the edges of the raster.
5. Build the resulting set of fragments.
6. For every group of fragments along the main axis, minimize the step while keeping the indentation along the edges of the data in the raster.
7. For every group of fragments along the secondary axis, minimize the step while keeping the indentation along the edges of the data in the raster.
8. Delete the fragments containing no data.
When ASI contains several unconnected data areas, each connected component is processed separately.
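The accumulate-and-average assembly (steps 1-5 of the procedure above) may be sketched as follows; it is a simplified single-channel illustration, assuming `model` maps a (C, h, w) fragment to an output of the same spatial size, and `fragments` is any iterable of window slices, e.g., produced by the parametric procedure just described.

# Sketch of the accumulate-and-average tile assembly.
import numpy as np

def process_tiles(source, model, fragments, border=16):
    """source: (C, H, W); fragments: iterable of (row_slice, col_slice)."""
    out = np.zeros(source.shape[1:], np.float32)   # result raster D
    acc = np.zeros(source.shape[1:], np.float32)   # contribution counter A
    for rows, cols in fragments:
        s = model(source[:, rows, cols])           # s = f(S[e]), shape (h, w)
        a = np.ones_like(s)
        # Zero the border of width `border` to suppress edge artifacts.
        for arr in (s, a):
            arr[:border, :] = 0; arr[-border:, :] = 0
            arr[:, :border] = 0; arr[:, -border:] = 0
        out[rows, cols] += s
        acc[rows, cols] += a
    # Average overlapping contributions; guard against division by zero.
    return out / np.maximum(acc, 1e-6)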
10.3.7 Design of Training Samples from Aerospace Imagery Sets
The contemporary practice of training ANN models for image interpretation depends to a great extent on the augmentation procedure [Simard, Steinkraus, Platt, 2003] [Wong, Gatt et al., 2016] [Imgaug] [Perez, Wang, 2017]. Augmentation is a procedure of adding controlled distortions to the images; it is built on top of a (pseudo)random generator and repeated many times for every image in the dataset. This procedure permits a considerable increase in the representativeness of the dataset. Additionally, in the course of augmentation the images may be supplied with distortions that simulate the defects of image acquisition and preprocessing, i.e., under- and overexposure, flares, channel misalignment, geometric distortions, etc.
Augmentation is particularly appropriate given the ASI specifics (the limited data availability, the regular presence of defects on the images). Apart from augmentation, the preparation of training samples must solve many other tasks, in particular:
1. ensure the necessary size of the training samples,
2. ensure the necessary number of training samples (by means of augmentation as well),
3. compensate for the areal and quantitative imbalance of objects in the training sample,
4. compensate for the unequal information richness of the images.
The authors offer a special parametric algorithm for preparing training samples from ASI that solves all the specified tasks within one procedure. This is possible thanks to the peculiarities of ASI, mainly the off-nadir survey, which makes image interpretation invariant to rotation and perspective distortions. The functional diagram of the offered procedure is shown in Figure 10.13.

Figure 10.13 Functional diagram of the generation of training and validation datasets from aerospace imagery sets. Blocks: source dataset (overall statistics); source dataset item (bands, labels, expert marks); significance map; candidate quadrangle set; quadrangle subset; sample extraction; sample augmentation; random generator (per-quadrangle operation seed); generated dataset.

This approach simultaneously allows one to:
1. extract the set number of samples of the set size,
2. integrate the geometric distortions (along with the extraction of the sample from the large image),
3. compensate for the problems of empty spaces and class imbalance.
A similar approach was originally offered in [Gvozdev, Murynin, Richter, 2019]; its extended and updated version is presented in this work. Significance map generation is the key stage of the offered procedure. Its basic version takes into account the general statistics of the dataset, the local entropy of the images, the labeling, and auxiliary data on the importance of separate sections. Depending on the specifics of the current task, this subprocedure may be extended with the consideration of other components. The functional diagram of significance map generation is shown in Figure 10.14. Candidate quadrangle set generation is implemented trivially, i.e., through the generation of a multiply excessive number of distorted quadrangles. The distortion set includes rotation, mirroring about one or two axes, and perspective distortions, but is not limited to these. Quadrangle subset selection is a special case of the classical set cover problem [Vazirani, Vijay, 2001].
Figure 10.14 Functional diagram of the significance map generation subprocedure. Blocks: source dataset item (bands data, labels); computed data (local entropy); overall source dataset statistics; significance map fusion; significance map.
The authors found that even a trivial greedy algorithm copes with this problem very well (given further corrections). When the region of a selected quadrangle is excluded from the significance map, not its full area but the central part of a set fraction is used (by default, 80% of the quadrangle). The exclusion is carried out by multiplying the relevant area of the significance map by a coefficient k = [0,1), which allows regulating the number of quadrangles (n ≥ 1) in which this area may participate: k = n^{-1}. The search stops when the set number of selected quadrangles is achieved, or when the total remaining significance falls below a set threshold. Quadrangle sample extraction is also carried out trivially; note that interpolation must not be applied to labeling channels, which carry coded numbers. Sample augmentation depends entirely on the nature of the data and the current task. The authors offer the following, hardly complete, list of distortions:
1. Adding random noise.
2. Adding structured noise (e.g., Perlin noise [Perlin, 1985] [Noise and Turbulence]).
3. Shifting channels against the basic one.
4. Blur.
5. Color correction:
5.1. Shifting the black and white points.
5.2. Gamma correction.
5.3. Change of color temperature.
Depending on the channel/spectral structure and the peculiarities of the dataset, these distortions may be applied both symmetrically (to all channels) and separately to every channel. Moreover, some specific distortions may be applied to multi- and hyperspectral images:
1. Application of transfer functions in the spectral domain.
2. Setting certain channels to zero or substituting them with noise.
3. Channel interchange.
The procedures simulating the defects of the initial data peculiar to the task are to be used at this stage as well. The process of extraction and augmentation of a separate sample is shown in Figure 10.15 and sketched in code below.
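The sketch below illustrates quadrangle extraction with two simple photometric distortions, using scikit-image; the corner ordering, noise and gamma parameters are hypothetical choices, it handles a single-channel float image in [0, 1], and labeling channels would instead use order=0 (nearest neighbor) and skip the photometric part.

# Sketch of per-quadrangle sample extraction with augmentation.
import numpy as np
from skimage import transform

def extract_sample(image, quad_rc, size, rng):
    """quad_rc: four (row, col) corners of the source quadrangle."""
    # Corners of the output square, in (x, y) = (col, row) order.
    dst_xy = np.array([[0, 0], [size - 1, 0],
                       [size - 1, size - 1], [0, size - 1]], float)
    src_xy = np.array([(c, r) for r, c in quad_rc], float)
    tf = transform.ProjectiveTransform()
    tf.estimate(dst_xy, src_xy)      # maps output coords to source coords
    sample = transform.warp(image, tf, output_shape=(size, size), order=1)
    # Photometric augmentation: additive noise and a random gamma shift.
    sample = np.clip(sample + rng.normal(0.0, 0.01, sample.shape), 0, 1)
    return sample ** rng.uniform(0.8, 1.25)

rng = np.random.default_rng(42)
sample = extract_sample(np.random.rand(512, 512),
                        [(10, 20), (30, 400), (420, 380), (390, 15)], 256, rng)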
Figure 10.15 Process of extraction and augmentation of a separate sample from the source image. Blocks: source image; quadrangle; augmentation; random generator (seed); generated image.
The same procedure may be used for generating validation samples. Besides, a similar procedure may be applied to implement the self-supervised learning technique, which involves inverting the introduced transformations in the results produced by the ANN model, and forming a single labeling for the source image by averaging. When designing and implementing the procedure, it is important to ensure its reproducibility by initializing the random number generator with a known, explicitly set value.
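In practice this means fixing all relevant random sources at start-up, e.g. (the seed value itself is arbitrary, provided it is recorded):

# Fix all random sources for reproducible sample generation.
import random
import numpy as np
import torch

SEED = 20190301                      # arbitrary but explicitly recorded
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
rng = np.random.default_rng(SEED)    # a dedicated generator object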
10.4 Interpretation of Aerospace Imagery by Means of Artificial Neural Networks
The ASI peculiarities mentioned above, mainly the arbitrary channel/spectral structure, restrict the application of the majority of pretrained models and of the knowledge transfer family of methods. The limited accessibility of ASI, in its turn, restricts the applicability of ready-made ANN topologies: the number of their free parameters allows them to completely
"remember" the entire training sample before generalization effects occur, thus demonstrating the overfitting phenomenon. The demand to integrate data with different spatial resolutions and with arbitrary channel/spectral structures makes it necessary to adapt ANN topologies to practically every new task. Thus, ASI processing is, on the one hand, impossible without contemporary achievements in the field of ANN building and training; on the other hand, it poses for researchers the problem of promptly adapting these achievements both to the overall ASI specifics and to the peculiarities of individual tasks. Finally, ready-made topologies may not satisfy the constraints imposed by the application environment. To solve this problem, the authors systematized and generalized their experience of building ANN topologies for ASI interpretation, identifying the most successfully used design patterns and the available variants of their modification for adaptation to various requirements. This allowed the formation of an ANN topology framework. To rationalize the implementation of the developed topologies, the authors also studied the design of ergonomic program interfaces supporting the designed framework on the basis of the PyTorch machine learning library [PyTorch].
10.4.1 ANN Topologies Building Framework Used for Aerospace Imagery Processing
To process ASI, the basic ANN topology must, on the one hand, consider the larger part of the content and, on the other, be capable of preserving small (1-2 pixel) details. The U-Net [Ronneberger, Fischer, Brox, 2015] and U-Net++ [Zhou, Siddiquee et al., 2018] topologies, initially designed for the interpretation of medical images, have proven themselves by just that property. The LinkNet [Chaurasia, Culurciello, 2017] and The One Hundred Layers Tiramisu [Jégou, Drozdzal et al., 2016] topologies are based on the same principle (skip connections). The basis of these topologies is the formation of a feature pyramid implementing a multiscale representation of the latent space. The scale matches the source image at the zero level, and every next level corresponds to halving the image side; thus, the latent scale 2^{-n} is represented at level n ≥ 0. The legend for the following figures is shown in Figure 10.16. The ANN topology model offered by the authors is a generalized variant of the U-Net topology [Ronneberger, Fischer, Brox, 2015] shown in Figure 10.17.
Figure 10.16 Reference designations used by the authors in the ANN topology diagrams: channel-wise concatenation (Concat); element-wise sum; element-wise product; mix-in operation; arbitrary, sigmoid-like and ReLU/ELU/… nonlinearities; downsample and upsample operations; 2D convolution layer with C channels and W×H kernel (Conv2D C×(W×H)); single linear layer with C channels (Linear C); basic block (any); group of basic blocks; encoder; decoder; adapter.
Figure 10.17 Basic ANN topology offered by the authors, generalizing U-Net, U-Net++, LinkNet and other mechanisms. The encoder and decoder span scale levels 0-3; the primary input adapter, secondary input adapters, medium adapter and output adapters connect at the corresponding scale levels.
Each of the base blocks of this model is parameterizable, which allows modifying their behavior over a wide range without changing the basic logic or the program code implementing it. The main parameters include:
–– the number of scale levels,
–– the type of base blocks,
–– the data mix-in procedure,
–– the downsampling procedure,
–– the upsampling procedure.
All parameters except the number of scale levels are passed in the form of factories (the object-oriented programming pattern known as "factory"). Moreover, to support input data of multiple scales, the encoder supports data mix-in at all scale levels. Similarly, to support heterogeneous output data, the decoder supports branching at the beginning of any scale level. The intermediate adapter, having at its disposal the complete pyramid of latent representations, allows extending the data conversion logic between the encoder and the decoder, providing an opportunity to implement, for example, an analog of the U-Net++ topology over this framework [Zhou, Siddiquee et al., 2018]. Every network input is equipped with an input adapter formed according to the structure of the input data; every network output is provided with an output adapter selected with due regard to the task solved by that decoder branch. A survey of the individual blocks whose operability the authors verified within the offered framework is presented below.
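A compressed PyTorch sketch of the factory parametrization idea follows. It is not the authors' framework: the multi-scale secondary inputs, decoder branches and adapters are omitted, and the class and factory names are hypothetical.

# Skeleton of a factory-parameterized U-Net-like topology (PyTorch).
import torch
import torch.nn as nn

def conv_block_factory(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

def down_factory():
    return nn.MaxPool2d(2)

def up_factory():
    return nn.Upsample(scale_factor=2, mode="nearest")  # avoids checkerboard

class MiniUNet(nn.Module):
    def __init__(self, c_in, n_classes, levels=4, width=32,
                 block=conv_block_factory, down=down_factory, up=up_factory):
        super().__init__()
        chs = [width * 2 ** i for i in range(levels + 1)]
        self.inp = block(c_in, chs[0])
        self.enc = nn.ModuleList(block(chs[i], chs[i + 1])
                                 for i in range(levels))
        self.dec = nn.ModuleList(block(chs[i] + chs[i + 1], chs[i])
                                 for i in range(levels))
        self.down, self.up = down(), up()
        self.out = nn.Conv2d(chs[0], n_classes, 1)

    def forward(self, x):
        skips, x = [], self.inp(x)
        for enc in self.enc:
            skips.append(x)
            x = enc(self.down(x))
        for i in reversed(range(len(self.dec))):
            x = torch.cat([self.up(x), skips[i]], dim=1)  # concat mix-in
            x = self.dec[i](x)
        return self.out(x)

model = MiniUNet(c_in=8, n_classes=5)          # e.g., 8 spectral channels
logits = model(torch.zeros(1, 8, 256, 256))    # -> (1, 5, 256, 256)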
Types of base blocks
Many base blocks for ANN topologies have been designed to date. In the authors' experience, two of them deserve special attention due to their relative simplicity and the stability of the results shown: the base blocks of the ResNet [He, Zhang et al., 2015] and Inception-ResNet-v2 [Szegedy, Ioffe et al., 2016] topologies, which the authors use within the designed framework without modifications. Besides, the authors offer a modification of the Inception-ResNet-v2 base block for application in small topologies, called Inception-G; its functional diagram is shown in Figure 10.18.
Figure 10.18 Inception-G base block: the authors' modification of the Inception-ResNet-v2 base block for small topologies. Branches from the input: Conv2D L×(1×1); Conv2D L×(1×1) → Conv2D L×(3×3) → Conv2D 2L×(3×3); Conv2D L×(1×1) → Conv2D L×(1×7) → Conv2D 2L×(7×1); the branches are concatenated and passed through Conv2D 5L×(1×1) to the output.
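Read from the diagram, the block might be rendered in PyTorch as follows. The channel counts follow Figure 10.18, while the residual sum and the nonlinearity after it are assumptions based on the Inception-ResNet-v2 lineage, not something the figure states.

# Sketch of the Inception-G base block as read from Figure 10.18.
import torch
import torch.nn as nn

def conv(c_in, c_out, kernel, padding):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel, padding=padding),
                         nn.ReLU(inplace=True))

class InceptionG(nn.Module):
    """Input and output both have 5*l channels, enabling the residual sum."""
    def __init__(self, l):
        super().__init__()
        c = 5 * l
        self.branch1 = conv(c, l, 1, 0)
        self.branch2 = nn.Sequential(conv(c, l, 1, 0),
                                     conv(l, l, 3, 1),
                                     conv(l, 2 * l, 3, 1))
        self.branch3 = nn.Sequential(conv(c, l, 1, 0),
                                     conv(l, l, (1, 7), (0, 3)),
                                     conv(l, 2 * l, (7, 1), (3, 0)))
        self.project = nn.Conv2d(c, c, 1)    # the concluding 5L x (1x1) conv

    def forward(self, x):
        # Concatenation yields l + 2l + 2l = 5l channels.
        y = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], 1)
        return torch.relu(x + self.project(y))   # residual sum (assumption)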
The opportunity to choose among these base blocks allows giving priority to network compactness and learning speed (ResNet), to the quality of the obtained result (Inception-ResNet-v2), or to the quality of the result together with tolerance of training data deficit (Inception-G). The latter is especially important for objects with complex spatial structure and fine details (country roads and some types of constructions). When implementing these base blocks, special attention must be paid to the possible parametrization of the nonlinearity functions used, which will be considered in detail later. A second way of optimization, so-called separable convolutions, consists in substituting complex convolution operations with simpler combinations. It is already used in the diagram shown above (Figure 10.18), where the 7×7 convolution is replaced with a set of 1×7 and 7×1 convolutions, considerably decreasing the computational complexity at a modest quality loss. These ways of building ANN topologies are currently studied, for example, in the works [Lin, Chen, Yan, 2014] [Szegedy, Liu et al., 2014] [Chollet, 2017] [Xie, Girshick et al., 2017] [Zhang, Zhou et al., 2018] and many others. The survey by [Bai, 2019] is worthy of special attention.
In the authors' opinion, the most promising variant is the depthwise separable convolution without an intermediate nonlinearity function between the depthwise and pointwise steps, as offered by [Chollet, 2017].
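In PyTorch this variant may be expressed as follows (a sketch; the helper name is hypothetical):

# Depthwise separable convolution without intermediate nonlinearity.
import torch.nn as nn

def separable_conv(c_in, c_out, kernel=3, padding=1):
    return nn.Sequential(
        # Depthwise step: one spatial filter per input channel.
        nn.Conv2d(c_in, c_in, kernel, padding=padding, groups=c_in),
        # Pointwise step: 1x1 channel mixing, no nonlinearity in between.
        nn.Conv2d(c_in, c_out, 1))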
Downsampling block types
Within the offered framework the authors successfully used the 2D max pooling procedure and the Reduction-B block from [Szegedy, Ioffe et al., 2016] as downsampling blocks.
Upsampling block types
Considering the possible occurrence of "checkerboard artifacts" when convolution-based methods are applied, the authors suggest using only nearest neighbor upsampling in the course of ASI processing. This issue is studied in detail by [Odena et al., 2016].
Nonlinearity function types
The predominant nonlinearity function used today in pattern interpretation tasks is ReLU [Hahnloser, Sarpeshkar et al., 2000]:
\[ f(x) = \max(0, x) \]
Its main and unmatched advantage is its incredible simplicity and performance; in particular, it may be computed in place, without copying memory. Its major disadvantage is the so-called "dying neurons" problem [Lu, Yeonjong et al., 2019]. A compromise solution to this problem is the Leaky ReLU function offered by [Maas, Hannun, Ng, 2013]. Many alternative solutions are available by now: GELU, SELU, ELU and others; a detailed survey is presented in [Hansen, 2019]. In the authors' opinion, for small datasets it is worth considering switching the base blocks from ReLU to ELU [Clevert, Unterthiner, Hochreiter, 2016]: this consistently accelerates convergence and improves the quality indicators. The apparent disadvantage of ELU relative to ReLU is its higher computational cost.
Mix-in function types
Mix-in, in the context of the considered framework, is a procedure for merging multidimensional arrays of equal spatial size that hold feature representations in the latent space.
Mix-in may be used by the encoder to merge data from different sources or of various scales, and by the decoder to merge data of neighboring scale levels after adapting their spatial resolution by means of the upsampling procedure. The authors suggest using the following mix-in procedures within this framework:
1. concatenation in the channel dimension,
2. addition,
3. addition followed by a pointwise convolution.
The first method is used in the original U-Net topology, on whose ideas this framework is based [Ronneberger, Fischer, Brox, 2015]. The second method is used in the LinkNet topology [Chaurasia, Culurciello, 2017]; it is more efficient and converges faster, but systematically concedes to variant 1 in result quality, and it requires an equal number of channels in the input data arrays. The third method is a compromise; in particular, the residual connections of the ResNet topology are realized with its help [He, Zhang et al., 2015]. As a general recommendation, the authors advise using the first method to achieve the highest recognition quality when there is no shortage of memory or computational performance.
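The three variants may be sketched in PyTorch as follows (a and b are feature arrays of equal spatial size; the function names are hypothetical):

# Three mix-in variants for feature arrays of equal spatial size.
import torch
import torch.nn as nn

def mixin_concat(a, b):
    # 1. Concatenation in the channel dimension (original U-Net).
    return torch.cat([a, b], dim=1)

def mixin_add(a, b):
    # 2. Addition (LinkNet); requires equal channel counts.
    return a + b

class MixinAddPointwise(nn.Module):
    # 3. Addition followed by a pointwise (1x1) convolution.
    def __init__(self, channels):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, a, b):
        return self.pw(a + b)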
Loss functions This framework is oriented toward solving semantic segmentation tasks, i.e., estimating the probability that every pixel belongs to one or several classes. Generally, the classes are mutually exclusive in semantic segmentation tasks. A loss function estimating the cross entropy is successfully used for training ANN models of this type [Mannor, Peleg, Rubinstein, 2005]:

CE = −∑_{i∈C} t_i log( f(s_i) )
where t_i is the target value, s_i is the ANN model output, C is the set of classes, and f is a nonlinearity function (e.g., softmax). In cases when the task can be formulated through the extraction of several groups of mutually exclusive classes, the ANN topology may be
extended by adding a second output adapter, with both output adapters trained on the basis of the cross entropy loss. An example of such a situation is the necessity to simultaneously classify:
–– the type of geological substrate: asphalt, rammed concrete, ground coat, grass, etc.;
–– the structural type of the object: road, backyard territory, park, etc.
Building a single ANN model that simultaneously complies with these two classifications is justified because the interpretation of objects according to the second categorization is impossible without defining the first one; at the same time, knowing that an object belongs to the second categorization may refine its classification in the first. A sketch of such a two-headed model is given below. Moreover, other loss functions can be applied successfully, such as the Dice score, the Jaccard index [Bertels et al., 2019] and Focal Loss [Tsung-Yi Lin et al., 2017], used for the interpretation of medical images. Focal Loss deserves special attention in tasks of detecting small (1-2 pixel) details.
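A hedged sketch, assuming a shared encoder-adapter-decoder body with two output adapters, each trained with its own cross entropy term (all names here are illustrative):

import torch.nn as nn

# Two output adapters ("heads") over a shared body; each head predicts one
# group of mutually exclusive classes and contributes a cross entropy term.
class TwoHeadSegmenter(nn.Module):
    def __init__(self, body, feat_ch, n_substrate, n_structural):
        super().__init__()
        self.body = body  # encoder-adapter-decoder producing feat_ch maps
        self.head_substrate = nn.Conv2d(feat_ch, n_substrate, kernel_size=1)
        self.head_structural = nn.Conv2d(feat_ch, n_structural, kernel_size=1)

    def forward(self, x):
        feats = self.body(x)
        return self.head_substrate(feats), self.head_structural(feats)

# Training step sketch: the total loss sums both cross entropies.
# criterion = nn.CrossEntropyLoss()
# out_sub, out_str = model(batch)
# loss = criterion(out_sub, target_sub) + criterion(out_str, target_str)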
Regularization The authors successfully used the classical regularization methods within this framework:
–– Dropout [Srivastava et al., 2014],
–– Batch Normalization [Ioffe, Szegedy, 2015],
–– L2 [Cortes, Mohri, Rostamizadeh, 2012].
The authors did not observe significant differences among these regularization methods as applied to ASI processing.
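A minimal sketch of enabling all three in PyTorch (the block composition is ours; L2 regularization is expressed here through the optimizer's weight_decay parameter):

import torch
import torch.nn as nn

# Base block with Dropout and Batch Normalization regularization.
block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Dropout2d(p=0.2),
)

# L2 regularization is commonly applied through weight decay.
optimizer = torch.optim.Adam(block.parameters(), lr=1e-3, weight_decay=1e-4)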
10.4.2 Object Non-Locality and Different Scales Within the scope of digital signal processing methods, locality means that samples closer (in space and time) to the one under consideration exert a considerably greater influence on its interpretation. For images, the closer a pixel is to the one concerned, the greater the influence it exerts on that pixel's interpretation. This holds for ASI when recognizing materials or objects.
However, in practice it is often necessary to interpret large areas of an image containing many instances and phenomena with a complex spatial structure. Specific examples are the interpretation of backyard territories, parks, multi-level junctions, oil spills on the water surface, erosion phenomena, etc. Their interpretation is characterized by non-locality, i.e., a weakening dependence between the proximity of signal elements and their interpretation. The base U-Net topology itself is capable of coping with some practically significant cases. However, convolutional methods generally do not solve the non-locality issue, and U-Net is no exception. In the course of studying this issue, the authors found several methods which may qualitatively improve the capability of ANN models to interpret non-local instances and phenomena.
Source image downsampling Source image downsampling is frequently used in the course of ASI processing. Its application may be considered obligatory for blurred images, or for images whose blur spreads the signal over more than 2-4 pixels. Typically, the quality of such images may be estimated "by eye", but there are also formal methods, such as those described in [Tsomko, Kim et al., 2008] [Pertuz, Puig, García, 2013] [Rosebrock, 2015]. This method may improve the accuracy of the ANN model and reduce the influence of the non-locality problem, but it does not solve it.
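One widely used formal sharpness measure of this kind, the variance of the Laplacian popularized by [Rosebrock, 2015], can be sketched as follows (the threshold value is illustrative):

import cv2

def blur_score(image_path):
    # Variance of the Laplacian: low values indicate a blurred image.
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(image, cv2.CV_64F).var()

# Illustrative usage: flag an image for downsampling if it looks blurred.
# if blur_score("fragment.tif") < 100.0:  # threshold chosen empirically
#     ...downsample the source image...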
Scale levels number increase The number of scale levels in topologies of the U-Net [Ronneberger, Fischer, Brox, 2015] or LinkNet [Chaurasia, Culurciello, 2017] type is limited only by the source image size, so increasing it is the most obvious approach to dealing with non-locality. It is carried out by increasing the number of downsampling layers in the encoder block as well as upsampling layers in the decoder block, while simultaneously adding base blocks between them. This enlarges the ANN model, its computational complexity and its memory consumption; thus, the use of this method is limited by GPGPU capabilities. On the other hand, experiments demonstrate that the dependence between training quality indicators and the number of scale levels is nonlinear and differs significantly across applications. Due to this,
the authors do not recommend considerably increasing the number of scale levels for the classical U-Net: the standard is 3-5 levels, with a maximum of 6. This method may also reduce the influence of the non-locality problem, but it is not a solution either.
“Squeeze-and-Excitation” The Squeeze-and-Excitation method is presented in the work of [Hu, Shen et al., 2017]. It consists in detecting and distributing global dependences between the channels; its main property is adaptability to any type of base block (shown in Figure 10.19). This method shows stable, good results at low computational complexity and is therefore recommended. Apart from the non-locality problem, it reduces the influence of the variability of global image properties such as illumination, white balance, shadow type, etc.
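A hedged PyTorch sketch of such a block, following the squeeze (channel-wise average) and excitation (two linear layers with an L/8 bottleneck, as in Figure 10.19) scheme; the class name and reduction factor follow the figure rather than any canonical implementation:

import torch.nn as nn

# Squeeze-and-Excitation: global average pool ("squeeze"), two linear
# layers with an L/8 bottleneck ("excitation"), then channel-wise rescaling.
class SqueezeExcitation(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w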
“Stacked Hourglass” The “Stacked Hourglass” method offered in the work of [Newell, Yang, Deng, 2016] can also be applied within the suggested framework. It consists in the multiple repetition of the encoder-adapter-decoder group.
Figure 10.19 Functional diagram of adding Squeeze-and-Excitation to any base block (input → channel-wise average, the "squeeze" operation → linear layer with L/8 outputs → linear layer with L outputs, the "excitation" operation → output).
To overcome problems with vanishing gradients, learning speed and stability, the work of [Newell, Yang, Deng, 2016] offers "intermediate supervision": the interpretation results obtained from every intermediate decoder are included in the loss function. In terms of the offered framework, this requires adding an output adapter after each encoder-adapter-decoder group. The functional diagram of a topology built according to this principle is shown in Figure 10.20. This approach is remarkably robust and scalable (through an increase in the number of encoder-adapter-decoder groups) with regard to the non-locality problem. Unfortunately, its application is restricted by the large size of the resulting ANN model, which in turn requires a large amount of GPGPU RAM.
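A hedged sketch of such intermediate supervision in PyTorch (the hourglass modules and output adapters are passed in as placeholders; the loss simply sums the per-stage terms):

import torch.nn as nn

# Stacked hourglass with intermediate supervision: every
# encoder-adapter-decoder group gets its own output adapter, and every
# stage's prediction contributes to the loss.
class StackedHourglass(nn.Module):
    def __init__(self, hourglasses, output_adapters):
        super().__init__()
        self.hourglasses = nn.ModuleList(hourglasses)
        self.heads = nn.ModuleList(output_adapters)

    def forward(self, x):
        predictions = []
        for hg, head in zip(self.hourglasses, self.heads):
            x = hg(x)
            predictions.append(head(x))
        return predictions  # one prediction per stage

# criterion = nn.CrossEntropyLoss()
# loss = sum(criterion(p, target) for p in model(batch))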
Supplemented encoder context The "Supplemented encoder context" method was developed by the authors specifically for ASI interpretation tasks. It is carried out after training the ANN model in the standard mode. The general procedure (Figure 10.21; a code sketch follows the procedure):
1. Extract the input adapter and encoder from the model trained in the standard procedure.
2. Freeze their trainable parameters, excluding them from the operation of the optimization algorithm.
3. Create a new intermediate adapter and decoder.
4. Carry out training and classification in the following way:
4.1. Use the trained input adapter and encoder to form the latent representations of all image fragments.
4.2. Produce enlarged groups of fragments, e.g., of 4, 9 or 16 base fragments. Fuzzy matching and overlap are allowed, but this must be accounted for in step 4.4.
4.3. For each group of fragments:
Figure 10.20 Functional diagram of a topology built according to the stacked hourglass principle, with intermediate supervision applied at each stage.
Figure 10.21 Functional diagram of the supplemented encoder context method (source image → split into fragments → pretrained input adapter and encoder → latent representation in high resolution → concatenation → downsampling → latent representation in low resolution → new intermediate adapter, decoder and output adapter → upsampling → inference result).
4.3.1. Form their joint latent representation by means of concatenation.
4.3.2. Downsample the obtained latent representation (e.g., by means of a MaxPooling procedure).
4.3.3. Carry out the training/classification procedure using the result of step 4.3.2.
4.3.4. Undo the downsampling of step 4.3.2 by upsampling the result of step 4.3.3 (e.g., by means of nearest neighbor upsampling).
4.4. Recover the classification result over the fragments.
Thus, the decoder has a much greater spatial reach than the encoder. It may prove necessary to increase the complexity of the decoder created at step 3. According to the authors' experience, this approach can cope with the interpretation of instances and phenomena whose size exceeds the size of the image fragment recognized at a time; in this case the pixel-by-pixel recognition rate is 4-6% ahead of the trivial reduction of input data size. This method combines successfully with other methods, particularly with Squeeze-and-Excitation. However, for it to be applicable, the encoder must have learned the important base features at the first stage of training. This restricts the applicability of the method for large homogeneous objects that possess no expressed structural features except at the edges.
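A hedged sketch of steps 4.3.1-4.3.4 for a 2x2 group of fragments (all names are placeholders; we interpret the concatenation of step 4.3.1 as a spatial mosaic of the per-fragment latent maps, channel-wise concatenation being an alternative reading):

import torch
import torch.nn.functional as F

def process_group(frozen_encoder, new_decoder, fragments_2x2):
    # Step 4.1: latent representations of the 4 base fragments
    # (encoder frozen, so no gradients flow into it).
    with torch.no_grad():
        latents = [frozen_encoder(f) for f in fragments_2x2]
    # Step 4.3.1: joint latent representation as a 2x2 spatial mosaic.
    top = torch.cat([latents[0], latents[1]], dim=3)
    bottom = torch.cat([latents[2], latents[3]], dim=3)
    joint = torch.cat([top, bottom], dim=2)
    # Step 4.3.2: downsample back to the size of a single latent map.
    small = F.max_pool2d(joint, kernel_size=2)
    # Step 4.3.3: decode; only the new decoder's parameters are trained.
    decoded = new_decoder(small)
    # Step 4.3.4: undo the downsampling by nearest neighbor upsampling.
    return F.interpolate(decoded, scale_factor=2, mode="nearest")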
10.4.3 Topology Customizing to the Different Channel/Spectral Structures of Aerospace Imagery The authors' numerous experiments showed that the topology offered within this framework demonstrates stable results when trained on various channel/spectral structures (from hyperspectral to monochrome, including service channels with rasterized vector data, survey conditions, etc.). Similar properties are characteristic of many other studied ANN topologies. Naturally, ANN models trained on one channel/spectral structure are not applicable to others; thus, customizing the topology to a different channel/spectral structure via the proposed framework comes down to building an input adapter with the relevant number of input channels. Another case is of interest: the necessity to apply a single ANN model to different channel/spectral structures. Such a situation often occurs if several heterogeneous data sources, each of
which by itself cannot yield a qualitative model even by means of augmentation, are available. The further presentation assumes that the spatial resolution of the different sources has been adjusted, or that its differences are accounted for at the stages of data preprocessing and training sample building (augmentation). The simplest, degenerate case is to train the ANN model only on the channels available in all source channel/spectral structures, i.e., to use their intersection. Apparently, such an approach neither exploits the advantages of the heterogeneous data nor considers their peculiarities in totality. Similar limitations apply to the method based on reducing all variants of channel/spectral structure to a single one by some trivial routine (e.g., a monochrome image is easily obtained from RGB). Therefore, if the channel/spectral structures differ only slightly, a better result may be obtained not from their intersection but from their union. In this case, it is ideal at the augmentation stage to provide for the synthesis of missing data (where possible), for its substitution with constants or noise, and for the absence of data in channels that become available in the operating mode only simultaneously. For example, suppose one channel/spectral structure has the form:
( NIR, M ),
where NIR is near infrared and M is monochrome, and the other:
( R, G, B ). Their union is represented by the channel/spectral structure:
( NIR, M, R, G, B ). An image of the first structure is then extended to
( NIR, M, 0, 0, 0 ),
and one of the second to
( 0, 0, R, G, B ).
In this case, if M = ( R + G + B )/3, the following channel/spectral structures can be generated at the augmentation stage to improve the efficiency of training (a code sketch of this augmentation is given at the end of this subsection):
( NIR, M, H, H, H )
( H, M, R, H, H )
( H, M, H, G, H )
( H, M, H, H, B )
( H, M, R, G, B )
( NIR, M, M, M, M )
( H, M, M, M, M )
( H, M, M, H, H )
( H, M, H, M, H )
( H, M, H, H, M )
...
where H is a channel filled with a uniform constant value (-1, 0 or +1) or random data. However, this approach falls short when the intersection between the channel/spectral structures is small or completely missing, or when the number of channels in their intersection is large. As alternatives, the authors researched:
1. Training separate input adapters for every channel/spectral structure.
2. Training separate input adapters and encoders for every channel/spectral structure, transferring the base load to the decoder.
3. Training a single input adapter that is universal for all channels (the spectral coverage of each channel is encoded and transmitted to the adapter together with the channel data), followed by aggregation of all inputs by an order-invariant operation (in particular, addition). In this case, the input adapter must generate considerably more (2-3 times) latent representation channels than there may be input image channels.
The conception of the third method is based on the authors' interpretation of methods presented in the works on geometric deep learning [Bronstein, Bruna et al., 2017] [Monti, Boscaini, Masci et al., 2016]. Each of these methods demonstrated its efficiency in principle. However, the authors do not have a sufficient amount of labeled training data with a
heterogeneous channel/spectral structure to obtain definitive results as to the quality indicators of these methods.
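Returning to the union-based augmentation above, a minimal NumPy sketch (channel order (NIR, M, R, G, B); the fill strategy and names are our assumptions):

import numpy as np

H_FILL = 0.0  # placeholder value; -1, +1 or random noise are alternatives

def augment_channels(nir, m, r, g, b, keep_mask):
    # Build a (5, H, W) array in the (NIR, M, R, G, B) channel order,
    # replacing channels dropped by keep_mask with the H placeholder.
    channels = [nir, m, r, g, b]
    out = [c if keep else np.full_like(m, H_FILL)
           for c, keep in zip(channels, keep_mask)]
    return np.stack(out)

# Example: ( H, M, R, H, H ) -- keep only M and R.
# sample = augment_channels(nir, m, r, g, b, keep_mask=(0, 1, 1, 0, 0))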
10.4.4 Integration of Aerospace Imagery with Different Spatial Resolutions The task of integrating data from different sources is often met in the practice of ASI interpretation, and the most frequent and insistent difference is one of spatial resolution. The obvious way to eliminate such a difference is to align the spatial resolutions of the data sources by downsampling and upsampling procedures, which may be trivial (based on interpolation) or more specialized (e.g., based on pansharpening) [Mitianoudis, Tzimiropoulos, Stathaki, 2010] [Liu, Wang, 2013]. However, if the spatial resolutions differ by more than a factor of 2, such an approach becomes wasteful both in the resources consumed by the ANN model and in those expended on preprocessing. The framework offered by the authors solves this problem through a data mix-in from an additional encoder source; it may also require adjustment of the spatial resolution, but only up to a 2^N multiplier. This opportunity is shown in Figure 10.17 by means of data integration through mix-in blocks and secondary input adapters. If the additional data have a complex substructure, some typical base blocks can be integrated after the input adapter. The authors found no considerable changes in quality indicators when applying this method compared to complete adjustment of the sources' spatial resolutions.
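A hedged sketch of such an integration point (illustrative names; a secondary source is mixed in, here by concatenation, at the encoder level whose spatial size it matches):

import torch
import torch.nn as nn

# Secondary input adapter whose output is mixed into the main encoder at
# the scale level matching the secondary source's resolution.
class SecondarySourceMixin(nn.Module):
    def __init__(self, secondary_ch, latent_ch):
        super().__init__()
        self.adapter = nn.Conv2d(secondary_ch, latent_ch, kernel_size=3,
                                 padding=1)

    def forward(self, encoder_features, secondary_image):
        # encoder_features and the adapter output must share spatial size,
        # which holds when the resolutions differ by exactly 2^N.
        return torch.cat([encoder_features, self.adapter(secondary_image)],
                         dim=1)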
10.4.5 Instance Segmentation The task of instance segmentation is one of the main ones in the interpretation of images, aerospace imagery included. Unlike semantic segmentation, this task implies the formation of an unordered set of instances, which is a meaningful constraint in itself: the base structures of current ANN topologies successfully deal with sequences (recurrent networks) or multidimensional arrays (convolutional networks), but not with unordered sets. At the fundamental level, this issue is studied within the geometric deep learning thread [Bronstein, Bruna et al., 2017] [Monti, Boscaini, Masci et al., 2016], which envisages the application of ANN models to sets, graphs and other structures. In practice, the approaches to the problem are based on
working out different ways to encode object representations compatible with the dominant ANN primitive blocks. The Region Proposal Networks family is the most developed class of methods for the instance segmentation task [He, Gkioxari et al., 2017] [Ren, He et al., 2017] [Redmon, Divvala et al., 2016]. Its basis is the generation of numerous object candidates that are subsequently filtered and refined, which in turn may be carried out by an additional ANN model. The object candidates are represented in the form of parameters of anchors identified on the image in advance. The initial positioning of these anchors is a compromise parameter: a high density decreases the overall efficiency of the scheme and increases the chance of false responses, while a low density results in ineffective detection of small and closely spaced objects. An alternative class of instance segmentation methods is based on integrating an intermediate state into the model (e.g., the coordinates of the anchor for instance search [Sofiiuk, Barinova, Konushin, 2019], or the involvement of recurrent ANN topologies [Salvador, Bellver et al., 2017]). In spite of their high degree of exploration, both of these classes have a significant disadvantage with regard to the ASI interpretation task: they are limited to the detection of large-area objects without large voids that do not compete for single anchors, and they cannot be adapted to linear and point objects. To reconcile these requirements, the authors suggest turning to one more class of instance segmentation methods, based on generating output data in a form that is trivially processed by classical methods. Among them are the methods based on the instance embedding conception, i.e., assigning to every image pixel coordinates in a feature space in which the clustering becomes trivial [Brabandere, Neven, Van Gool, 2017]. Another approach is the generation of "energy" fields for further use within the "watershed" method [Bai, Urtasun, 2017]. The authors suggest using exactly this method, extending the ANN models trained within this approach to approximate the "energy" fields not only for area objects, but also for linear and point ones. For area objects, the energy is defined as a measure of proximity to the object center; for point and linear objects, as a measure of proximity to the nearest point of the object. More generally, the object center means the central line, i.e., the result of skeletonization [Zhang, Suen, 1984] [Lee, Kashyap, Chu, 1994]. Thus, the localization of point and linear objects reduces to a search for local maxima (Figure 10.22). Besides, the use of this class of methods makes the other achievements available within the topological framework offered by the authors applicable as well. In the course of implementing this method, the authors recommend
Figure 10.22 Examples of geometric structures (above) and “energy” fields for them (below).
referring to the original work [Bai, Urtasun, 2017] on the issues relating to the loss functions and the process of training data preparation. In the case of ASI, this method proved itself when all objects within a class have similar linear dimensions in the imagery; problems may arise during method fine tuning when the variability of dimensions increases. The accuracy of the method can be increased by moving from approximating a single measure of proximity to the center (or central lines), i.e., the "energy", to approximating a richer set of characteristic parameters: the shift toward the center along two axes, or their integration into a latent instance embedding. These approaches are applied and developed in the works of [Neven, De Brabandere et al., 2019] [Zhang, Wang et al., 2020]; the authors are currently studying the applicability of their results to ASI.
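A hedged sketch of preparing such an "energy" target for an area object (SciPy; taking the energy as the normalized distance to the background is one plausible reading of "proximity to the object center"):

import numpy as np
from scipy import ndimage

def energy_field(mask):
    # mask: boolean array, True inside the object.
    # Distance of every object pixel to the nearest background pixel,
    # normalized so that the "energy" peaks toward the object center.
    dist = ndimage.distance_transform_edt(mask)
    return dist / dist.max() if dist.max() > 0 else dist

# Instances can then be recovered from the predicted field with a
# watershed-style procedure, e.g. skimage.segmentation.watershed.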
10.4.6 Learning Rate Strategy The learning rate is one of the most important hyperparameters in the process of ANN model training: values that are too high lead to results of insufficient quality, while values that are too low lead to premature settling into the nearest local minimum. At present, the predominant approach is to apply an automated learning rate management policy [Zulkifli, 2018] [Lau, 2017].
In typical cases, the learning rate management policies envisaged in the mentioned works yield predictable, high-quality results. However, the authors faced great fluctuations of the quality indicators on the validation sample in certain cases characterized by:
1. objects of interest with a poorly resolved substructure (waste storage areas, erosion processes),
2. controversial labeling of the training sample,
3. a low volume of the training sample (up to 200 samples before augmentation).
For such cases the authors developed a special training procedure, called forking learning, which combines the early stopping [Brownlee, 2018] and learning rate decay techniques. The functional diagram of this procedure (a code sketch follows):
1. Initialize the ANN model with random values.
2. Carry out a learning step; compute and store the quality indicators.
3. Carry out a validation step; compute and store the quality indicators.
4. Determine the best global version of the model by the quality indicators of the validation step.
5. If the best global version of the model was obtained more than N training epochs ago:
5.1. Load the best global version of the model.
5.2. Decrease the learning rate according to the selected algorithm.
5.3. Go to step 2.
6. If the best global version of the model was obtained more than M training epochs ago, end the process; otherwise, go to step 2.
Once the process has ended, the set of models with the best quality indicators at the validation step may be subjected to in-depth automated or manual analysis. An illustration of the training process according to the forking learning procedure is shown in Figure 10.23.
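A hedged sketch of the procedure's control flow (framework-agnostic Python; the training, validation and checkpoint callbacks as well as the decay factor are placeholders supplied by the caller):

def forking_learning(model, lr, train_epoch, validate,
                     save_checkpoint, load_checkpoint,
                     N=10, M=40, decay=0.1):
    # train_epoch(model, lr) runs one epoch; validate(model) returns a
    # scalar quality indicator; the checkpoint helpers store/restore weights.
    best_score, stale = float("-inf"), 0
    save_checkpoint(model, "best")             # step 1 done by the caller
    while True:
        train_epoch(model, lr)                 # step 2: learning step
        score = validate(model)                # step 3: validation step
        if score > best_score:                 # step 4: new global best
            best_score, stale = score, 0
            save_checkpoint(model, "best")
        else:
            stale += 1
        if stale >= M:                         # step 6: end the process
            break
        if stale and stale % N == 0:           # step 5: fork a new branch
            load_checkpoint(model, "best")     # 5.1: reload best version
            lr *= decay                        # 5.2: decrease learning rate
    load_checkpoint(model, "best")
    return model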
10.4.7 Program Interfaces Organization The ANN topology framework proposed by the authors for ASI interpretation is a purely theoretical construction. Its practical implementation has to be
Figure 10.23 Dynamics of the training process carried out according to the forking learning procedure (validation accuracy over epochs; branches with learning rates 0.100, 0.010, 0.001 and 0.010, with branching points and overall top candidates marked).
based on relevant support from the software platform used for ANN model building. Although the available means of software implementation of ANN models, both those realizing the imperative paradigm (e.g., PyTorch [PyTorch] and TensorFlow [TensorFlow]) and those realizing the declarative paradigm (e.g., Keras [Keras Documentation]), allow a great many ANN topologies to be implemented, they do not provide a reasonable way to build parameterizable models. Thus, substituting a nonlinearity or adding Dropout-type regularization presupposes interference with the program code of the base topology implementation, which considerably reduces the reusability of the created solutions. A detailed study of these issues leads to the more fundamental problem of building ergonomic, expressive and complete (granting access to all system features) program interfaces [Mayorov, Gvozdiev, 2014]. In the authors' opinion, this problem is substantially undervalued in the field of ANN model implementation. In relation to the study of ANN model application to ASI interpretation, the authors have begun developing API organization principles for modeling ANN topologies that can fully meet the needs of the developed topological framework. The main program libraries used to build and train ANN models provide APIs accessible from the Python programming language.
As a consequence, the first principle of the required API building is the effective use of the capabilities of the Python programming language. Python is known for its low performance, which is compensated by the second principle: the separation of the modeling (describing, producing) phase from the phase of performing computations. The third principle constitutes the key difference between the offered API and the existing ones: ubiquitous parametrization, in particular by means of dependency injection. The fourth principle is the minimization of redundancy: everything that can be determined automatically must be determined automatically, and everything that must be set externally must be set only once and in only one place. This is an answer to the peculiarities of the PyTorch API, which in particular requires indicating the number of channels both at the input and at the output of every layer; this leads to a redundancy that the authors strive to resolve. The fifth principle is the principle of transparent abstractions, originally offered by [Gvozdiev, 2015]: abstractions must hide the complexity of their own internal arrangement, but not the complexity of the abstractions that are base with regard to them. As a result of applying these principles, the authors implemented a two-level program interface, where the first level allows an ANN topology to be modeled in the form of a graph that is built up from the inputs to the outputs by means of pointers, and the second level allows high-level constructions over the given graph to be automated. This abstraction is based on the following entities:
Graph( adapter ) is a graph model of an ANN topology, where adapter is an entity coupled with the base library. In the further example, adapter = PyTorchAdapter(), and PyTorch is used as the base library.
Model is the result of the Graph( ... ).build() operation: a ready-to-work ANN model invoked by calling it (it implements the __call__ interface):

graph = Graph( PyTorchAdapter() )
# … building graph …
model = graph.build()
result = model( { input_id: input_data } )
# result[output_id] contains the output data
Pointer is a graph node pointer created by calling Graph( ... ).input( input_id ); it provides access to the creation of the graph structure through node expansion. It supports the following operations:
p.fork() – creates a second pointer to the same graph node.
p.join( p2 ) – unites several pointers into a MultiPointer.
p.output( output_id ) – marks the current node as one returned among the results of the model computation.
p