Handbook of Image Processing and Computer Vision: Volume 3: From Pattern to Object [1st ed.] 9783030423773, 9783030423780

Across three volumes, the Handbook of Image Processing and Computer Vision presents a comprehensive review of the full range of the field, from the formation of the digital image to the recognition of objects in the scene.


English | Pages: XXIII, 676 [694] | Year: 2020


Table of contents:
Front Matter (Pages i-xxiii)
Object Recognition (Arcangelo Distante, Cosimo Distante) (Pages 1-192)
RBF, SOM, Hopfield, and Deep Neural Networks (Arcangelo Distante, Cosimo Distante) (Pages 193-260)
Texture Analysis (Arcangelo Distante, Cosimo Distante) (Pages 261-314)
Paradigms for 3D Vision (Arcangelo Distante, Cosimo Distante) (Pages 315-411)
Shape from Shading (Arcangelo Distante, Cosimo Distante) (Pages 413-478)
Motion Analysis (Arcangelo Distante, Cosimo Distante) (Pages 479-598)
Camera Calibration and 3D Reconstruction (Arcangelo Distante, Cosimo Distante) (Pages 599-667)
Back Matter (Pages 669-676)

Arcangelo Distante Cosimo Distante

Handbook of Image Processing and Computer Vision Volume 3 From Pattern to Object

Handbook of Image Processing and Computer Vision

Arcangelo Distante • Cosimo Distante

Handbook of Image Processing and Computer Vision Volume 3: From Pattern to Object


Arcangelo Distante Institute of Applied Sciences and Intelligent Systems Consiglio Nazionale delle Ricerche Lecce, Italy

Cosimo Distante Institute of Applied Sciences and Intelligent Systems Consiglio Nazionale delle Ricerche Lecce, Italy

ISBN 978-3-030-42377-3    ISBN 978-3-030-42378-0 (eBook)
https://doi.org/10.1007/978-3-030-42378-0

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my parents and my family, Maria and Maria Grazia—Arcangelo Distante

To my parents, to my wife Giovanna, and to my children Francesca and Davide—Cosimo Distante

Preface

In the last 20 years, interdisciplinary research in the fields of physics, information technology and cybernetics, the numerical processing of signals and images, and electrical and electronic technologies has led to the development of Intelligent Systems. The so-called Intelligent Systems (or Intelligent Agents) represent the most advanced and innovative frontier of research in the electronic and computer field, able to directly influence the quality of life, the competitiveness and production methods of companies, to monitor and evaluate the environmental impact, to make public service and management activities more efficient, and to protect people's safety. The study of an intelligent system, regardless of the area of use, can be simplified into three essential components:

1. The first interacts with the environment for the acquisition of data of the domain of interest, using appropriate sensors (for the acquisition of Signals and Images);
2. The second analyzes and interprets the data collected by the first component, also using learning techniques to build/update adequate representations of the complex reality in which the system operates (Computational Vision);
3. The third chooses the most appropriate actions to achieve the objectives assigned to the intelligent system (choice of Optimal Decision Models), interacting with the first two components and with human operators in the case of application solutions based on man–machine cooperative paradigms (the current evolution of automation, including the industrial one).

It is within this scenario of advancing knowledge for the development of Intelligent Systems that the content of this manuscript is framed; it reports the authors' multi-year research and teaching experience together with the scientific insights available in the literature. In particular, the manuscript, divided into three parts (volumes), deals with the sensory subsystem through which an intelligent system perceives the environment in which it is immersed and is able to act autonomously. The first volume describes the set of fundamental processes of artificial vision that lead to the formation of the digital image from energy. The phenomena of light propagation (Chaps. 1 and 2), the theory of color perception (Chap. 3), the impact


of the optical system (Chap. 4), the transduction of luminous energy into an electrical signal (by the photoreceptors), and the transduction of that electrical signal (with continuous values) into discrete values (pixels), i.e., the analog-to-digital conversion of the signal (Chap. 5), are analyzed. These first five chapters summarize the process of acquisition of the 3D scene, in symbolic form, represented numerically by the pixels of the digital image (a 2D projection of the 3D scene). Chapter 6 describes the geometric, topological, quality, and perceptual information of the digital image. The metrics and the modalities of aggregation and correlation between pixels are defined, useful for defining symbolic structures of the scene at a higher level than the pixel. The organization of the data for the different processing levels is described in Chap. 7, while in Chap. 8 the representation and description of the homogeneous structures of the scene are shown. Chapter 9 begins the description of the image processing algorithms for the improvement of the visual quality of the image, based on point, local, and global operators. Algorithms operating in the spatial domain and in the frequency domain are shown, highlighting with examples the significant differences between the various algorithms, also from the point of view of the computational load.

The second volume begins with the chapter describing the boundary extraction algorithms based on local operators in the spatial domain and on filtering techniques in the frequency domain. Chapter 2 presents the fundamental linear transformations that have immediate application in the field of image processing, in particular to extract the essential characteristics contained in the images. These characteristics, which effectively summarize the global informational character of the image, are then used for the other image processing tasks: classification, compression, description, etc. Linear transforms are also used, as global operators, to improve the visual quality of the image (enhancement), to attenuate noise (restoration), or to reduce the dimensionality of the data (data reduction). In Chap. 3, the geometric transformations of the images are described, which are necessary in different applications of artificial vision, both to correct any geometric distortions introduced during the acquisition (for example, images acquired while the objects or the sensors are moving, as in the case of satellite and/or aerial acquisitions) and to introduce desired geometric visual effects. In both cases, the geometric operator must be able to reproduce as accurately as possible an image with the same initial information content through the image resampling process. In Chap. 4, Reconstruction of the degraded image (image restoration), a set of techniques is described that performs quantitative corrections on the image to compensate for the degradations introduced during the acquisition and transmission process. These degradations are represented by the fog or blurring effect caused by the optical system and by the motion of the object or the observer, by the noise caused by the opto-electronic system and by the nonlinear response of the sensors, and by random noise due to atmospheric turbulence or, more generally, to the process of digitization and transmission. While the enhancement techniques tend to reduce the degradations present in the image in qualitative terms, improving their


visual quality even when there is no knowledge of the degradation model, the restoration techniques are used instead to eliminate or quantitatively attenuate the degradations present in the image, starting from the hypothesis that the degradation models are known. Chapter 5, Image Segmentation, describes different segmentation algorithms; segmentation is the process of dividing the image into homogeneous regions, where all the pixels that correspond to an object in the scene are grouped together. The grouping of pixels into regions is based on a homogeneity criterion that distinguishes them from one another. Segmentation algorithms based on criteria of similarity of pixel attributes (color, texture, etc.) or on geometric criteria of spatial proximity of pixels (Euclidean distance, etc.) are reported. These criteria are not always valid, and in different applications it is necessary to integrate other information in relation to the a priori knowledge of the application context (application domain). In this last case, the grouping of the pixels is based on comparing the hypothesized regions with the a priori modeled regions. Chapter 6, Detectors and descriptors of points of interest, describes the most widely used algorithms to automatically detect significant structures (known as points of interest, corners, features) present in the image and corresponding to stable physical parts of the scene. The ability of such algorithms is to detect and identify physical parts of the same scene in a repeatable way, even when the images are acquired under conditions of varying illumination and a change of the observation point, with a possible change of the scale factor.

The third volume describes the artificial vision algorithms that detect objects in the scene and attempt their identification, 3D reconstruction, arrangement and location with respect to the observer, and eventual movement. Chapter 1, Object recognition, describes the fundamental algorithms of artificial vision to automatically recognize the objects of the scene, an essential capability of the vision systems of all living organisms. While a human observer recognizes even complex objects in an apparently easy and timely manner, for a vision machine the recognition process is difficult, requires considerable calculation time, and the results are not always optimal. The algorithms for selecting and extracting features become fundamental to the process of object recognition. In various applications, it is possible to have a priori knowledge of all the objects to be classified, because we know the sample patterns (meaningful features) from which we can extract information useful for the decision (decision-making) to associate each individual of the population with a certain class. These sample patterns (training set) are used by the recognition system to learn significant information about the object population (extraction of statistical parameters, relevant characteristics, etc.). The recognition process compares the features of the unknown objects with the model pattern features, in order to uniquely identify their class of membership. Over the years, various disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, ...) and application sectors (robotics, remote sensing, artificial vision, ...) have seen different researchers propose different methods of recognition and develop different algorithms based on different classification models. Although the proposed


algorithms have a common purpose, they differ in the properties attributed to the classes of objects (the clusters) and in the model with which these classes are defined (connectivity, statistical distribution, density, ...). The diversity of disciplines, especially between automatic data extraction (data mining) and machine learning, has led to subtle differences, especially in the use of the results and in the terminology, which is sometimes contradictory, perhaps because of the different objectives. For example, in data mining the dominant interest is the automatic extraction of groupings, whereas in automatic classification the discriminating power of the pattern classes is fundamental. The topics of this chapter overlap aspects related to machine learning and aspects of recognition based on statistical methods. For simplicity, the algorithms described are broken down, according to the way objects are classified, into supervised methods (based on deterministic, statistical, neural, and nonmetric models such as syntactic models and decision trees) and non-supervised methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong.

In Chap. 2, RBF, SOM, Hopfield and deep neural networks, four different types of neural networks are described: Radial Basis Functions (RBF), Self-Organizing Maps (SOM), the Hopfield network, and deep neural networks. An RBF network uses a different approach to the design of a neural network, based on a hidden layer (the only one in the network) composed of neurons in which radial basis functions are defined, hence the name, which performs a nonlinear transformation of the input data supplied to the network. These neurons are the basis functions for the input data (vectors). The nonlinear transformation in the hidden layer, followed by a linear one in the output layer, allows a pattern classification problem to be carried into a much larger space (through the nonlinear transformation from the input layer to the hidden one), where it is more likely to be linearly separable than in a low-dimensional space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is greater than the cardinality of the input signal). The SOM network, on the other hand, has an unsupervised learning model and the originality of autonomously grouping the input data on the basis of their similarity, without evaluating a convergence error against external information on the data. It is useful when there is no exact knowledge of the data with which to classify them. It is inspired by the topology of the cerebral cortex, considering the connectivity of the neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, whose connections are reinforced while those farther away become weaker. With the Hopfield network, the learning model is supervised, with the ability to store information and retrieve it even from partial content of the original information. Its originality rests on physical foundations that have revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution through a succession of states, until a final state corresponding to the minimum of the energy function is reached. This feature allows it to be used to set up and solve an optimization problem whose objective function is associated with an energy function. The


chapter concludes with the description of convolutional neural networks (CNNs), by now the most widespread since 2012, based on the deep learning architecture. In Chap. 3, Texture Analysis, the algorithms that characterize the texture present in the images are shown. Texture is an important component for the recognition of objects. In the field of image processing, the term texture has come to denote any geometric and repetitive arrangement of the gray levels (or colors) of an image. In this context, texture becomes an additional strategic component for solving the problem of object recognition, the segmentation of images, and problems of synthesis. Some of the algorithms described are based on the mechanisms of human visual perception of texture. They are useful for the development of systems for the automatic analysis of the information content of an image, obtaining a partitioning of the image into regions with different textures. In Chap. 4, Paradigms for 3D Vision, the algorithms that analyze 2D images to reconstruct a scene, typically of 3D objects, are reported. A 3D vision system faces the difficulty typical of inverse problems: from single 2D images, which are only a two-dimensional projection of the 3D world (a partial acquisition), it must be able to reconstruct the 3D structure of the observed scene and possibly define relationships between the objects. The 3D reconstruction takes place starting from 2D images that contain only partial information about the 3D world (loss of information in the 3D-to-2D projection), possibly using the geometric and radiometric calibration parameters of the acquisition system. The mechanisms of human vision are illustrated, based also on a priori prediction and knowledge of the world. In the field of artificial vision, the current trend is to develop 3D systems oriented to specific domains but with characteristics that go in the direction of imitating certain functions of the human visual system. 3D reconstruction methods are described that use multiple cameras observing the scene from multiple points of view, or sequences of time-varying images acquired from a single camera. Theories of vision are described, from the Gestalt laws to the paradigm of Marr's vision and the computational models of stereo vision. In Chap. 5, Shape from Shading (SfS), the algorithms to reconstruct the shape of the visible 3D surface using only the brightness variation information (shading, that is, the variations of gray level or color) present in the image are reported. The inverse problem of reconstructing the shape of the visible surface from the changes in brightness in the image is known as the Shape from Shading problem. The reconstruction of the visible surface should not be strictly understood as a 3D reconstruction of the surface. In fact, from a single point of observation of the scene, a monocular vision system cannot estimate a distance measure between the observer and the visible object, so with the SfS algorithms there is a nonmetric but qualitative reconstruction of the 3D surface. The theory of SfS is described, based on the knowledge of the light source (direction and distribution), the reflectance model of the scene, the observation point, and the geometry of the visible surface, which together contribute to the image formation process. The relationships between the light intensity values of the image and the geometry of the visible surface are derived (in terms of the orientation of the surface, point by point) under


some lighting conditions and the reflectance model. Other 3D surface reconstruction algorithms based on the Shape from X paradigm are also described, where X can be texture, structured light projected onto the surface to be reconstructed, or 2D images of the focused or defocused surface. In Chap. 6, Motion Analysis, the algorithms for the perception of the dynamics of the scene are reported, analogous to what happens in the vision systems of different living beings. With motion analysis algorithms, it is possible to derive the 3D motion, almost in real time, from the analysis of sequences of time-varying 2D images. Paradigms of motion analysis have shown that the perception of movement derives from the information of the objects, evaluating the presence of occlusions, texture, contours, etc. The algorithms described concern the perception of the movement occurring in physical reality, and not of apparent movement. Different motion analysis methods are examined, from those with limited computational load, such as those based on time-varying image differences, to more complex ones based on optical flow, considering application contexts with different amounts of motion and scene environments of different complexity. In the context of rigid bodies, algorithms are described that, from the motion analysis derived from a sequence of time-varying images, estimate, in addition to the movement (translation and rotation), the reconstruction of the 3D structure of the scene and the distance of this structure from the observer. Useful information is obtained, in the case of a mobile observer (robot or vehicle), to estimate the collision time. In fact, the methods for solving the problem of the 3D reconstruction of the scene acquire a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (uncalibrated camera), with the motion also unknown. The proposed methods thus address an inverse problem. Algorithms are described to reconstruct the 3D structure of the scene (and the motion), i.e., to calculate the coordinates of the 3D points of the scene whose 2D projection is known in each image of the time-varying sequence. Finally, in Chap. 7, Camera Calibration and 3D Reconstruction, the algorithms for calibrating the image acquisition system (normally a single camera or a stereo vision system) are presented; they are fundamental for extracting metric information from the image (detecting an object's size or determining accurate measurements of the object–observer distance). The various camera calibration methods are described that determine the intrinsic parameters (focal length, horizontal and vertical size of the individual photoreceptor of the sensor or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, and the coordinates of the principal point or optical center) and the extrinsic parameters that define the geometric transformation for passing from the world reference system to that of the camera. The epipolar geometry introduced in Chap. 5 is described in this chapter to solve the problem of the correspondence of homologous points in a stereo vision system with the two cameras calibrated or not. Epipolar geometry simplifies the search for the homologous points between the stereo images by introducing the Essential matrix and the Fundamental matrix. The algorithms for


estimating these matrices are also described, given a priori the corresponding points of a calibration platform. With epipolar geometry, the problem of searching for homologous points is reduced to mapping a point of one image onto the corresponding epipolar line in the other image. It is possible to simplify the correspondence problem further, reducing it to a one-dimensional point-to-point search between the stereo images. This is accomplished with the image alignment procedure known as stereo image rectification. The different rectification algorithms are described, some based on the constraints of epipolar geometry (uncalibrated cameras, where the fundamental matrix includes the intrinsic parameters) and others on the knowledge, or not, of the intrinsic and extrinsic parameters of calibrated cameras. Chapter 7 ends with a section on the 3D reconstruction of the scene in relation to the knowledge available about the stereo acquisition system. The triangulation procedures for the unambiguous 3D reconstruction of the geometry of the scene are described, given the 2D projections of the homologous points of the stereo images and the calibration parameters of the stereo system. If only the intrinsic parameters are known, the 3D geometry of the scene is reconstructed by estimating the extrinsic parameters of the system up to an undeterminable scale factor. If the calibration parameters of the stereo system are not available, but only the correspondences between the stereo images are known, the structure of the scene is recovered through an unknown homography transformation.

Francavilla Fontana, Italy
December 2020

Arcangelo Distante Cosimo Distante

Acknowledgments

We thank all the fellow researchers of the Department of Physics of Bari, of the Institute of Intelligent Systems for Automation of the CNR (National Research Council) of Bari, and of the Institute of Applied Sciences and Intelligent Systems “Eduardo Caianiello” of the Lecce Unit, who have pointed out errors and parts to be revised. We mention them in chronological order: Grazia Cicirelli, Marco Leo, Giorgio Maggi, Rosalia Maglietta, Annalisa Milella, Pierluigi Mazzeo, Paolo Spagnolo, Ettore Stella, and Nicola Veneziani. We also thank Arturo Argentieri for his support with the graphic aspects of the figures and the cover. Finally, special thanks go to Maria Grazia Distante, who helped us with the electronic composition of the volumes by verifying the accuracy of the text and the formulas.


Contents

1 Object Recognition
   1.1 Introduction
   1.2 Classification Methods
   1.3 Prior Knowledge and Features Selection
   1.4 Extraction of Significant Features
      1.4.1 Feature Space
      1.4.2 Selection of Significant Features
   1.5 Interactive Method
   1.6 Deterministic Method
      1.6.1 Linear Discriminant Functions
      1.6.2 Generalized Discriminant Functions
      1.6.3 Fisher's Linear Discriminant Function
      1.6.4 Classifier Based on Minimum Distance
      1.6.5 Nearest-Neighbor Classifier
      1.6.6 K-means Classifier
      1.6.7 ISODATA Classifier
      1.6.8 Fuzzy C-means Classifier
   1.7 Statistical Method
      1.7.1 MAP Classifier
      1.7.2 Maximum Likelihood Classifier—ML
      1.7.3 Other Decision Criteria
      1.7.4 Parametric Bayes Classifier
      1.7.5 Maximum Likelihood Estimation—MLE
      1.7.6 Estimation of the Distribution Parameters with the Bayes Theorem
      1.7.7 Comparison Between Bayesian Learning and Maximum Likelihood Estimation
   1.8 Bayesian Discriminant Functions
      1.8.1 Classifier Based on Gaussian Probability Density
      1.8.2 Discriminant Functions for the Gaussian Density
   1.9 Mixtures of Gaussian—MoG
      1.9.1 Parameters Estimation of the Gaussians Mixture with the Maximum Likelihood—ML
      1.9.2 Parameters Estimation of the Gaussians Mixture with Expectation–Maximization—EM
      1.9.3 EM Theory
      1.9.4 Nonparametric Classifiers
   1.10 Method Based on Neural Networks
      1.10.1 Biological Motivation
      1.10.2 Mathematical Model of the Neural Network
      1.10.3 Perceptron for Classification
      1.10.4 Linear Discriminant Functions and Learning
   1.11 Neural Networks
      1.11.1 Multilayer Perceptron—MLP
      1.11.2 Multilayer Neural Network for Classification
      1.11.3 Backpropagation Algorithm
      1.11.4 Learning Mode with Backpropagation
      1.11.5 Generalization of the MLP Network
      1.11.6 Heuristics to Improve Backpropagation
   1.12 Nonmetric Recognition Methods
   1.13 Decision Tree
      1.13.1 Algorithms for the Construction of Decision Trees
   1.14 ID3 Algorithm
      1.14.1 Entropy as a Measure of Homogeneity of the Samples
      1.14.2 Information Gain
      1.14.3 Other Partitioning Criteria
   1.15 C4.5 Algorithm
      1.15.1 Pruning Decision Tree
      1.15.2 Post-pruning Algorithms
   1.16 CART Algorithm
      1.16.1 Gini Index
      1.16.2 Advantages and Disadvantages of Decision Trees
   1.17 Hierarchical Methods
      1.17.1 Agglomerative Hierarchical Clustering
      1.17.2 Divisive Hierarchical Clustering
      1.17.3 Example of Hierarchical Agglomerative Clustering
   1.18 Syntactic Pattern Recognition Methods
      1.18.1 Formal Grammar
      1.18.2 Language Generation
      1.18.3 Types of Grammars
      1.18.4 Grammars for Pattern Recognition
      1.18.5 Notes on Other Methods of Syntactic Analysis
   1.19 String Recognition Methods
      1.19.1 String Matching
      1.19.2 Boyer–Moore String-Matching Algorithm
      1.19.3 Edit Distance
      1.19.4 String Matching with Error
      1.19.5 String Matching with Special Symbol
   References

2 RBF, SOM, Hopfield, and Deep Neural Networks
   2.1 Introduction
   2.2 Cover Theorem on Pattern Separability
   2.3 The Problem of Interpolation
   2.4 Micchelli's Theorem
   2.5 Learning and Ill-Posed Problems
   2.6 Regularization Theory
   2.7 RBF Network Architecture
   2.8 RBF Network Solution
   2.9 Learning Strategies
      2.9.1 Centers Set and Randomly Selected
      2.9.2 Selection of Centers Using Clustering Techniques
   2.10 Kohonen Neural Network
      2.10.1 Architecture of the SOM Network
      2.10.2 SOM Applications
   2.11 Network Learning Vector Quantization-LVQ
      2.11.1 LVQ2 and LVQ3 Networks
   2.12 Recurrent Neural Networks
      2.12.1 Hopfield Network
      2.12.2 Application of Hopfield Network to Discrete States
      2.12.3 Continuous State Hopfield Networks
      2.12.4 Boltzmann Machine
   2.13 Deep Learning
      2.13.1 Deep Traditional Neural Network
      2.13.2 Convolutional Neural Networks-CNN
      2.13.3 Operation of a CNN Network
      2.13.4 Main Architectures of CNN Networks
   References

3 Texture Analysis
   3.1 Introduction
   3.2 The Visual Perception of the Texture
      3.2.1 Julesz's Conjecture
      3.2.2 Texton Statistics
      3.2.3 Spectral Models of the Texture
   3.3 Texture Analysis and its Applications
   3.4 Statistical Texture Methods
      3.4.1 First-Order Statistics
      3.4.2 Second-Order Statistics
      3.4.3 Higher Order Statistics
      3.4.4 Second-Order Statistics with Co-Occurrence Matrix
      3.4.5 Texture Parameters Based on the Co-Occurrence Matrix
   3.5 Texture Features Based on Autocorrelation
   3.6 Texture Spectral Method
   3.7 Texture Based on the Edge Metric
   3.8 Texture Based on the Run Length Primitives
   3.9 Texture Based on MRF, SAR, and Fractals Models
   3.10 Texture by Spatial Filtering
      3.10.1 Spatial Filtering with Gabor Filters
   3.11 Syntactic Methods for Texture
   3.12 Method for Describing Oriented Textures
      3.12.1 Estimation of the Dominant Local Orientation
      3.12.2 Texture Coherence
      3.12.3 Intrinsic Images with Oriented Texture
   3.13 Tamura's Texture Features
   References

4 Paradigms for 3D Vision
   4.1 Introduction to 3D Vision
   4.2 Toward an Optimal 3D Vision Strategy
   4.3 Toward the Marr's Paradigm
   4.4 The Fundamentals of Marr's Theory
      4.4.1 Primal Sketch
      4.4.2 Toward a Perceptive Organization
      4.4.3 The Gestalt Theory
      4.4.4 From the Gestalt Laws to the Marr Theory
      4.4.5 2.5D Sketch of Marr's Theory
   4.5 Toward 3D Reconstruction of Objects
      4.5.1 From Image to Surface: 2.5D Sketch Map Extraction
   4.6 Stereo Vision
      4.6.1 Binocular Fusion
      4.6.2 Stereoscopic Vision
      4.6.3 Stereopsis
      4.6.4 Neurophysiological Evidence of Stereopsis
      4.6.5 Depth Map from Binocular Vision
      4.6.6 Computational Model for Binocular Vision
      4.6.7 Simple Artificial Binocular System
      4.6.8 General Binocular System
   4.7 Stereo Vision Algorithms
      4.7.1 Point-Like Elementary Structures
      4.7.2 Local Elementary Structures and Correspondence Calculation Methods
      4.7.3 Sparse Elementary Structures
      4.7.4 PMF Stereo Vision Algorithm
   References

5 Shape from Shading
   5.1 Introduction
   5.2 The Reflectance Map
      5.2.1 Gradient Space
   5.3 The Fundamental Relationship of Shape from Shading for Diffuse Reflectance
   5.4 Shape from Shading-SfS Algorithms
      5.4.1 Shape from Stereo Photometry with Calibration
      5.4.2 Uncalibrated Stereo Photometry
      5.4.3 Stereo Photometry with Calibration Sphere
      5.4.4 Limitations of Stereo Photometry
      5.4.5 Surface Reconstruction from the Orientation Map
   5.5 Shape from Texture
   5.6 Shape from Structured Light
      5.6.1 Shape from Structured Light with Binary Coding
      5.6.2 Gray Code Structured Lighting
      5.6.3 Pattern with Gray Level
      5.6.4 Pattern with Phase Modulation
      5.6.5 Pattern with Phase Modulation and Binary Code
      5.6.6 Methods Based on Colored Patterns
      5.6.7 Calibration of the Camera-Projector Scanning System
   5.7 Shape from (de)Focus
      5.7.1 Shape from Focus (SfF)
      5.7.2 Shape from Defocus (SfD)
   References

6 Motion Analysis
   6.1 Introduction
   6.2 Analogy Between Motion Perception and Depth Evaluated with Stereo Vision
   6.3 Toward Motion Estimation
      6.3.1 Discretization of Motion
      6.3.2 Motion Estimation—Continuous Approach
      6.3.3 Motion Estimation—Discrete Approach
      6.3.4 Motion Analysis from Image Difference
      6.3.5 Motion Analysis from the Cumulative Difference of Images
      6.3.6 Ambiguity in Motion Analysis
   6.4 Optical Flow Estimation
      6.4.1 Horn–Schunck Method
      6.4.2 Discrete Least Squares Horn–Schunck Method
      6.4.3 Horn–Schunck Algorithm
      6.4.4 Lucas–Kanade Method
      6.4.5 BBPW Method
      6.4.6 Optical Flow Estimation for Affine Motion
      6.4.7 Estimation of the Optical Flow for Large Displacements
      6.4.8 Motion Estimation by Alignment
      6.4.9 Motion Estimation with Techniques Based on Interest Points
      6.4.10 Tracking Based on the Object Dynamics—Kalman Filter
   6.5 Motion in Complex Scenes
      6.5.1 Simple Method of Background Subtraction
      6.5.2 BS Method with Mean or Median
      6.5.3 BS Method Based on the Moving Gaussian Average
      6.5.4 Selective Background Subtraction Method
      6.5.5 BS Method Based on Gaussian Mixture Model (GMM)
      6.5.6 Background Modeling Using Statistical Method Kernel Density Estimation
      6.5.7 Eigen Background Method
      6.5.8 Additional Background Models
   6.6 Analytical Structure of the Optical Flow of a Rigid Body
      6.6.1 Motion Analysis from the Optical Flow Field
      6.6.2 Calculation of Collision Time and Depth
      6.6.3 FOE Calculation
      6.6.4 Estimation of Motion Parameters for a Rigid Body
   6.7 Structure from Motion
      6.7.1 Image Projection Matrix
      6.7.2 Methods of Structure from Motion
   References

7 Camera Calibration and 3D Reconstruction
   7.1 Introduction
   7.2 Influence of the Optical System
   7.3 Geometric Transformations Involved in Image Formation
   7.4 Camera Calibration Methods
      7.4.1 Tsai Method
      7.4.2 Estimation of the Perspective Projection Matrix
      7.4.3 Zhang Method
      7.4.4 Stereo Camera Calibration
   7.5 Stereo Vision and Epipolar Geometry
      7.5.1 The Essential Matrix
      7.5.2 The Fundamental Matrix
      7.5.3 Estimation of the Essential and Fundamental Matrix
      7.5.4 Normalization of the 8-Point Algorithm
      7.5.5 Decomposition of the Essential Matrix
      7.5.6 Rectification of Stereo Images
      7.5.7 3D Stereo Reconstruction by Triangulation
   References

Index

1 Object Recognition

1.1 Introduction

The ability to recognize objects is an essential feature of all living organisms. Various creatures have different abilities and modes of recognition. The sensorial nature of the available data and the modality of their interpretation are very important. Evolved organisms like humans can recognize other humans through sight, voice, or the way they write, while less evolved organisms, like the dog, can recognize other animals or humans simply by using the olfactory and visual sense organs. These activities are classified as recognition. While a human observer recognizes even complex objects in an apparently easy and timely manner, for a vision machine the recognition process is difficult, requires considerable calculation time, and the results are not always optimal. The task of a vision machine is to automatically recognize the objects that appear in the scene. Normally, a generic object observed for the recognition process is called a pattern. In several applications, the pattern adequately describes a generic object for the purpose of recognizing it. A pattern recognition system can be specialized to recognize people, animals, territory, artifacts, electrocardiograms, biological tissue, etc. The most general ability of a recognition system is to discriminate within a population of objects and determine those that belong to the same class. For example, an agro-food company needs a vision system to recognize different qualities of fruit (apples, pears, etc.), depending on the degree of ripeness and the size. This means that the recognition system will have to examine the whole population and classify each fruit, thus obtaining different groupings that identify certain quality classes. The recognition system is de facto a classification system. If we consider, for simplicity, the population of only apples, to determine the class to which each apple belongs it is necessary that each apple be adequately described, that is, to


find its intrinsic characteristics (features),¹ functional to determining the correct class of membership (classification). The expert, in this case, may propose to characterize the apple population using the color (which indicates the state of ripeness) and the geometric shape (almost circular), possibly a measure of the area; to give greater robustness to the classification process, the weight could also be used as a characteristic. We have actually illustrated how to describe a pattern (in this case, the apple pattern is seen as a set of features that characterizes the apple object), an activity that in the literature is known as the selection of significant features (feature selection). The activity of producing the measures associated with the characteristics of the pattern is instead called feature extraction. The selection and extraction of the features are important activities in the design of a recognition system. In various applications, it is possible to have a priori knowledge of the population of the objects to be classified, because we know the sample patterns from which useful information can be extracted for the decision (decision-making) to associate each individual of the population with a specific class. These sample patterns (e.g., a training set) are used by the recognition system to learn meaningful information about the population (extraction of statistical parameters, relevant features, etc.). In this context of object recognition, clustering (group analysis) becomes central, i.e., the task of grouping a collection of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those of the other groups. In the example of the apples, the concept of similarity is associated with the color, area, and weight measurements used as descriptors of the apple pattern to define apple clusters of different quality. A system for the recognition of very different objects, for example the apple and pear patterns, requires a more complex description of the patterns, in terms of the selection and extraction of significant features and of the clustering methods, for a correct classification. In this case, for the recognition of the quality classes of the apple or pear pattern, it is reasonable to add the shape feature (elliptical), in the context of feature selection, to discriminate between the two patterns. The recognition process compares the features of the unknown objects with the features of the sample patterns, in order to uniquely identify the class to which they belong. In essence, classification is the final goal of a recognition system based on clustering. In this introduction to the analysis of the observed data of a population of objects, two activities have emerged: the selection and the extraction of the features. The feature extraction activity also has the task of expressing the features with an adequate metric, to be used appropriately by the decision component, that is, the classifier, which determines, on the basis of the chosen clustering method, which class to associate with each object. In reality, the features of an object do not always describe it correctly; they often represent only a good approximation, and therefore, even when developing a complex classification process with different recognition strategies, it can be difficult to assign the object to a unique class of membership.

¹From now on, the two words “characteristic” and “feature” will be used interchangeably.
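As an illustration of the apple example above, here is a minimal Python sketch (the feature values, class names, and the nearest-prototype rule are hypothetical illustrations, not taken from the book) in which each apple is described by a feature vector of color, area, and weight, and an unknown apple is assigned to the quality class of the nearest prototype learned from a small training set.

    import numpy as np

    # Hypothetical training set: each apple is a feature vector
    # [mean hue, area in cm^2, weight in g]; labels are quality classes.
    X_train = np.array([[0.10, 60.0, 150.0],   # ripe, large
                        [0.12, 58.0, 145.0],
                        [0.35, 40.0, 100.0],   # unripe, small
                        [0.33, 42.0, 105.0]])
    y_train = np.array(["class_A", "class_A", "class_B", "class_B"])

    # Learn one prototype (mean feature vector) per class from the samples.
    prototypes = {c: X_train[y_train == c].mean(axis=0) for c in set(y_train)}

    def classify(x):
        """Assign the unknown pattern x to the class of the nearest prototype."""
        return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

    print(classify(np.array([0.11, 59.0, 148.0])))  # expected: class_A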


1.2 Classification Methods

The first group analysis studies were introduced to classify the psychological traits of personality [1,2]. Over the years, several disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, ...) and application sectors (robotics, remote sensing, artificial vision, ...) have seen several researchers propose different clustering methods and develop different algorithms based on different types of clusters. Although the proposed algorithms have a common purpose, they differ in the properties attributed to the clusters and in the model with which the clusters are defined (connectivity, statistical distribution, density, ...). The diversity of disciplines, especially those of data mining² and machine learning, has led to subtle differences, especially in the use of the results, and to sometimes contradictory terminology, perhaps caused by the different objectives. For example, in data mining the dominant interest is the automatic extraction of groups, whereas in automatic classification the discriminating power of the classes to which the patterns belong is fundamental. The topics of this chapter overlap aspects related to machine learning and aspects of recognition based on statistical methods. Object classification methods can be divided into two categories.

Supervised Methods, i.e., methods based on a priori knowledge of the data (cluster model, decision rules, ...) to better define the discriminating power of the classifier. Given a population of objects and a set of measures with which to describe each object, represented by a pattern vector x, the goal of these methods is to assign the pattern to one of the N possible classes Ci, i = 1, ..., N. A decision-making strategy is adopted to partition the space of the measures into N regions Ωi, i = 1, ..., N. The expectation is to use rules or decision functions that yield separate, non-overlapping regions Ωi, bounded by separation surfaces. This can be achieved by iterating the classification process after analyzing the decision rules and the adequacy of the measures associated with the patterns. This category also includes methods in which the probability density function (PDF) of the feature vector for a given class is assumed to be known. In this category, the following methods will be described (a minimal sketch of a decision-function classifier is given below):

(a) Deterministic models
(b) Statistical models
(c) Neural models
(d) Nonmetric models (syntactic models and decision trees)

²Literally, it means automatic data extraction, normally coming from a large population of data.
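As an illustration of the supervised setting just described, the following minimal Python sketch (the weights, biases, and pattern values are hypothetical) evaluates one linear decision function per class and assigns the pattern x to the class with the largest score, which implicitly partitions the measurement space into N non-overlapping regions.

    import numpy as np

    # Hypothetical linear decision functions g_i(x) = w_i . x + b_i, one per class.
    # The pattern x is assigned to the class whose decision function is largest.
    W = np.array([[ 1.0,  0.5],    # weights w_1
                  [-0.8,  1.2],    # weights w_2
                  [ 0.2, -1.0]])   # weights w_3
    b = np.array([0.0, 0.3, -0.1])  # biases b_1..b_3

    def decide(x):
        scores = W @ x + b                 # g_i(x) for i = 1..3
        return int(np.argmax(scores)) + 1  # index of the winning class C_i

    print(decide(np.array([0.7, 0.2])))    # assigns the pattern to one of C_1..C_3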


Non-Supervised Methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong. In the statistical literature, these methods are often referred to as clustering. In this context, the objects initially cannot be labeled, and the goal is to explore the data to find an approach that groups them, once features have been selected that distinguish one group from another. The assignment of the labels to each cluster takes place subsequently, with the intervention of the expert. An intermediate phase can be considered, using a supervised approach on the first classes generated to extract representative samples of some classes. This category includes the hierarchical and partitioning clustering methods, which in turn differ in the clustering criteria (models) adopted (a minimal single-linkage sketch is given after the list):

(a) Single linkage (nearest neighbor)
(b) Complete linkage (furthest neighbor)
(c) Centroid linkage
(d) Sum of squares
(e) Partitioning
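As referenced above, here is a minimal Python sketch of single-linkage (nearest-neighbor) agglomerative clustering on hypothetical unlabeled 2D patterns; no prior class knowledge is used, and clusters are merged according to the smallest distance between their closest members.

    import numpy as np

    # Hypothetical unlabeled patterns (2D feature vectors).
    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

    # Start with one singleton cluster per pattern.
    clusters = [[i] for i in range(len(X))]

    def single_linkage(a, b):
        """Cluster distance = smallest pairwise distance (nearest neighbor)."""
        return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

    # Agglomerate until the desired number of clusters (here 3) is reached.
    while len(clusters) > 3:
        p, q = min(((p, q) for p in range(len(clusters))
                            for q in range(p + 1, len(clusters))),
                   key=lambda pq: single_linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters[q]
        del clusters[q]

    print(clusters)   # e.g., [[0, 1], [2, 3], [4]]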

1.3 Prior Knowledge and Features Selection

The description of an object is defined by a set of scalar quantities of the object itself, which are combined to constitute a feature vector x. Considering the features studied in Chap. 8 (Forms) of Vol. I, an object can be completely described by a vector x whose components x = (x1, x2, ..., xM) (represented, for example, by measures such as dimensions, compactness, perimeter, area, etc.) capture some of the geometric and topological information of the object. In the case of multispectral images, the problem of feature selection becomes the choice of the most significant bands, where the feature vector represents the pixel pattern with the radiometric information related to each band (for example, the spectral bands from the visible to the infrared). In other contexts, for example in the case of a population of economic data, the problem of feature selection may be more complex because some measurements are not accessible. An approach to the analysis of the features to be selected consists in considering the nature of the available measures, i.e., whether they are of a physical nature (as in the case of spectral bands and color components), of a structural nature (such as geometric ones), or derived from mathematical transformations of the previous measures. Physical features are derived from sensory measurements, for which information on the behavior of the sensor and on the level of uncertainty of the observed measurements may be available. Furthermore, information can be obtained (or found experimentally) on the level of correlation of the measurements observed for the various sensors. Similarly, information on the characteristics of structural measures can be derived.
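As a minimal Python sketch of how such a feature vector might be assembled from geometric measures of a segmented region (the binary mask and the particular choice of features are hypothetical illustrations, not the book's prescription):

    import numpy as np

    # Hypothetical binary mask of a segmented object (1 = object pixel).
    mask = np.zeros((20, 20), dtype=int)
    mask[5:15, 6:14] = 1

    area = mask.sum()

    # Crude perimeter estimate: object pixels with at least one 4-connected
    # background neighbor (padding treats the image border as background).
    padded = np.pad(mask, 1)
    neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                 padded[1:-1, :-2] + padded[1:-1, 2:])
    perimeter = int(((mask == 1) & (neighbors < 4)).sum())

    # Compactness, a dimensionless shape measure: 4*pi*area / perimeter^2
    # (close to 1 for a disc, smaller for elongated shapes).
    compactness = 4 * np.pi * area / perimeter**2

    x = np.array([area, perimeter, compactness])   # feature vector of the object
    print(x)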

Fig. 1.1 Functional scheme of an object recognition system based on knowledge


Fig. 1.2 Functional scheme of an object recognition system based on template matching

It is useful to carry out the analysis and selection of the features by also considering the strategies to be adopted in the recognition process. In general, the mechanism of the recognition process involves the analysis of the features extracted from the observed data (population of objects, images, ...) and the formulation of some hypotheses to define elements of similarity between the observed data and those extracted from the sample data (in the supervised context). Furthermore, it provides for the formal verification of the hypotheses using the models of the objects, possibly reformulating the approach to object similarity. Hypothesis generation can also be useful to reduce the search domain by considering only some features. Finally, the recognition process selects among the various hypotheses, as the correct pattern, the one with the highest value of similarity on the basis of evidence (see Fig. 1.1). Typical object recognition vision systems formulate hypotheses and identify the object on the basis of the best similarity. Vision systems based on a priori knowledge consider hypotheses only as the starting point, while the verification phase is assigned the task of selecting the object. For example, in the recognition of an object based on the comparison of the characteristics extracted from the observed scene with those of the sample prototypes, the approach called template matching is used, and the hypothesis formation phase is completely eliminated (see Fig. 1.2). From the previous considerations, it emerges that a recognition system is characterized by different components:

1. types of a priori knowledge,
2. types of features extracted and to be considered,
3. type of comparison between the features extracted from the observed data and those of the model,
4. how hypotheses are formed,
5. verification methods.


Points (1), (2), and (3) are mutually interdependent. The representation of an object depends on the type of object itself. In some cases, certain geometric and topological characteristics are significant; in other cases, they may be of little significance and/or redundant. Consequently, the feature extraction component must determine, among those foreseen for the representation of the object, the most adequate ones, considering their robustness and the difficulty of extracting them from the input data. At the same time, in the selection of features, those that can best be extracted from the model and that can best be used for comparison must be selected. Using many exhaustive features can make the recognition process more difficult and slow. Hypothesis formation is a normally heuristic approach that can reduce the search domain. In relation to the type of application, a priori knowledge can be formalized by associating a priori probabilities or confidence levels with the various model objects. These prediction measures constitute the elements to evaluate the likelihood of the presence of objects based on the determined characteristics. The verification of these hypotheses leads to the methods of selecting the models of the objects that best resemble those extracted from the input data. All plausible hypotheses must be examined to verify the presence of the object or discard it. In machine vision applications, where geometric modeling is used, objects can be modeled and verified using camera location information or other known information about the scene (e.g., known references in the scene). In other applications, the hypotheses cannot be verified and one can proceed with non-supervised approaches. From the above considerations, the functional scheme of a recognition system can be modified by eliminating the verification phase or the hypothesis formation phase (see Fig. 1.1).

1.4 Extraction of Significant Features

In Chap. 5 Vol. II on Segmentation and in Chap. 8 Vol. I on Forms, we examined the image processing algorithms that extract some significant characteristics to describe the objects of a scene. Depending on the type of application, the object to be recognized as a physical entity can be anything. In the study of terrestrial resources and their monitoring, the objects to be recognized are, for example, the various types of woods and crops, lakes, rivers, roads, etc. In the industrial sector, on the other hand, for the vision system of a robotic cell, the objects to be recognized are, for example, the individual components of a more complex object to be assembled or inspected. For the food industry, a vision system can be used to recognize different qualities of fruit (apples, pears, etc.) in relation to the degree of ripeness and size. In all these examples, similar objects can be divided into several distinct subsets. This new grouping of objects makes sense only if we specify the meaning of a similar object and if we find a mechanism that correctly separates similar objects to form so-called classes of objects. Objects with common features are considered similar. For example, the set of ripe, first-quality apples are those characterized by a particular color, for example, yellow-green, with a given almost circular geometric shape,


and a measure of the area higher than a certain threshold value. The mechanism that associates a given object to a certain class is called clustering. The object recognition process essentially uses the most significant features of objects to group them by classes (classification or identification). Question: is the number of classes known before the classification process? Answer: normally the classes are known and intrinsically defined in the specifications imposed by the application, but often they are not known and will have to be discovered by analyzing the observed data. For example, in the application that automatically separates classes of apples, the number of classes is imposed by the application itself (for commercial reasons, three classes would be sufficient), deciding to classify different qualities of apples considering the parameters of shape, color, weight, and size. In the application of remotely sensed images, the classification of forests, in relation to their deterioration caused by environmental impact, should take place without knowing in advance the number of classes, which should automatically emerge in relation to the types of forests actually damaged or polluted. The selection of significant features is closely linked to the application. It is usually made based on experience and intuition. Some considerations can, however, be made for an optimal choice of the significant features. The first consideration concerns their ability to be discriminating, i.e., different values must correspond to different classes. In the previous example, the measurement of the area was a significant feature for the classification of apples into three classes of different sizes (small, medium, and large). The second consideration regards reliability: similar values of a feature must always identify the same class for all objects belonging to that class. Considering the same example, it can happen that a large apple has a different color than that of the class of ripe apples. In this case, the color may be a nonsignificant feature. The third consideration relates to the correlation between the various features. The two features area and weight of the apple are strongly correlated, since it is foreseeable that the weight increases in proportion to the measure of the area. This means that these two characteristics express the same property and are therefore redundant features, which it may not be meaningful to select together for the classification process. Correlated features can instead be used together when one wants to attenuate the noise. In fact, in multisensory applications, it occurs that some sensors are strongly correlated with each other but the relative measurements are affected by different noise models. In other situations, it is possible to accept the redundancy of a feature on one condition, namely that it is not correlated with at least one of the selected features (in the case of the ripe-apple class, it has been seen that the color-area features alone may not discriminate; in this case, the feature weight becomes useful, although very correlated with the area, considering the plausible hypothesis that ripe apples are on average less heavy than less mature ones).
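As a concrete illustration of the correlation analysis discussed above, the following minimal Python/NumPy sketch computes the Pearson correlation matrix of three candidate features to flag strongly correlated, hence potentially redundant, pairs. The feature names and the apple measurements are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical measurements for N apples: each row is a pattern
# with features (area, weight, mean hue). Values are illustrative only.
X = np.array([
    [28.0, 140.0, 0.31],
    [35.5, 176.0, 0.29],
    [22.1, 109.0, 0.42],
    [40.2, 205.0, 0.27],
    [30.7, 150.0, 0.35],
])
feature_names = ["area", "weight", "hue"]

# Pearson correlation matrix between features (rowvar=False: columns = features)
R = np.corrcoef(X, rowvar=False)

# Flag pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.9
for j in range(len(feature_names)):
    for k in range(j + 1, len(feature_names)):
        if abs(R[j, k]) > threshold:
            print(f"{feature_names[j]} and {feature_names[k]} are strongly "
                  f"correlated (r = {R[j, k]:+.2f}): candidate redundancy")
```

Strongly correlated pairs found this way are candidates either for elimination or, as noted above, for joint use when the goal is noise attenuation.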


1.4.1 Feature Space

The number of selected features must be adequate and contained to limit the level of complexity of the recognition process. A classifier that uses few features can produce inappropriate results. In contrast, the use of a large number of features leads to an exponential growth of computational resources without the guarantee of obtaining good results. The set of all the components xi of the pattern vector x = (x1, x2, . . . , xM) spans the M-dimensional feature space. We have already considered how, for remote sensing images, different spectral characteristics are available for each pixel pattern (infrared, visible, etc.) to extract from the observed data the homogeneous regions corresponding to different areas of the territory. In these applications, the classification of the territory using a single characteristic (a single spectral component) would be difficult. In this case, the analyst must select the significant bands, filtered from the noise, and eventually reduce the dimensionality of the pixel patterns. In other contexts, the features associated with each object are extracted producing features with higher level information (for example, normalized spatial moments, Fourier descriptors, etc., described in Chap. 8 Vol. I on Forms) related to elementary regions representative of objects. An optimal feature selection is obtained when the pattern vectors x belonging to the same class lie close to each other when projected into the feature space. These vectors constitute the set of similar objects represented in a single class (cluster). In the feature space, it is possible to accumulate different classes (corresponding to different types of objects), which can be separated using appropriate discriminating functions. The latter represent the cluster separation hypersurfaces in the M-dimensional feature space, which control the classification process. Hypersurfaces can be simplified with hyperplanes, and in this case we speak of linearly separable discriminating functions. Figure 1.3 shows a two-dimensional (2D) example of the feature space (x1, x2) where 4 clusters are represented, separated by linear and nonlinear discriminant functions, representing the set of homogeneous pixels corresponding in the spatial domain to 5 regions. This example shows that the two selected features x1 and x2 exhibit an adequate discriminating ability to separate the population of pixels in the feature space (x1, x2), which in the spatial domain belong to 5 regions corresponding to 4 different classes (the example schematizes a portion of land wet by the sea with a river: class 1 indicates the ground, 2 the river, 3 the shoreline, and 4 the sea). The ability of the classification process is based on the ability to separate without error the various clusters, which in different applications are located very close to each other, or are superimposed, generating an incorrect classification (see Fig. 1.4). This example demonstrates that the selected features do not intrinsically exhibit a good discriminating power to separate in the feature space the patterns that belong to different classes, regardless of the discriminant function that describes the cluster separation surface.



Fig. 1.3 Spatial domain represented by a multispectral image with two bands and 2D domain of features where 4 homogeneous pattern classes are grouped corresponding to different areas of the territory: bare terrain, river, shoreline, and sea


Fig. 1.4 Features domain (x1 , x2 ) with 3 pattern classes and the graphical representation (2D and 1D) of the relevant Gaussian density functions


The following considerations emerge from the previous examples:

(a) The characteristics of the objects must be analyzed, eliminating those that are strongly correlated and do not help the discrimination between objects. Correlated features can instead be used to filter out noise between related features.
(b) The underlying problem of a classifier is that the classes of objects are not always well separated in the feature space. Very often the feature pattern vectors can belong to more than one cluster (see Fig. 1.4) and the classifier can make mistakes, associating some of them to an incorrect class of membership. This can be avoided by eliminating features that have little discriminating power (for example, the feature x1 in Fig. 1.4 does not separate the object classes {A, B, C}, while x2 separates well the classes {A, C}).


(c) A solution to the previous point is obtained by performing transformations on the measures of the features such as to produce, in the feature space, a minimization of the intra-class distance, i.e., the mean quadratic distance between patterns of the same class, and a maximization of the inter-class distance, that is, the mean quadratic distance between patterns that belong to different classes (see Fig. 1.4); a minimal sketch of such a per-feature separability measure is given after this list.
(d) In the selection of the features, the opportunity to reduce their dimensionality must be considered. This can be accomplished by trial, verifying the classifier's performance after filtering out some features. A more direct way is to use the principal component transform (see Sect. 2.10.1 Vol. II), which estimates the level of significance of each feature. The objective then becomes the reduction of the dimensionality from M dimensions to a reduced space of p dimensions containing the most significant components (extraction of significant features).
(e) The significance of the features can also be evaluated with nonlinear transformations, by evaluating the performance of the classifier and then giving a different weight to the various features. The extraction of significant features can also be considered on the basis of a priori knowledge of the probability distribution of the classes. In fact, if the latter is known for each class, the selection of the features can be done by minimizing the entropy to evaluate the similarity of the classes.
(f) A classifier does not always obtain better results if a considerable number of features is used; indeed, in different contexts it is shown that a few properly selected features prove to be more discriminating.
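The sketch announced in point (c) follows. It scores each feature by the ratio of its inter-class scatter to its intra-class scatter, so that features behaving like x1 in Fig. 1.4 receive a low score and become candidates for elimination. The labeled data are synthetic and the function is only an illustrative ranking criterion, not a procedure prescribed by the text.

```python
import numpy as np

def scatter_ratio(X, y):
    """Per-feature ratio of inter-class to intra-class scatter.

    X: (N, M) pattern matrix, y: (N,) integer class labels.
    Returns an array of M scores; higher means more discriminating.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    intra = np.zeros(X.shape[1])
    inter = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        intra += ((Xc - mu_c) ** 2).sum(axis=0)        # within-class scatter
        inter += len(Xc) * (mu_c - overall_mean) ** 2  # between-class scatter
    return inter / intra

# Toy 2D example: the two classes differ mainly along the second feature
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
X1 = rng.normal([0.3, 3.0], [1.0, 0.3], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

print(scatter_ratio(X, y))  # the second feature gets a much higher score
```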

1.4.2 Selection of Significant Features

Let us analyze with an example (see Fig. 1.4) the hypothesized normal distribution of three classes of objects in the one-dimensional feature spaces x1 and x2 and in the two-dimensional space (x1, x2). It is observed that class A is clearly separated from classes B and C, while the latter show overlapping zones in both the features x1 and x2. From the analysis of the features x1 and x2, it is observed that they are correlated (the patterns are distributed in a dominant way along the diagonal of the plane (x1, x2)), that is, to similar values of x1 correspond analogous values of x2. From the analysis of the one-dimensional distributions, it is observed that class A is well separable with the single feature x2, while classes B and C cannot be accurately separated with the distributions of the features x1 and x2, as the latter are very correlated. In general, it is convenient to select only those features that have a high level of orthogonality, that is, the distribution of the classes in the various features should be very different: while in one feature the distribution is located toward the low values, in another feature the same class must have a different distribution. This can be achieved by selecting or generating uncorrelated features. From the geometric point of view, this can be visualized by imagining a transformation of the original variables such that, in the new system of orthogonal axes, they are ordered in terms of the amount of variance of the original data (see Fig. 1.5).

Fig. 1.5 Geometric interpretation of the PCA transform. The transform from the original measures x1, x2 to the main components y1, y2 is equivalent to rotating the coordinate axes until the maximum variance is obtained when all the patterns are projected on the OY1 axis

In the chapter on Linear Transformations, Sect. 2.10.1 Vol. II, we have described the proper orthogonal decomposition transform—POD, better known as principal component analysis—PCA, which has this property and represents the original data in a significant way through a small group of new variables, precisely the principal components. With this data transformation, the expectation is to describe most of the information (variance) of the original data with a few components. The PCA is a direct transformation of the original data; no assumption is made about their distribution or the number of classes present, and it therefore behaves like an unsupervised feature extraction method. To better understand how the PCA behaves, consider the pattern distribution (see Fig. 1.5) represented by the features (x1, x2), which can express, for example, length measurements (meters, centimeters, ...) and weight measurements (grams, kilograms, ...), that is, quantities with different units of measurement. For a better graphical visibility of the pattern distribution in the domain of the features (x1, x2), we imagine that these realizations have a Gaussian distribution. With this hypothesis, in the feature domain the patterns are arranged in ellipsoidal form. If we now rotate the axes from the reference system (x1, x2) to the system (y1, y2), the ellipsoidal shape of the patterns remains the same while only the coordinates change. In this new system, there can be an advantage in representing these realizations. Since the axes are rotated, the relationship between the two reference systems can be expressed as follows:

(y1, y2)^T = R(θ) (x1, x2)^T   with   R(θ) = [[cos θ, sin θ], [−sin θ, cos θ]],   or

y1 = x1 cos θ + x2 sin θ        y2 = −x1 sin θ + x2 cos θ        (1.1)

where θ is the angle between the homologous (horizontal and vertical) axes of the two reference systems. It can be observed from these equations that the new coordinate, i.e., the transformed feature y1, is a linear combination of the length and weight measurements (with both coefficients positive), while the second new coordinate, i.e., the feature y2, is again a linear combination of the length and weight measurements but with coefficients of opposite sign. That said, we observe from Fig. 1.5 that there is an imbalance in the pattern distribution, with a more pronounced dispersion along the first axis.

12

1 Object Recognition

This means that the projection of the patterns on the new axis y1 can be a good approximation of the entire ellipsoidal distribution. This is equivalent to saying that the set of realizations represented by the ellipsoid can be significantly represented by the single new feature y1 = x1 cos θ + x2 sin θ instead of indicating for each pattern the original measures x1 and x2. It follows that, considering only the new feature y1, we get a dimensionality reduction from 2 to 1 to represent the population of the patterns. But be careful, the concept of meaningful representation with the new feature y1 must be specified. In fact, y1 can take different values as the angle θ varies. It is, therefore, necessary to select a value of θ which gives the best representation of the relationship that exists between the population patterns in the feature domain. This is guaranteed by selecting the value of θ which minimizes the displacement of the points, caused by the projection, with respect to the original position. Given that the coordinates of the patterns with respect to the Y1 axis are their orthogonal projections on the OY1 axis, the solution is given by the line whose distance from the points is minimal. Indicating with Pk a generic pattern and with P'k its orthogonal projection on the axis OY1, the orientation of the best line is the one that minimizes the sum, given by [3]

Σ_{i=1}^{N} |Pi P'i|²

If the triangle Pk O P'k is considered and the Pythagorean theorem is applied, we obtain

|O Pk|² = |Pk P'k|² + |O P'k|²

Repeating for all the N patterns, adding and dividing by N − 1, we have the following:

(1/(N−1)) Σ_{k=1}^{N} |O Pk|² = (1/(N−1)) Σ_{k=1}^{N} |Pk P'k|² + (1/(N−1)) Σ_{k=1}^{N} |O P'k|²        (1.2)

Analyzing (1.2), it results that the first member is constant for all the patterns and is independent of the reference system. It follows that choosing the orientation of the OY1 axis is equivalent to minimizing the expression of the first addend of (1.2) or maximizing the expression of the second addend of the same equation. In the hypothesis that O represents the center of mass of all the patterns,3 the expression of the second addend, (1/(N−1)) Σ_{k=1}^{N} |O P'k|², corresponds to the variance of the pattern projections on the new axis Y1. Choosing the OY1 axis that minimizes the sum of the squares of the perpendicular distances from this axis is equivalent to selecting the OY1 axis in such a way that the projections of the patterns on it have the maximum variance.

3 Without losing generality, this is achieved by expressing both the input variables xi and the outputs yi in terms of deviations from the mean.


These assumptions are the basis of the search for principal components formulated by Hotelling [4]. It should be noted that with the principal component approach reported above, the sum of the squared distances between pattern and axis is minimized, while the least squares approach is different: there, the sum of the squared horizontal distances of the patterns from the line (represented by the OY1 axis in this case) is minimized. This leads to a different solution (linear regression). Returning to the principal component approach, the second component is defined in the direction orthogonal to the first and represents the maximum of the remaining variance of the pattern distribution. For higher-dimensional patterns, subsequent components are obtained in a similar way. It is understood that the peculiarity of the PCA is to represent a set of patterns in the most spread-out way possible along the principal axes. Imagining a Gaussian distribution of M-dimensional patterns, the dispersion assumes an ellipsoidal shape,4 the axes of which are oriented along the principal components. In Sect. 2.10.1 Vol. II, we have shown that to calculate the axes of these ellipsoidal structures (they represent the density level sets), the covariance matrix K_S was calculated for a multispectral image S of N pixels with M bands (which here represent the variables of each pixel pattern). Recall that the covariance matrix K_S integrates the variances of the variables and the covariances between different variables. Furthermore, in the same paragraph, we showed how to obtain the transform to the principal components, Y_PCA = A · S, through the diagonalization of the covariance matrix K_S, obtaining the orthogonal matrix A having the eigenvectors ak, k = 1, 2, . . . , M, and the diagonal matrix of the eigenvalues λk, k = 1, 2, . . . , M. It follows that the i-th principal component is given by

yi = ai^T x        i = 1, . . . , M        (1.3)

with the variances and covariances of the new components expressed by:

Var(yi) = λi    i = 1, . . . , M        Cov(yi, yj) = 0    for i ≠ j        (1.4)

4 Indeed, an effective way to represent the graph of the multivariate normal density function N(0, Σ) is by level curves of value c. In this case, the function is positive and the level curves to be examined concern values c > 0, with a positive definite and invertible covariance matrix. It is shown that the resulting level curve is the ellipsoid x^T Σ^{-1} x = c centered at the origin. In the reference system of the principal components, expressed in the basis of the eigenvectors of the covariance matrix Σ, the equation of the ellipsoid becomes y1²/λ1 + · · · + yM²/λM = c, with the lengths of the semi-axes equal to √λ1, . . . , √λM, where the λi are the eigenvalues of the covariance matrix. For M = 2, we have elliptic contour lines. If μ ≠ 0, the ellipsoid is centered at μ.


Finally, we highlight the property that the initial total variance of the pattern population is equal to the total variance of the principal components:

Σ_{i=1}^{M} Var(xi) = σ1² + · · · + σM² = Σ_{i=1}^{M} Var(yi) = λ1 + · · · + λM        (1.5)

although distributed with different weights, decreasing over the principal components given the decreasing order of the eigenvalues: λ1 ≥ λ2 ≥ · · · ≥ λM ≥ 0. In the 2D geometrical representation of the principal components, the major and minor axes of the ellipsoids mentioned above are aligned, respectively, with the eigenvectors a1 and a2, while their lengths are given by √λ1 and √λ2 (considering the eigenvalues ordered in descending order). The plane identified by the first two axes is called the principal plane, and projecting the patterns on this plane means calculating their coordinates, given by

y1 = a1^T x        y2 = a2^T x        (1.6)

where x indicates the generic input pattern. The process can be extended to all M principal axes, mutually orthogonal, even if the graphic display becomes difficult beyond the three-dimensional (3D) representation. Often the graphical representation is useful at an exploratory level to observe how the patterns are grouped in the feature space, i.e., how the latter are related to each other. For example, in the classification of multispectral images, it may be useful to explore different 2D projections of the principal components to see how the ellipsoidal structures are arranged to separate the homogeneous pattern classes (as informally anticipated with Fig. 1.4). The elongated shape of an ellipsoid indicates that one axis is very short with respect to the other; it informs us of the little variability of that component in that direction, and consequently, projecting all the patterns onto the longer axis, we get the least loss of information. This last aspect, the variability of the new components, is connected to the problem of the variability of the input data, which can be different both in terms of the kind of measurement (for example, in the area-weight case, the first is a measure linked to length, the second indicates a measure of the force of gravity) and in terms of the dynamics of the range of variability, even when expressed in the same unit of measurement. The different variability of the input data tends to influence the first principal components, thus distorting the exploratory analysis. The solution to this problem is obtained by applying a standardization procedure to the input data before applying the transform to the principal components. This procedure consists in transforming the original data xi, i = 1, . . . , N into the normalized data zi as follows:

zij = (xij − μj) / σj        (1.7)


where μj and σj are, respectively, the mean and standard deviation of the j-th feature, that is,

μj = (1/N) Σ_{i=1}^{N} xij        σj = sqrt( (1/(N−1)) Σ_{i=1}^{N} (xij − μj)² )        (1.8)

In this way, each feature has the same zero mean and the same unit standard deviation, and if we calculate the covariance matrix Kz of the standardized data, it coincides with the correlation matrix Rx. Each element rjk of the latter represents the normalized covariance between two features, called Pearson's correlation coefficient (a measure of linear relation) between the features xj and xk, obtained as follows:

rjk = Cov(xj, xk) / (σxj σxk) = (1/(N−1)) Σ_{i=1}^{N} zij zik        (1.9)

By virtue of Hölder's inequality [5], it is shown that the correlation coefficients rjk satisfy |rjk| ≤ 1; in particular, the elements rkk of the principal diagonal are all equal to 1 and represent the variance of a standardized feature. A large absolute value of rjk corresponds to a strong linear relationship between the two features. For |rjk| = 1, the values of xj and xk lie exactly on a line (with positive slope if the coefficient is positive; with negative slope if the coefficient is negative). This property of the correlation matrix explains the invariance under change of unit of measure (scale invariance), unlike the covariance matrix. Another property of the correlation coefficient is that if the features are independent of each other (E[xj xk] = μxj μxk), then rjk = 0, while the converse does not hold: Cov(xj, xk) = 0 does not imply that the features are independent. Having two operating modes available, with the original data and with standardized data, the principal components can be calculated, respectively, by diagonalizing the covariance matrix (sensitive to changes of scale) or the correlation matrix (normalized data). This choice must be carefully evaluated based on the nature of the available data, considering that principal component analysis leads to different results when using the two matrices on the same data. A reasonable criterion could be to standardize the data when they are very different in terms of scale. For example, in the case of multispectral images, a feature represented by a band with broad dynamics can be in contrast with another representing a band with a very restricted dynamic range. Instead, having homogeneous data available for the various features, it would be better to explore with principal component analysis without performing data standardization. A direct advantage offered by operating with the correlation matrix R is the possibility of comparing data of the same nature but acquired at different times, for example, the classification of multispectral images relating to the same territory, acquired with the same platform, but at different times. We now return to the interpretation of the principal components (in the literature also called latent variables). The latter term is used precisely to express


the impossibility of giving a direct formalization and meaning to the principal components (for example, the creation of a model). In essence, the mathematical tool indicates the directions of the components where the information is significantly concentrated, but the interpretation of these new components is left to the analyst. For a multispectral image, the features represent the spectral information associated with the various bands (in the visible, in the infrared, ...); when projected into the principal components, we know mathematically that the original physical data are redistributed and represented by new hidden variables which have lost the original physical meaning, even if, by evaluating the explained variances, they give us quantitative information on the energetic content of each component in this new space. In various applications, this property of the principal components is used to reduce the dimensionality of the data by considering the first p most significant components of the M original dimensions. For example, the first 3 principal components can be used as the components of a suitable color space (RGB, ...) to display a multispectral image of dozens of bands (see the example shown in Sect. 2.10.1 Vol. II). Another example concerns image compression, where the analysis of the principal components is strategic to evaluate the feasible compression level, even if the data compression is then performed with computationally more efficient transforms (see Chap. 2 Vol. II). In the context of clustering, the analysis of principal components is useful for selecting the most significant features and eliminating the redundant ones. The analyst can decide the acceptable percentage d of explained variance, keeping the first p components, computed (considering Eq. 1.5) with the following ratio:

d = 100 · (Σ_{k=1}^{p} λk) / (Σ_{k=1}^{M} λk)        (1.10)

Once the number of components p that contribute to maintaining the percentage d of the total variance has been calculated, the data are transformed (projected into a space with p < M dimensions) by applying Eq. (1.3) to the first p components, using the first p eigenvectors ak, k = 1, . . . , p of the transformation matrix Ap, which now has M × p dimensions. In Sect. 2.10.1 Vol. II, we demonstrated the efficacy of PCA applied to multispectral images (5 bands), where the first principal component manages to maintain more than 90% of the total variance, due also to the high correlation of the bands. Other heuristic approaches are proposed in the literature to select the principal components. The so-called Kaiser rule [6] proposes to select only the components associated with eigenvalues λi ≥ 1 (these almost always correspond to the first two components). This criterion is often used with a threshold of 0.7 instead of 1, to select more components given the variability of the observed samples. A more practical method is to calculate the mean of the variances, that is, the mean value of the eigenvalues λ̄ = (1/M) Σ_{k=1}^{M} λk, and select the first p components whose variance exceeds this average, with p the largest value of k such that λk > λ̄.


Another approach is to plot the eigenvalues (on the ordinate) against their order of extraction and to choose as p the number of most significant components at which an abrupt change of slope occurs, with the rest of the graph almost flat.
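The following NumPy sketch summarizes, under the assumptions of this section, the selection pipeline just described: optional standardization (Eq. 1.7), diagonalization of the covariance (or correlation) matrix, the explained-variance percentage of Eq. (1.10), and the Kaiser and mean-eigenvalue rules. The data are synthetic and the function name is our own, not taken from any particular library.

```python
import numpy as np

def pca_select(X, d_percent=90.0, standardize=True):
    """Project X onto the first p principal components, where p is the
    smallest number of components explaining at least d_percent of the
    total variance (Eq. 1.10). Returns (Y_p, eigenvalues, p)."""
    if standardize:                        # Eq. (1.7): zero mean, unit std per feature
        Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    else:
        Xc = X - X.mean(axis=0)
    K = np.cov(Xc, rowvar=False)           # covariance (= correlation if standardized)
    eigval, eigvec = np.linalg.eigh(K)      # symmetric matrix -> real eigenpairs
    order = np.argsort(eigval)[::-1]        # sort eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]

    explained = 100.0 * np.cumsum(eigval) / eigval.sum()   # Eq. (1.10) for p = 1..M
    p = int(np.searchsorted(explained, d_percent) + 1)

    print("explained variance (%):", np.round(explained, 1))
    print("Kaiser rule keeps:", int((eigval >= 1.0).sum()), "components")
    print("mean-eigenvalue rule keeps:", int((eigval > eigval.mean()).sum()))
    return Xc @ eigvec[:, :p], eigval, p    # Eq. (1.3) restricted to p components

# Synthetic, strongly correlated 4-feature data (illustrative only)
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
X = np.column_stack([Z[:, 0],
                     2.0 * Z[:, 0] + 0.1 * rng.normal(size=200),
                     Z[:, 1],
                     -Z[:, 1] + 0.1 * rng.normal(size=200)])

Y_p, eigval, p = pca_select(X, d_percent=90.0)
print("kept", p, "of", X.shape[1], "components; projected shape:", Y_p.shape)
```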

1.5 Interactive Method

A very simple interactive classification method is to consider the feature space as a look-up table. At each point in the feature space, a number is stored that identifies the class of the object. In this case, the number of classes is interactively determined by the expert in the feature space, assuming that the corresponding clusters are well separated. In essence, the expert uses 2D or 3D projections of the selected significant features (for example, using the principal components) to visually evaluate the level of class separation. The method is direct and very fast. The classes are identified in the feature space and interactively delimited with closed rectangles or polygons (in 2D projections), as in the case of Fig. 1.3, assuming strongly uncorrelated features. With this approach, it is the expert who interactively decides, in the space of significant features, how to partition and delimit the clusters (e.g., the decision boundaries) by choosing the most effective 2D projections. In this way, the expert directly defines the association between homogeneous patterns and the class to which they belong (i.e., the label to be assigned to each type of territory). Figure 1.6 shows the functional scheme of this interactive method to directly produce a thematic map of territorial classification. This approach proves useful in the preprocessing phase of multispectral images to extract the sample areas (e.g., training set) belonging to parts of the territory (vegetation, water, bare soil, ...) to be used later in the automatic classification methods. A typical 2D projection is shown in Fig. 1.4, where we can see how the class of A patterns is easily separable, while there is a greater correlation between the features x1 and x2 for the patterns of class B (very correlated, well distributed along the positive diagonal line) and a lower correlation for the class C patterns. Although with some errors due to the slight overlapping of the clusters, it would be possible to separate the B and C classes to extract samples to be used as a training set for subsequent elaborations.
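A minimal sketch of the look-up-table idea described above, assuming two features already quantized to integer levels (for example, two 8-bit spectral bands). The rectangular class limits are hypothetical values standing in for the regions an expert would delimit interactively; a real application would take them from the interactive delimitation of the clusters.

```python
import numpy as np

L = 256                                   # number of quantization levels per feature
lut = np.zeros((L, L), dtype=np.uint8)    # 0 = unclassified

# Hypothetical rectangles delimited interactively in the (x1, x2) feature space:
# class label -> (x1_min, x1_max, x2_min, x2_max)
class_boxes = {
    1: (0,   60,  0,   80),    # e.g., bare terrain
    2: (60,  120, 0,   80),    # e.g., river
    3: (0,   60,  80, 180),    # e.g., shoreline
    4: (120, 255, 80, 255),    # e.g., sea
}
for label, (a, b, c, d) in class_boxes.items():
    lut[a:b + 1, c:d + 1] = label

# Classify a two-band image: each pixel's (x1, x2) pair indexes the table
rng = np.random.default_rng(2)
band1 = rng.integers(0, L, size=(4, 4))
band2 = rng.integers(0, L, size=(4, 4))
thematic_map = lut[band1, band2]
print(thematic_map)
```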

Fig. 1.6 Functional scheme of the interactive deterministic method. a Spatial domain represented by two bands (x1, x2); b 2D feature space where the expert checks how the patterns cluster in nonoverlapping clusters; c definition of the look-up table after having interactively defined the limits of separation of the classes in the feature domain; d thematic map, obtained using the values of the features x1, x2 of each pixel P as a pointer to the 2D look-up table, where the classes to be associated are stored

1.6 Deterministic Method

More generally, we can think of a deterministic classifier as an operator that receives as input a pattern x with M features and outputs a single value yr identifying (labeling) the pattern among the R possible classes ωr, r = 1, . . . , R. This

operator can be defined in the form:

d(x) = yr        (1.11)

where d(x) is called the decision function or discriminant function. While in the interactive classification method the regions associated with the different classes were defined by the user observing the data projections in the feature domain, with the deterministic method these regions are instead delimited by decision functions that are defined by analyzing sample data for each observable class. In practice, the decision functions divide the feature space into R disjoint regions associated with the classes ωr, r = 1, . . . , R, each of which constitutes the subset of the M-dimensional pattern vectors x for which the decision d(x) = yr is valid. The classes ωr are separated by discriminating hypersurfaces. In relation to the R classes ωr, the discriminating hypersurfaces can be defined by means of the scalar functions dr(x), which are precisely the discriminant functions of the classifier, with the following property:

dr(x) > ds(x)  ⇒  x ∈ ωr        s = 1, . . . , R;  s ≠ r        (1.12)

The discriminating hypersurface is given by the following:

dr(x) − ds(x) = 0        (1.13)

That said, with (1.12), a pattern vector x is associated with the class with the largest value of the discriminant function:

d(x) = yr  ⇐⇒  dr(x) = arg max_{s=1,...,R} ds(x)        (1.14)

Various discriminant functions are used in the literature: linear (where d(x) is defined as a linear combination of the features xj) and nonlinear multiparametric (defined as d(x, γ), where γ represents the parameters of the model d, to be determined in the


training phase, the sample patterns being known, as is the case for the multilevel perceptron). Discriminant functions can also be considered as a linear regression of the data to a model where y is the class to be assigned (the dependent variable) and the regressors are the pattern vectors (the independent variables).
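A compact sketch of the decision rule (1.14) applied with linear discriminant functions dr(x) = wr^T x on augmented pattern vectors. The weight matrix here is arbitrary and only illustrates the mechanics of the rule; how the weights are actually obtained (training) depends on the specific method discussed in the following sections.

```python
import numpy as np

def classify(X, W):
    """Assign each pattern to the class with the largest discriminant value.

    X: (N, M) patterns; W: (R, M+1) one weight row per class, the last
    weight multiplying the constant 1 of the augmented pattern (rule 1.14).
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors
    D = X_aug @ W.T                                    # D[n, r] = d_r(x_n)
    return np.argmax(D, axis=1)                        # index of the winning class

# Illustrative weights for R = 3 classes and M = 2 features
W = np.array([[ 1.0,  0.0, -0.5],
              [-1.0,  1.0,  0.0],
              [ 0.0, -1.0,  0.3]])
X = np.array([[2.0, 0.1], [-1.0, 2.0], [0.2, -3.0]])
print(classify(X, W))        # class index r maximizing d_r(x) for each pattern
```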

1.6.1 Linear Discriminant Functions

The linear discriminant functions are the simplest and are normally the most used. They are obtained as a linear combination of the features of the patterns x:

dr(x) = wr^T x = Σ_{i=1}^{M+1} wr,i xi > 0,    x ∈ ωr        (1.15)

Do

Dz

Z

hyperplane d(x)=0 d(x) 0, while for every x ∈ ω2 results d(x) < 0 as shown in Fig. 1.8a. In essence, d(x) results in the linear discriminant function of the class ω1 . More generally, we can say that the set of R classes are absolutely separable if each class ωr , r = 1, . . . , R is linearly separated from the remaining pattern classes (see Fig. 1.8b). For M = 3, the linear discriminant function is represented by the plane and for M > 3 by the hyperplane.

1.6.2 Generalized Discriminant Functions So far, we have considered linear discriminant functions to separate pairs of classes directly in the features domain. Separation with more complex class configurations can be done using piecewise linear functions. Even more complex situations cannot be solved with linear functions (see Fig. 1.8c). In these cases, the class boundaries can be described with the generalized linear discriminant functions given in the following form: d(x) = w1 φ1 (x) + · · · + w M φ M (x) + w M+1

(1.18)


where φi(x) are the M scalar functions associated with the pattern x with M features (x ∈ R^M). In vector form, introducing the augmented vectors w and z, with the substitution of the original variable x by zi = φi(x), we have

d(x) = Σ_{i=1}^{M+1} wi φi(x) = w^T z        (1.19)

where z = (φ1(x), . . . , φM(x), 1)^T is the vector function of x and w = (w1, . . . , wM, wM+1)^T. The discriminant function (1.19) is linear in the zi through the functions φi (i.e., in the new transformed variables) and not in the measures of the original features xi. In essence, by transforming the input patterns x, via the scalar functions φi, into the new augmented domain of M + 1 features zi, the classes can be separated by a linear function as described in the previous paragraph. In the literature, several functions φi have been proposed to linearly separate patterns. The most common are the polynomial, quadratic, radial basis,5 and multilevel perceptron discriminant functions. For example, for M = 2, the generalized quadratic discriminant function results in

d(x) = w1 x1² + w2 x1 x2 + w3 x2² + w4 x1 + w5 x2 + w6        (1.20)

with w = (w1, . . . , w6)^T and z = (x1², x1 x2, x2², x1, x2, 1)^T. The number of weights, that is, the free parameters of the problem, is 6. For M = 3, the number of weights is 10, with the pattern vector z having 10 components. As M increases, the number of components of z becomes very large.
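The sketch below implements the mapping z = (x1², x1x2, x2², x1, x2, 1)^T of Eq. (1.20) and shows that a single linear weight vector in the transformed space produces a quadratic decision boundary in the original (x1, x2) space. The weight values are hypothetical; in practice they would come from a training procedure.

```python
import numpy as np

def phi_quadratic(x):
    """Map a 2D pattern x = (x1, x2) to the augmented quadratic space (Eq. 1.20)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2, x1, x2, 1.0])

# Hypothetical weights w = (w1, ..., w6): here d(x) = x1^2 + x2^2 - 4,
# i.e. a circular boundary of radius 2 in the original feature space
w = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -4.0])

for x in [np.array([0.5, 0.5]), np.array([3.0, 1.0])]:
    d = w @ phi_quadratic(x)
    label = "omega_1 (d > 0)" if d > 0 else "omega_2 (d < 0)"
    print(f"x = {x}, d(x) = {d:+.2f} -> {label}")
```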

1.6.3 Fisher’s Linear Discriminant Function Fisher’s linear discriminant analysis (known in the literature as Linear Discriminant Analysis—LDA [8] represents an alternative approach to finding a linear combination of features for object classification. LDA like PCA is based on a linear transformation data to extract relevant information and reduce the dimensionality of the data. But, while the PCA represents the best linear transformation to reduce the dimensionality of the data projecting them in the direction of maximum variance on the new axes, without however obtaining a better representation of the new useful features on the separability of the classes, LDA on the other hand projects the data into the new space such as to improve the separability of the classes by maximizing the ratio between the inter-class variance and the intra-class variance (see Fig. 1.9). As shown in the figure, in the case of two classes, the basic idea is to project the sample patterns X on a line where the classes are better separated. Therefore, if v

5 A function of real variables and real values depending exclusively on the distance from a fixed point, called the centroid xc. An RBF function is expressed in the form φ : R^M → R such that φ(x) = φ(|x − xc|).



Fig. 1.9 Fisher linear discriminant function. a Optimal projection line to separate two classes of patterns ω1 and ω2; b nonoptimal line of separation of the two classes, where the partial overlap of different patterns is noted

Therefore, if v is the vector that defines the orientation of the line onto which the patterns xi, i = 1, . . . , N are projected, the goal is to find for every xi a scalar value yi that represents the distance from the origin of its projection on the line. This distance is given by

yi = v^T xi        i = 1, . . . , N        (1.21)

The figure shows the projection of a sample in the one-dimensional case. Let us now see how to determine v, i.e., the optimal direction of the sample projection line that best separates the K classes ωk, each consisting of nk samples. In other words, we need to find v so that, after the projection, the ratio between the inter-class variance and the intra-class variance is maximized.

1.6.3.1 Fisher Linear Discrimination—2 Classes

Using the same symbolism used so far, we initially consider a data set of samples X = {x1, . . . , xN} with two classes, of which we can calculate the means μ1 and μ2, each class having n1 and n2 samples, respectively. Applying (1.21) to the samples of the two classes, we get the average distance of the projections μ̂k, as follows:

μ̂k = (1/nk) Σ_{xi∈ωk} yi = (1/nk) Σ_{xi∈ωk} v^T xi = v^T μk        k = 1, 2        (1.22)

A criterion for defining a separation measure of the two classes consists in considering the distance (see Fig. 1.9) between the projected averages |μ̂2 − μ̂1|, which represents the inter-class distance (measure of separation):

J(v) = |μ̂2 − μ̂1| = |v^T (μ2 − μ1)|        (1.23)


thus obtaining an objective function J(v) to be maximized, dependent on the vector v. However, the distance between projected averages is not a robust measure since it does not take into account the standard deviation of the classes or their dispersion level. Fisher suggests maximizing the inter-class distance after normalizing with the dispersion (scatter) information of each class. Therefore, a dispersion measure Ŝk², analogous to the variance, is defined for each class ωk, given by

Ŝk² = Σ_{yi∈ωk} (yi − μ̂k)²        k = 1, 2        (1.24)

The measure of dispersion (Ŝ1² + Ŝ2²) obtained is called the intra-class dispersion (within-class scatter) of the samples projected in the direction v, in this case with two classes. The Fisher linear discriminant criterion is given by the linear function defined by (1.21), which projects the samples on the line in the direction v and maximizes the following function:

J(v) = |μ̂2 − μ̂1|² / (Ŝ1² + Ŝ2²)        (1.25)

The goal of (1.25) is to make the samples of each class compact after projection (that is, to have very small Ŝk²) and simultaneously to project their centroids as far apart as possible (i.e., a very large distance |μ̂2 − μ̂1|²). This is achieved by finding a vector v* which maximizes J(v) through the following procedure.

1. Calculation of the dispersion matrices Sk in the original feature space x:

   Sk = Σ_{xi∈ωk} (xi − μk)(xi − μk)^T        k = 1, 2        (1.26)

   from which we obtain the intra-class dispersion matrix SV = S1 + S2.

2. The dispersion of the projections y can be expressed as a function of the dispersion matrices Sk in the feature space x, as follows:

   Ŝk² = Σ_{yi∈ωk} (yi − μ̂k)²                          [by (1.24)]
       = Σ_{xi∈ωk} (v^T xi − v^T μk)²                   [by (1.21) and (1.22)]
       = Σ_{xi∈ωk} v^T (xi − μk)(xi − μk)^T v
       = v^T Sk v                                        (1.27)

   where Sk is the dispersion matrix in the original feature space. From (1.27), we get

   Ŝ1² + Ŝ2² = v^T SV v        (1.28)


3. Definition of the inter-class dispersion matrix in the original space, given by

   SB = (μ1 − μ2)(μ1 − μ2)^T        (1.29)

   which includes the separation measure between the centroids of the two classes before the projection. It is observed that SB is obtained from the outer product of two vectors and has rank at most one.

4. Calculation of the difference between the centroids, after the projection, expressed in terms of the averages in the original feature space:

   (μ̂1 − μ̂2)² = (v^T μ1 − v^T μ2)² = v^T (μ1 − μ2)(μ1 − μ2)^T v = v^T SB v        (1.30)

5. Reformulation of the objective function (1.25) in terms of the dispersion matrices SV and SB, expressed as follows:

   J(v) = |μ̂2 − μ̂1|² / (Ŝ1² + Ŝ2²) = (v^T SB v) / (v^T SV v)        (1.31)

6. Finding the maximum of the objective function J(v). This is achieved by differentiating J with respect to the vector v and setting the result to zero:

   d/dv J(v) = d/dv [ (v^T SB v) / (v^T SV v) ]
             = ( [v^T SV v] d[v^T SB v]/dv − [v^T SB v] d[v^T SV v]/dv ) / (v^T SV v)²
             = ( [v^T SV v] 2 SB v − [v^T SB v] 2 SV v ) / (v^T SV v)² = 0        (1.32)
   ⟹ [v^T SV v] 2 SB v − [v^T SB v] 2 SV v = 0

   Dividing the last expression by v^T SV v, we have

   ( (v^T SV v)/(v^T SV v) ) SB v − ( (v^T SB v)/(v^T SV v) ) SV v = 0
   ⟹ SB v − J(v) SV v = 0
   ⟹ SV^{-1} SB v − J(v) v = 0        (1.33)

7. Solving the problem with the generalized eigenvalue method using Eq. (1.33), provided SV has full rank (so that its inverse exists). Solving for v, its optimal value v* is obtained as follows:

   v* = arg max_v J(v) = arg max_v (v^T SB v)/(v^T SV v) = SV^{-1} (μ1 − μ2)        (1.34)


Fig. 1.10 Fisher’s multiclass linear discriminant function with 3-class example and samples with 2D features


We have thus obtained with (1.34) the Fisher linear discriminant function, although more than a discriminant it is rather an appropriate choice of the direction of the one-dimensional projection of the data.
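A minimal NumPy sketch of the two-class procedure just described: it builds SV from the class dispersion matrices (Eq. 1.26) and obtains the optimal direction v* = SV^{-1}(μ1 − μ2) of Eq. (1.34). The data are synthetic and only illustrate the computation.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Optimal 1D projection direction for two classes (Eq. 1.34)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)      # class dispersion matrices, Eq. (1.26)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SV = S1 + S2                        # intra-class dispersion matrix
    v = np.linalg.solve(SV, mu1 - mu2)  # v* = SV^{-1} (mu1 - mu2)
    return v / np.linalg.norm(v)

# Two synthetic Gaussian classes, elongated along the same direction
rng = np.random.default_rng(3)
C = np.array([[2.0, 1.5], [1.5, 2.0]])
X1 = rng.multivariate_normal([0.0, 0.0], C, size=200)
X2 = rng.multivariate_normal([3.0, 1.0], C, size=200)

v = fisher_direction(X1, X2)
y1, y2 = X1 @ v, X2 @ v                 # projections (Eq. 1.21)
print("v* =", np.round(v, 3))
print("projected class means:", y1.mean(), y2.mean())
```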

1.6.3.2 Fisher Linear Discrimination—C Classes

Fisher's LDA extension to C classes can be generalized with good results. Assume we have a dataset X = {x1, x2, . . . , xN} of N samples with d dimensions belonging to C classes. In this case, instead of a single projection y, we have C − 1 projections y = {y1, y2, . . . , yC−1} onto linear subspaces, given by

yi = vi^T x        ⟹        Y_{(C−1)×N} = (V_{d×(C−1)})^T X_{d×N}        (1.35)

where the second equation expresses in compact matrix form the C − 1 projections generated by the C − 1 projection vectors vi assembled as the C − 1 columns of the projection matrix V. Let us now see how the equations seen above for LDA are generalized to C classes (Fig. 1.10 presents an example with 3 classes and samples with 2D features).

1. The intra-class dispersion matrix for 2 classes, SV = S1 + S2, is generalized for C classes as

   SV = Σ_{i=1}^{C} Si        (1.36)

   where

   Si = Σ_{xj∈ωi} (xj − μi)(xj − μi)^T        μi = (1/ni) Σ_{xj∈ωi} xj        (1.37)


with ni indicating the number of samples in the class ωi.

2. The inter-class dispersion matrix for 2 classes, Eq. (1.29), which measures the distance between classes considering the centroids (i.e., the class means μi), is generalized as

   SB = Σ_{i=1}^{C} ni (μi − μ)(μi − μ)^T        with        μ = (1/N) Σ_{j=1}^{N} xj = (1/N) Σ_{i=1}^{C} ni μi        (1.38)

   where μi = (1/ni) Σ_{xj∈ωi} xj is the mean of the samples of each class ωi, each with ni samples, while N is the total number of samples. The total dispersion matrix is given by ST = SB + SV.

3. Similarly, the mean vector μ̂i of the projected samples y of each class and the total mean μ̂ are defined:

   μ̂i = (1/ni) Σ_{yj∈ωi} yj        μ̂ = (1/N) Σ_{i=1}^{C−1} yi        (1.39)

4. The dispersion matrices of the projected samples y result in

   ŜV = Σ_{i=1}^{C} Ŝi = Σ_{i=1}^{C} Σ_{yj∈ωi} (yj − μ̂i)(yj − μ̂i)^T        ŜB = Σ_{i=1}^{C} ni (μ̂i − μ̂)(μ̂i − μ̂)^T        (1.40)

5. In the 2-class approach, we expressed the dispersion matrices of the projected samples in terms of the original samples. For C classes, we have

   ŜV = V^T SV V        (1.41)

   Similarly, the dispersion matrix ŜB = V^T SB V remains valid for LDA with C classes.

6. Our goal is to find a projection that maximizes the ratio between inter-class and intra-class dispersion. Since the projection is no longer one-dimensional but has C − 1 dimensions, the determinant of the dispersion matrices is used to obtain a scalar objective function, as follows:

   J(V) = |ŜB| / |ŜV| = |V^T SB V| / |V^T SV V|        (1.42)

   It is now necessary to find the projections defined by the column vectors of the projection matrix V, that is, a projection matrix V* that maximizes the objective function J(V).



Fig. 1.11 Application of LDA for a 2D dataset with 2 and 3 classes. a Calculation of the linear discriminant projection vector for the 2-class example. In the figure, produced with MATLAB, the main component of PCA is reported together with the FDA projection vector. It is highlighted how FDA separates the classes better than the PCA, whose principal component is more oriented to capture the greater variance of the data distribution. b Calculation of the 2 projection vectors for the example with 3 classes

7. Calculation of the matrix V*. In analogy to the 2-class case, the maximum of J(V) is found by differentiating the objective function (1.42) and setting the result to zero. Subsequently, the problem is solved with the eigenvalue method by generalizing Eq. (1.33) previously obtained for 2 classes. It is shown that the optimal projection matrix V* is the matrix whose columns are the eigenvectors that correspond to the largest eigenvalues of the following generalization (the analogue of Eq. 1.33) of the eigenvalue problem:

   V* = [v1*|v2*| · · · |v*_{C−1}] = arg max_V |V^T SB V| / |V^T SV V|        ⟹        SV^{-1} SB vi* = λi vi*        (1.43)

where λi = J(vi*) with i = 1, 2, ..., C − 1. From (1.43), we have that if SV is invertible (i.e., a non-singular matrix), then the Fisher objective function is maximized when the projection matrix V* has as its columns the eigenvectors associated with the largest eigenvalues λi of SV^{-1} SB. The matrix SB is the sum of C matrices of rank ≤ 1, so that its rank is ≤ (C − 1), and consequently only C − 1 eigenvalues will be different from zero. Figure 1.11 shows the application of Fisher's discriminant analysis (FDA) to two datasets with 2 features but with 2 and 3 classes, respectively. Figure (a) also shows the principal component of the PCA applied to the same 2-class dataset. As previously highlighted, PCA tends to project data in the direction of maximum variance, which is useful for concentrating the data information on as few components as possible, while it is less useful for separating classes. FDA, on the other hand, determines the projection vectors where the data are better separated and therefore more useful for classification. Let us now analyze some limitations of the LDA. A first aspect concerns the reduction of the dimensionality of the data, which is only to C − 1 dimensions, unlike the PCA, which can reduce the dimensionality down to a single feature. For complex data, not even the best one-dimensional projection can separate samples of different classes. Similarly


to the PCA, if the classes are very large, even with a very large J(v) value, the classes can have large overlaps on any projection line. LDA is in fact a parametric approach, in that it assumes an essentially Gaussian and unimodal distribution of the samples. For the classification problem, if the distributions are significantly non-Gaussian, the LDA projections will not be able to correctly separate complex data (see Fig. 1.8c). In the literature, there are several variants of LDA [9,10] (nonparametric, orthonormal, generalized, and in combination with neural networks).
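For the C-class case, the sketch below solves the eigenvalue problem SV^{-1} SB v = λ v of Eq. (1.43) directly with NumPy and keeps the C − 1 eigenvectors with the largest eigenvalues as the columns of V*. It assumes SV is invertible, as required in the text; regularized or nonparametric variants are not shown, and the data are synthetic.

```python
import numpy as np

def lda_projection(X, y):
    """Multi-class LDA projection matrix V* (d x (C-1)), Eq. (1.43)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    SV = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        SV += (Xc - mu_c).T @ (Xc - mu_c)                    # Eq. (1.36)-(1.37)
        diff = (mu_c - mu).reshape(-1, 1)
        SB += len(Xc) * (diff @ diff.T)                      # Eq. (1.38)
    eigval, eigvec = np.linalg.eig(np.linalg.inv(SV) @ SB)   # Eq. (1.43)
    order = np.argsort(eigval.real)[::-1]
    return eigvec[:, order[:len(classes) - 1]].real          # V*: d x (C-1)

# Three synthetic 2D classes -> a 2 x 2 projection matrix (C - 1 = 2 columns)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ([0, 0], [3, 1], [1, 4])])
y = np.repeat([0, 1, 2], 100)

V = lda_projection(X, y)
Y = X @ V            # patterns projected into the discriminant subspace
print(V.shape, Y.shape)
```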

1.6.4 Classifier Based on Minimum Distance

1.6.4.1 Single Prototypes

Let us say we know a prototype pattern pr for each of the R classes ωr, r = 1, . . . , R of patterns with M features. A classifier based on minimum distance assigns (classifies) a generic pattern x to the class ωr whose prototype pr is at the minimum distance:

d(x) = ωr        ⇐⇒        |x − pr| = min_{i=1,...,R} |x − pi|        (1.44)

If there are several minimum candidates, the pattern x is assigned to the class ωr corresponding to the first r-th found. This classifier can be considered a special case of a classifier based on discriminant functions. In fact, if we consider the Euclidean distance Di between the generic pattern x and the prototype pi, this pattern is assigned to the class ωi which satisfies the relation Di < Dj for all j ≠ i. Finding the minimum of Di is equivalent to finding the minimum of Di² (the distances being positive), for which we have

Di² = |x − pi|² = (x − pi)^T (x − pi) = x^T x − 2 (x^T pi − (1/2) pi^T pi)        (1.45)

where in the final expression the term x^T x can be neglected, being independent of the index i, and the problem is reduced to maximizing the expression in brackets (•). It follows that we can express the classifier in terms of the following discriminant function:

di(x) = x^T pi − (1/2) pi^T pi        i = 1, . . . , R        (1.46)

i = 1, . . . , R

(1.47)

where x is given in the form of an augmented vector (x_1, ..., x_M, 1)^T, while the weights w_i = (w_{i1}, ..., w_{iM}, w_{i,M+1}) are determined as follows:

w_{ij} = p_{ij},    w_{i,M+1} = −½ p_i^T p_i    i = 1, ..., R;  j = 1, ..., M   (1.48)


Fig. 1.12 Minimum distance classifier with only one prototype p_i per class. The pattern x to be classified is assigned to the class ω_2, being closest to the prototype p_2 at distance D_2. The discriminant lines associated with each pair of the three prototypes are also shown

It can be shown that the discriminant surface that separates each pair of prototype patterns p_i and p_j is the hyperplane that perpendicularly bisects the segment joining the two prototypes. Figure 1.12 shows an example of minimum distance classification for a single prototype with three classes. If the prototype patterns p_i coincide with the mean patterns μ_i of the classes, we have a minimum distance from the mean classifier.

1.6.4.2 Multiple Prototypes
The extension to multiple prototypes per class is immediate. In this case, we hypothesize that in the generic class ω_i a set of prototype patterns p_i^{(1)}, ..., p_i^{(n_i)}, i = 1, ..., R, are aggregated (classified), where n_i indicates the number of prototypes in the i-th class. The distance between a pattern x to classify and the generic class ω_i is given by

D_i = min_k ||x − p_i^{(k)}||    k = 1, ..., n_i   (1.49)

As before, also in this case the discriminant function is determined for the generic class ω_i as follows:

d_i(x) = max_{k=1,...,n_i} d_i^{(k)}(x)   (1.50)

where d_i^{(k)}(x) is a linear discriminant function given by

d_i^{(k)}(x) = x^T p_i^{(k)} − ½ (p_i^{(k)})^T p_i^{(k)}    i = 1, ..., R;  k = 1, ..., n_i   (1.51)

The pattern x is assigned to the class ω_i for which the discriminant function d_i(x) assumes the maximum value, i.e., d_i(x) > d_j(x) for each j ≠ i. In other words, the x


pattern is assigned to the class ω_i which has the closest prototype pattern. The linear discriminant functions given by (1.51) partition the feature space into Σ_{i=1}^{R} n_i regions, known in the literature as the Dirichlet tessellation.6

6 Also called the Voronoi diagram (from the name of Georgij Voronoi), it is a particular type of decomposition of a metric space, determined by the distances with respect to a given finite set of points of the space. For example, in the plane, given a finite set of points S, the Dirichlet tessellation for S is the partition of the plane that associates a region R(p) to each point p ∈ S, such that all points of R(p) are closer to p than to any other point in S.
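The following is a minimal sketch (not from the book) of the nearest-prototype rule of Eqs. (1.44) and (1.49) with one or more prototypes per class; the dictionary-based data layout and the toy prototype values are assumptions made only for illustration.

```python
import numpy as np

def classify_min_distance(x, prototypes):
    """Assign x to the class whose closest prototype is nearest (Eqs. 1.44/1.49).

    prototypes: dict mapping class label -> array of shape (n_i, M),
    one or more prototype patterns per class.
    """
    best_class, best_dist = None, np.inf
    for label, P in prototypes.items():
        d = np.min(np.linalg.norm(P - x, axis=1))   # distance to the closest prototype of the class
        if d < best_dist:
            best_class, best_dist = label, d
    return best_class, best_dist

# toy example: three classes, one or two prototypes each
prototypes = {
    "w1": np.array([[0.0, 0.0]]),
    "w2": np.array([[3.0, 1.0], [3.5, 1.5]]),
    "w3": np.array([[0.0, 4.0]]),
}
print(classify_min_distance(np.array([2.6, 1.2]), prototypes))   # expected: ('w2', ...)
```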

1.6.5 Nearest-Neighbor Classifier

This classifier can be considered as a generalization of the classification scheme based on the minimum distance. Suppose we have a training set of pairs (p_i, z_i), i = 1, ..., n, where p_i indicates a sample pattern whose class z_i = j is known a priori, that is, it is a sample of the class ω_j, j = 1, ..., R. If we denote by x the generic pattern to be classified, the Nearest-Neighbor (NN) classifier assigns it to the class of the i-th pair whose sample p_i is closest in terms of the chosen distance metric, namely

D_i = ||x − p_i|| = min_{1≤k≤n} ||x − p_k||  ⟹  x ∈ ω_{z_i}   (1.52)

A version of this classifier, called k-nearest neighbor (k-NN), operates as follows: (a) Determine, among the n sample–class pairs (p_i, z_i), the k samples closest to the pattern x to classify (always measuring distance with an appropriate metric). (b) The class to assign to x is the most representative class (the most voted class), that is, the class that has the greatest number of samples among the k nearest found. With the k-NN classifier, the probability of erroneous attribution of the class is reduced. Obviously, the choice of k must be adequate: a high value reduces the sensitivity to noise in the data, while a very low value reduces the possibility of extending the concept of proximity into the domain of other classes. Finally, it should be noted that as k increases, the probability of error of the k-NN classifier approaches that of the Bayes classifier, which will be described in the following paragraphs.
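A compact sketch of the k-NN voting rule just described is shown below; it is not the book's code, and the toy samples, labels, and function name are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """k-NN: vote among the k training samples closest to x (Euclidean metric)."""
    dists = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]            # most voted class

# toy example with two classes in 2-D
samples = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],
                    [3, 3], [3.2, 2.9], [2.8, 3.1]], dtype=float)
labels = np.array(["w1", "w1", "w1", "w2", "w2", "w2"])
print(knn_classify(np.array([2.5, 2.7]), samples, labels, k=3))  # expected: 'w2'
```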

1.6.6 K-means Classifier

The K-means method [11], also known as C-means clustering, is applied in different contexts, including the compression of images and speech signals and the recognition of thematic areas in satellite images. Compared to the previous supervised classifiers,


K-means does not have a priori knowledge of the patterns to be classified. The only information available is the number of classes k in which to group the patterns. So far, we have adopted a clustering criterion based on the minimum Euclidean distance to establish a similarity measure^7 between two patterns, to decide whether they are elements of the same class or not. Furthermore, this similarity measure can be considered relative by associating a threshold that defines the level of acceptability of a pattern as similar to another or as belonging to another class. K-means introduces a clustering criterion based on a performance index that minimizes the sum of the squares of the distances between all the points of each cluster and its own cluster center.
Suppose we have available the dataset X = {x_i}_{i=1}^{N}, consisting of N observations of an M-dimensional physical phenomenon. The goal is to partition the dataset into a number K of groups.^8 Each partition group is represented by a prototype that on average has an intra-class distance^9 smaller than the distances between the prototype of the group and an observation belonging to another group (inter-class distance). We then denote by μ_k an M-dimensional vector representing the prototype of the k-th group (with k = 1, ..., K). In other words, the prototype represents the center of the group. We are interested in finding the set of prototypes of the dataset X according to the aforementioned clustering criterion, so that the sum of the squares of the distances of each observation x_i from the nearest prototype is minimal. We now introduce a notation to define the way each observation is assigned to a prototype, with the binary variable r_ik ∈ {0, 1} indicating whether the i-th observation belongs to the k-th group (r_ik = 1) or to some other group (r_ik = 0). In general, we will have a binary membership matrix R of dimension N × K which indicates whether the i-th observation belongs to the k-th class. Suppose for now we have the K prototypes μ_1, μ_2, ..., μ_K (later we will see how to calculate them analytically); we will then say that an observation x belongs to the class ω_k if the following is satisfied:

||x − μ_k|| = min_{j=1,...,K} ||x − μ_j||  ⟹  x ∈ ω_k   (1.53)

At this point, temporarily assigning all the dataset patterns to the K clusters, one can evaluate the error made in choosing the prototype of each group by introducing a functional, named distortion measure of the data or total reconstruction error, with the following function:

J = Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik ||x_i − μ_k||²   (1.54)

7 Given two patterns x and y, a measure of similarity S(x, y) can be defined such that lim_{x→y} S(x, y) = 0 ⇒ x = y.
8 Suppose for the moment that we know the number K of groups to search for.
9 The distance between all the observations belonging to the same group and the representative prototype of the group.


The goal is then to find the values of {r_ik} and {μ_k} that minimize the objective function J. An iterative procedure is used which involves two different steps at each iteration. The cluster centers μ_k, k = 1, ..., K, are initialized randomly. Then, at each iteration, in a first phase J is minimized with respect to r_ik while keeping the centers μ_k fixed. In the second phase, J is minimized with respect to μ_k while keeping fixed the membership functions (also called characteristic functions of the classes) r_ik. These two steps are repeated until convergence is reached. We will see later that these two update steps of r_ik and μ_k correspond to the E (Expectation) and M (Maximization) steps of the EM algorithm. If the membership function r_ik is 1, it tells us that the vector x_i is closest to the center μ_k, that is, we assign each point of the dataset to the nearest cluster center as follows:

r_ik = { 1  if k = arg min_j ||x_i − μ_j||
         0  otherwise }   (1.55)

Since a given observation x can belong to only one group, the matrix R has the following property:

Σ_{k=1}^{K} r_ik = 1   ∀ i = 1, ..., N   (1.56)

and

Σ_{k=1}^{K} Σ_{i=1}^{N} r_ik = N.   (1.57)

We now derive the update formulas for r_ik and μ_k that minimize the function J. If we consider the optimization of μ_k with r_ik fixed, we can see that the function J in (1.54) is a quadratic function of μ_k, which can be minimized by setting its first derivative to zero:

∂J/∂μ_k = 2 Σ_{i=1}^{N} r_ik (x_i − μ_k) = 0   (1.58)

from which we get

μ_k = Σ_{i=1}^{N} r_ik x_i / Σ_{i=1}^{N} r_ik   (1.59)

Note that the denominator of (1.59) represents the number of points assigned to the k-th cluster; that is, μ_k is calculated as the average of the points that fall within the cluster. For this reason, the method is called K-means. So far we have described the batch version of the algorithm, in which the whole dataset is used in a single pass to update the prototypes, as described in Algorithm 1. A stochastic online version of the algorithm has been proposed in the


literature [12] by applying the Robbins–Monro procedure to the problem of searching for the roots of the regression function given by the derivatives of J with respect to μ_k. This allows us to formulate a sequential version of the update as follows:

μ_k^{new} = μ_k^{old} + η_i (x_i − μ_k^{old})   (1.60)

with η_i the learning parameter, which is monotonically decreased based on the number of observations that compose the dataset. Figure 1.13 shows the result of the quantization, or classification, of color pixels. In particular, Fig. 1.13a shows the original image and the following panels the result of the method for different numbers of prototypes K = 3, 5, 6, 7, 8. Each color indicates a particular cluster, so each original pixel has been replaced by the value (the RGB color triple) of the nearest prototype. The computational load is low, O(KNt), where t indicates the number of iterations, K the number of clusters, and N the number of patterns to classify. In general, we have K, t ≪ N.

Algorithm 1 K-means algorithm
1: Initialize the centers μ_k for 1 ≤ k ≤ K randomly
2: repeat
3:   for x_i with i = 1, ..., N do
4:     r_ik = 1 if k = arg min_j ||x_i − μ_j||, 0 otherwise
5:   end for
6:   for μ_k with k = 1, 2, ..., K do
7:     μ_k = Σ_{i=1}^{N} r_ik x_i / Σ_{i=1}^{N} r_ik
8:   end for
9: until convergence of parameters
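A minimal numpy sketch of the batch procedure of Algorithm 1 follows; it is not the book's implementation, and the initialization strategy, stopping test, and toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Batch K-means following Algorithm 1: alternate assignments (1.55) and means (1.59)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]            # step 1: random initial centers
    for _ in range(max_iter):
        # steps 3-5: assign each pattern to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2), axis=1)
        # steps 6-8: recompute each center as the mean of its assigned points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                          # step 9: convergence of parameters
            break
        mu = new_mu
    return mu, labels

# toy example: three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in ([0, 0], [4, 0], [2, 3])])
centers, labels = kmeans(X, K=3)
print(np.round(centers, 2))
```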

1.6.6.1 K-means Limitations
The solution found with the K-means algorithm, that is, the prototypes of the K classes, can be evaluated in terms of the distances between the dataset points and the prototypes themselves. This distance information provides a measure of the data distortion. Different solutions are obtained with the different initializations that can be set for the algorithm. The value of the distortion can guide toward the best solution found on the


Fig. 1.13 Classification of RGB pixels with the K-means method for the image in (a). In (b) for K = 3; (c) for K = 5; (d) for K = 6; (e) for K = 7; and (f) for K = 8

same dataset, i.e., the one with minimum distortion. It is often useful to proceed by trial and error, varying the number K of classes. In general, the algorithm converges in a dozen steps, although there is no rigorous proof of its convergence. It is also influenced by the order in which the patterns are presented. Furthermore, it is sensitive to noise and outliers: even a small number of the latter can substantially influence the mean value. It is not suitable when the cluster distribution has non-convex geometric shapes. Another limitation is given by the membership variables (also called responsibility variables z_it), which assign the i-th data point to the cluster t in a hard, or binary, way. In the fuzzy C-means, but also in the mixture of Gaussians dealt with below, these variables are treated as soft, with values that vary between zero and one.

1.6.7 ISODATA Classifier

The ISODATA classifier [13] (Iterative Self-Organizing Data Analysis Technique A^10) can be seen as an improvement of the K-means classifier. In fact, K-means can produce clusters that contain so few patterns as to be insignificant. Some clusters may also be very close to each other, and it can be useful to combine them (merging). Other clusters, on the other hand, can show very elongated geometric configurations, and in this case it can be useful to divide them into two new clusters (splitting) on the basis of predefined criteria, such as the use of a threshold on the standard deviation calculated for each cluster and on the distance calculated between cluster centers [14].

10 The final A is added to simplify the pronunciation.


The ISODATA procedure requires several input parameters: the approximate desired number K of clusters, the minimum number N_minc of patterns per cluster, the maximum standard deviation σ_s to be used as the threshold for splitting clusters, the maximum distance D_union permissible for merging, and the maximum number N_mxunion of clusters eligible for merging. The essential steps of the iterative ISODATA procedure are as follows:

1. Apply the K-means clustering procedure.
2. Delete clusters with few patterns, according to N_minc.
3. Merge pairs of clusters that are very close to each other, according to D_union.
4. Divide large clusters, according to σ_s, presumably containing dissimilar patterns.
5. Iterate the procedure starting from step 1, or end if the maximum number of admissible iterations is reached.

The ISODATA classifier has been applied with good results to multispectral images with a high number of bands. The heuristics adopted to limit insignificant clusters, together with the ability to split dissimilar clusters and merge similar ones, make the classifier very flexible and effective. The problem remains that geometrically curved clusters are difficult to manage even with ISODATA. Obviously, the initial parameters must be tuned with different attempts, repeating the procedure several times. Like K-means, ISODATA does not guarantee convergence a priori, even if, in real applications with clusters that do not overlap much, convergence is obtained after dozens of iterations.

1.6.8 Fuzzy C-means Classifier

The Fuzzy C-Means (FCM) classifier is the fuzzy version of K-means and is characterized by the fuzzy theory, which imposes three conditions:

1. Each pattern can belong, to a certain degree, to multiple classes.
2. The sum of the degrees to which each pattern belongs to all clusters must be equal to 1.
3. The sum of the memberships of all the patterns in each cluster cannot exceed N (the number of patterns).

The fuzzy version of K-means proposed by Bezdek [15], also known as Fuzzy ISODATA, differs from the previous one in the membership function. In this algorithm, each pattern x has a membership function r of the soft type, i.e., it is not binary but defines the degree to which the data point belongs to each cluster. This algorithm partitions the dataset X = {x_i}_{i=1}^{N} of N observations into K fuzzy groups and finds the cluster centers, in a way similar to K-means, such as to minimize the similarity cost function J. The partitioning of the dataset is thus done in a fuzzy way, so as to have for each data point x_i ∈ X a membership value to each cluster between 0 and 1. Therefore, the membership matrix R is not binary, but has values between 0 and 1. In any case,


the condition (1.56) is imposed. So the objective function becomes

J = Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik^m ||x_i − μ_k||²   (1.61)

with m ∈ [1, ∞) an exponent representing a weight (the fuzzification constant, i.e., the fuzziness level of the classifier). The necessary condition for the minimum of (1.61) can be found by introducing the Lagrange multipliers λ_i, which define the following cost function subject to the N constraints of (1.56):

J̄ = J + Σ_{i=1}^{N} λ_i ( Σ_{k=1}^{K} r_ik − 1 )
  = Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik^m ||x_i − μ_k||² + Σ_{i=1}^{N} λ_i ( Σ_{k=1}^{K} r_ik − 1 )   (1.62)

Now, differentiating (1.62) with respect to μ_k and λ_i and setting the result to zero, we obtain

μ_k = Σ_{i=1}^{N} r_ik^m x_i / Σ_{i=1}^{N} r_ik^m,   (1.63)

and

r_ik = 1 / Σ_{t=1}^{K} ( ||x_i − μ_k|| / ||x_i − μ_t|| )^{2/(m−1)}    k = 1, ..., K;  i = 1, ..., N   (1.64)

In the batch version, the algorithm is reported in Algorithm 2. We observe the iterative nature of the algorithm, which alternately determines the centroids μ_k of the clusters and the memberships r_ik until convergence. It should be noted that, if the exponent is m = 1 in the objective function (1.61), the fuzzy C-means algorithm approximates the hard K-means algorithm, since the membership levels of the patterns to the clusters produced by the algorithm tend to become 0 and 1. At the extreme value m → ∞, the objective function has value J → 0. Normally m is chosen equal to 2. The FCM classifier is often applied, in particular, to the classification of multispectral images. However, performance remains limited by the intrinsic geometry of the clusters. As for K-means, also for FCM an elongated or curved grouping of the patterns in the feature space can produce unrealistic results.

Algorithm 2 Fuzzy C-means algorithm
1: Initialize the membership matrix R with random values between 0 and 1, keeping Eq. (1.56) satisfied.
2: Calculate the fuzzy cluster centers with (1.63) ∀ k = 1, ..., K.
3: Calculate the cost function according to Eq. (1.61). Evaluate the stop criterion: stop if J < Threshold or if the difference J_t − J_{t−1} < Threshold, with t the iteration step.
4: Calculate the new matrix R using (1.64) and go to step 2.
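Below is a minimal sketch of the alternating updates of Algorithm 2, written with numpy and not taken from the book; the small epsilon added to the distances, the tolerance value, and the toy data are assumptions for illustration only.

```python
import numpy as np

def fuzzy_c_means(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means following Algorithm 2: alternate centers (1.63) and memberships (1.64)."""
    rng = np.random.default_rng(seed)
    R = rng.random((len(X), K))
    R /= R.sum(axis=1, keepdims=True)                  # enforce the constraint (1.56)
    J_old = np.inf
    for _ in range(max_iter):
        W = R ** m
        mu = (W.T @ X) / W.sum(axis=0)[:, None]        # cluster centers, Eq. (1.63)
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) + 1e-12
        J = np.sum(W * dist ** 2)                      # objective function, Eq. (1.61)
        if abs(J_old - J) < tol:                       # stop criterion of step 3
            break
        J_old = J
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        R = 1.0 / ratio.sum(axis=2)                    # memberships, Eq. (1.64)
    return mu, R

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in ([0, 0], [3, 3])])
centers, R = fuzzy_c_means(X, K=2)
print(np.round(centers, 2))
```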


1.7 Statistical Method

The statistical approach, in analogy to the deterministic one, uses a set of decision rules based, however, on statistical theory. In particular, the discriminant functions can be constructed by estimating the density functions and applying the Bayes rules. In this case, the proposed classifiers are of the parametric type, extracting information directly from the observations.

1.7.1 MAP Classifier

A deterministic classifier systematically assigns a pattern to a given class. In reality, a vector pattern x, described by the features x_1, ..., x_M, can assume values such that a classifier may incorrectly associate it with one of the classes ω_k, k = 1, ..., K. Therefore, the need emerges for an optimal classifier that can identify patterns based on criteria with minimal error. Bayes' decision theory is based on the foundation that the decision/classification problem can be formulated in stochastic terms, under the hypothesis that the densities of the variables are known or estimated from the observed data. Moreover, this theory can estimate costs or risks, in a probabilistic sense, related to the decisions taken.
Let x be the pattern to be classified in one of the K classes ω_1, ..., ω_K of which we know the a priori probabilities p(ω_1), ..., p(ω_K) (which are independent of observations). A simple decision rule, to minimize the probability of error in assigning the class ω_i, is the following:

p(ω_i) > p(ω_k)    k = 1, ..., K;  k ≠ i   (1.65)

This rule assigns all the patterns to one class, that is, the class with the highest a priori probability. It makes sense if the a priori probabilities of the classes are very different from each other, that is, p(ω_i) ≫ p(ω_k). We can now assume to know, for each class ω_i, an adequate number of sample patterns x, from which we can evaluate the conditional probability distribution p(x|ω_i) of x given the class ω_i, that is, estimate the probability density of x assuming the association with the class ω_i. At this point, it is possible to adopt a probabilistic decision rule that associates a generic pattern x with a class ω_i, in terms of conditional probability, if the probability of the class ω_i given the generic pattern x, or p(ω_i|x), is greater than that of all other classes. In other words, the generic pattern x is assigned to the class ω_i if the following condition is satisfied:

p(ω_i|x) > p(ω_k|x)    k = 1, ..., K;  k ≠ i   (1.66)

The probability p(ωi |x) is known as the posterior probability of the class ωi given x, that is, the probability that having observed the generic pattern x, the class to which


it belongs is ω_i. This probability can be estimated with the Bayes theorem^11:

p(ω_i|x) = p(x|ω_i) p(ω_i) / p(x)   (1.67)

where

(a) ω_i is the unknown class, to be estimated, to associate with the observed pattern x;
(b) p(ω_i) is the a priori probability of the class ω_i, that is, it represents the part of our knowledge on which the classification is based (the classes can also be equiprobable);
(c) p(x|ω_i) is the conditional probability density function of the class, interpreted as the likelihood of the pattern x, which occurs when its features are known to belong to the class ω_i;

11 The Bayes theorem can be derived from the definition of conditional probability and the total probability theorem. If A and B are two events, the probability of the event A when the event B has already occurred is given by

p(A|B) = p(A ∩ B) / p(B)   if p(B) > 0

and is called the conditional probability of A conditioned on B, or simply the probability of A given B. The denominator p(B) simply normalizes the joint probability p(A, B) of the events that occur together with B. If we consider the space S of the events partitioned into B_1, ..., B_K, any event A can be represented as A = A ∩ S = A ∩ (B_1 ∪ B_2 ∪ ... ∪ B_K) = (A ∩ B_1) ∪ (A ∩ B_2) ∪ ... ∪ (A ∩ B_K). If B_1, ..., B_K are mutually exclusive, we have that p(A) = p(A ∩ B_1) + ... + p(A ∩ B_K) and, replacing the conditional probabilities, the total probability of any event A is given by

p(A) = p(A|B_1) p(B_1) + ... + p(A|B_K) p(B_K) = Σ_{k=1}^{K} p(A|B_k) p(B_k)

By combining the definition of conditional probability and the total probability theorem, we obtain the probability of the event B_i, supposing that the event A has happened, with the following:

p(B_i|A) = p(A ∩ B_i) / p(A) = p(A|B_i) p(B_i) / Σ_{k=1}^{K} p(A|B_k) p(B_k)

known as the Bayes Rule or Theorem, which represents one of the most important relations in the field of statistics.


(d) p(x) is known as the evidence, i.e., the absolute probability density given by

p(x) = Σ_{k=1}^{K} p(x|ω_k) p(ω_k)   with   Σ_{k=1}^{K} p(ω_k) = 1   (1.68)

which represents a normalization constant and does not influence the decision.
From (1.67), the discriminant functions d_k(x) can be considered as

d_k(x) = p(x|ω_k) p(ω_k)    k = 1, ..., K   (1.69)

which, up to a constant factor, corresponds to the value of the posterior probability p(ω_k|x), which expresses how often a pattern x belongs to the class ω_k. The rule (1.65) can, therefore, be rewritten in terms of the optimal rule: the generic pattern x is classified and associated with the class ω_k if the posterior probability p(ω_k|x) is the highest of all possible a posteriori probabilities:

p(ω_k|x) = max_{i=1,...,K} p(ω_i|x)   (1.70)

known as the maximum a posteriori (MAP) probability decision rule, also known as the Bayes optimal rule for minimum classification error.

1.7.2 Maximum Likelihood Classifier—ML

The MAP decision rule^12 can be re-expressed in another form. For simplicity, we consider the rule for two classes ω_1 and ω_2, and we apply the rule defined by (1.66), which assigns a generic pattern x to the class with the highest posterior probability. In this case, applying the Bayes rule (1.67) to (1.66) and eliminating the common term p(x), we would have

p(x|ω_1) p(ω_1) > p(x|ω_2) p(ω_2)   (1.71)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. This last relationship can be rewritten as follows:

ℓ(x) = p(x|ω_1) / p(x|ω_2) > p(ω_2) / p(ω_1)   (1.72)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. ℓ(x) is called the likelihood ratio and the corresponding decision rule is known as the likelihood test. We

12 In Bayesian statistics, MAP (Maximum A Posteriori) indicates an estimate of an unknown quantity that equals the mode of the posterior probability distribution. In essence, the mode is the value that occurs most frequently in a distribution (peak value).


Fig. 1.14 The conditional density functions of the classes and their decision regions

observe that in the likelihood test the evidence p(x) does not appear (while it is necessary for the MAP rule for the calculation of the posterior probability p(ω_k|x)), since it is a constant not influenced by the class ω_k. The likelihood test is, de facto, a test that estimates how good the assignment decision is, based on the comparison between ratios of a priori knowledge, i.e., the conditional probabilities (likelihoods) and the a priori probabilities. If the latter, p(ω_k), turn out to be equiprobable, then the test is performed only by comparing the likelihoods p(x|ω_k), thus becoming the Maximum Likelihood (ML) rule. This last rule is also used when the p(ω_k) are not known.
The decision rule (1.71) can also be expressed in geometric terms by defining the decision regions. Figure 1.14 shows the decision regions ℛ_1 and ℛ_2 for the separation of two classes, assuming the classes ω_1 and ω_2 both with Gaussian distribution. In the figure, the graphs of p(x|ω_i) p(ω_i), i = 1, 2, are displayed, with different a priori probabilities p(ω_i). The theoretical boundary of the two regions is determined by

p(x|ω_1) p(ω_1) = p(x|ω_2) p(ω_2)

In the figure, the boundary corresponds to the point of intersection of the two curves. Alternatively, the boundary can be determined by calculating the likelihood ratio ℓ(x) and setting a threshold θ = p(ω_2)/p(ω_1). Therefore, with the likelihood test, the decision regions would be

ℛ_1 = {x : ℓ(x) > θ}   and   ℛ_2 = {x : ℓ(x) < θ}

1.7.2.1 Example of Nonparametric Bayesian Classification
We are interested in classifying the land areas (class ω_1) and water areas (class ω_2) of a territory, having two spectral bands available. No knowledge of the a priori and conditional probabilities is assumed other than that obtained by observing the spectral measurements of samples of the two classes extracted from the two available bands. As shown in Fig. 1.15, the samples associated with the two classes are extracted from 4 windows (i.e., the training set, 2 for each class) identified on the two components of the


Fig. 1.15 Example of the nonparametric Bayes classifier that classifies 2 types of territory (land and river) in the spectral domain starting from the training sets extracted from two bands (x_1, x_2) of a multispectral image

multispectral image. In the 2D spectral domain, the training set samples, of which we know the membership class, are projected. A generic pixel pattern with spectral measurements x = (x_1, x_2) (indicated in the figure with the circle) is projected into the feature domain and associated with one of the classes using the nonparametric MAP classifier. From the training sets, we can have a very rough estimate of the a priori probabilities p(ω_i), of the likelihoods p(x|ω_i), and of the evidence p(x), as follows:

p(ω_1) = n_ω1 / N = 18/68 = 0.26        p(ω_2) = n_ω2 / N = 50/68 = 0.74
p(x|ω_1) = ni_ω1 / n_ω1 = 4/18 = 0.22    p(x|ω_2) = ni_ω2 / n_ω2 = 7/50 = 0.14

p(x) = Σ_{i=1}^{2} p(x|ω_i) p(ω_i) = 0.22 × 0.26 + 0.14 × 0.74 = 0.1608

where n_ω1 and n_ω2 indicate the number of samples in the training sets belonging, respectively, to the land and water class, and ni_ω1 and ni_ω2 indicate the number of samples belonging, respectively, to the land and water class found in the window centered in x, the pattern to classify. Applying the Bayes rule (1.67), we obtain the posterior probabilities:

p(ω_1|x) = p(ω_1) p(x|ω_1) / p(x) = (0.26 × 0.22) / 0.1608 = 0.36

p(ω_2|x) = p(ω_2) p(x|ω_2) / p(x) = (0.74 × 0.14) / 0.1608 = 0.64

For the MAP decision rule (1.70), the pattern x is assigned to the class ω2 (water zone).
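The counting estimates of this example can be reproduced with a few lines of code; the sketch below (not from the book) uses the sample counts of the example above, and the variable names are illustrative. The posteriors come out close to 0.36 and 0.64 (small differences are due only to rounding of the intermediate values).

```python
import numpy as np

# counts taken from the example above
N = 68               # total training samples
n_w1, n_w2 = 18, 50  # samples of class land (w1) and water (w2)
k_w1, k_w2 = 4, 7    # samples of each class falling in the window centered on x

p_w1, p_w2 = n_w1 / N, n_w2 / N              # a priori probabilities
lik_w1, lik_w2 = k_w1 / n_w1, k_w2 / n_w2    # likelihoods p(x|w_i)
evidence = lik_w1 * p_w1 + lik_w2 * p_w2     # p(x), Eq. (1.68)

post = np.array([lik_w1 * p_w1, lik_w2 * p_w2]) / evidence   # Bayes rule (1.67)
print(np.round(post, 2))                                     # ~[0.36, 0.64]
print("MAP class:", "w1 (land)" if post[0] > post[1] else "w2 (water)")
```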


1.7.2.2 Calculation of the Bayes Error Probability
To demonstrate that the Bayes rule (1.71) is optimal, in the sense that it minimizes the classification error of a pattern x, we need to evaluate p(error) in terms of the posterior probability p(error|x). In the classification context with K classes, the probability of Bayes error can be expressed as follows:

p(error) = Σ_{k=1}^{K} p(error|ω_k) p(ω_k)   (1.73)

where p(error|ω_k) is the probability of incorrect classification of a pattern associated with the class ω_k, which is given by

p(error|ω_k) = ∫_{C[ℛ_k]} p(x|ω_k) dx   (1.74)

With C[ℛ_k] is indicated the complement of the region ℛ_k, that is, C[ℛ_k] = ∪_{j=1, j≠k}^{K} ℛ_j. That being said, we can rewrite the probability of incorrect classification of a pattern in the following form:

p(error) = Σ_{k=1}^{K} ∫_{C[ℛ_k]} p(x|ω_k) p(ω_k) dx
         = Σ_{k=1}^{K} p(ω_k) [ 1 − ∫_{ℛ_k} p(x|ω_k) dx ]
         = 1 − Σ_{k=1}^{K} p(ω_k) ∫_{ℛ_k} p(x|ω_k) dx   (1.75)

From which it is observed that the minimization of the error is equivalent to maximizing the probability of correct classification, given by

Σ_{k=1}^{K} p(ω_k) ∫_{ℛ_k} p(x|ω_k) dx   (1.76)

This goal is achieved by maximizing the integrals in (1.76), which is equivalent to choosing the decision regions ℛ_k for which p(ω_k) p(x|ω_k) is the highest value over all regions, exactly as imposed by the MAP rule (1.70). This ensures that the MAP rule minimizes the probability of error. It is observed (see Fig. 1.16) how the decision boundary shifts, with respect to the point of equal likelihood p(x|ω_1) = p(x|ω_2), for different values of the a priori probability.


Fig. 1.16 Elements that characterize the probability of error, considering the conditional density functions of the classes with normal distribution of equal variance and unequal a priori probability. The blue area corresponds to the probability of error in assigning a pattern of the class ω_1 (lying in the region ℛ_1) to the class ω_2. The area in red represents the opposite situation

1.7.2.3 Calculation of the Minimum Risk for the Bayes Rule
With the Bayes rule (1.70), it has been shown that assigning a pattern to the class with the highest a posteriori probability minimizes the classification error. With the calculation of the minimum error given by (1.75), it is highlighted that a pattern is assigned to a class with the probability of error weighted by the same unit cost. In real applications, a wrong assignment can have very different intrinsic consequences. Incorrectly classifying a pixel pattern of an image (normally millions of pixels) is much less severe than misclassifying a pattern used to identify a type of disease. It is, therefore, useful to formulate a new rule that defines, through a cost function, how to weigh differently the probability of assigning a pattern to a class. In essence, the problem is formulated in terms of the minimum risk theory (also called utility theory in economics) which, in addition to considering the probability of an event's occurrence, also takes into account a cost associated with the decision/action (in our case, the action of assigning a pattern to a class).
Let Ω = {ω_1, ..., ω_K} be the set of K classes and A = {α_1, ..., α_a} the set of a possible actions. We now define a cost function C(α_i|ω_j) that indicates the cost incurred by performing the action α_i, which assigns a pattern x to the class ω_i, when instead the true class (the state of nature) is ω_j. With this setting, we can evaluate the conditional risk (or expected cost) R(α_i|x) associated with the action α_i of assigning the observed pattern x to a class. Knowing the posterior probabilities p(ω_j|x), but not knowing the true class to associate with the pattern, the conditional risk associated with the action α_i is given by

R(α_i|x) = Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x)    i = 1, ..., a   (1.77)


The zero-order conditional risk R, considering the zero-order cost function defined by

C(α_i|ω_j) = { 0  if i = j
               1  if i ≠ j }    i, j = 1, ..., K   (1.78)

results, according to the Bayes decision rule, in

R(α_i|x) = Σ_{j≠i} p(ω_j|x) = 1 − p(ω_i|x)    i = 1, ..., a   (1.79)

from which it can be deduced that we can minimize the conditional risk by selecting the action that minimizes R(α_i|x) to classify the observed pattern x. It follows that we need to find a decision rule α(x), which relates the input space of the features to that of the actions, and calculate the overall risk R_T given by

R_T = Σ_{i=1}^{K} ∫_{ℛ_i} R(α_i|x) p(x) dx = Σ_{i=1}^{K} ∫_{ℛ_i} [ Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x) ] p(x) dx   (1.80)

which will be minimal by selecting, for each x, the α_i for which R(α_i|x) is minimum. The Bayes rule guarantees overall risk minimization by selecting the action α* which minimizes the conditional risk (1.77):

α* = arg min_{α_i} R(α_i|x) = arg min_{α_i} Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x)   (1.81)

thus obtaining the Bayes Risk, which is the best achievable result.
Let us now calculate the minimum risk for an example of binary classification. Let α_1 be the action of deciding that the correct class is ω_1, and similarly α_2 for ω_2. We evaluate the conditional risks with (1.77) rewritten in extended form:

R(α_1|x) = C_11 p(ω_1|x) + C_12 p(ω_2|x)
R(α_2|x) = C_21 p(ω_1|x) + C_22 p(ω_2|x)

The fundamental decision rule is to decide for ω_1 if R(α_1|x) < R(α_2|x), which in terms of posterior probability is equivalent to deciding for ω_1 if

(C_21 − C_11) p(ω_1|x) > (C_12 − C_22) p(ω_2|x)

highlighting that the posterior probabilities are scaled by the cost differences (normally positive). Applying the Bayes rule to the latter (remembering (1.71)), we decide


Fig. 1.17 Thresholds of the likelihood ratio ℓ(x) (related to the distributions of Fig. 1.16) for the zero-order cost function

for ω_1 if

(C_21 − C_11) p(x|ω_1) p(ω_1) > (C_12 − C_22) p(x|ω_2) p(ω_2)   (1.82)

Assuming that C_21 > C_11, and remembering the definition of the likelihood ratio expressed by (1.72), the previous Bayes rule can be rewritten as follows:

ℓ(x) = p(x|ω_1) / p(x|ω_2) > (C_12 − C_22) p(ω_2) / [ (C_21 − C_11) p(ω_1) ]   (1.83)

Equation (1.83) states the optimal decision property: if the likelihood ratio exceeds a certain threshold, which is independent of the observed pattern x, decide for the class ω_1. An immediate application of (1.83) is given by the zero-order cost function (1.78). In this case, the decision rule is the MAP one, that is, x is classified to the class ω_i if p(ω_i|x) > p(ω_j|x) for each j ≠ i. Expressing it in terms of likelihood with (1.83), we will have the threshold

θ_C = (C_12 − C_22) p(ω_2) / [ (C_21 − C_11) p(ω_1) ]

for which, given the threshold θ_C, it is decided that x belongs to the class ω_1 if p(x|ω_1)/p(x|ω_2) > θ_C. For the zero-order cost function with costs C_11 = C_22 = 0 and C_12 = C_21 = 1, we have θ_C = 1 · p(ω_2)/p(ω_1) = θ_1, while for C_11 = C_22 = 0, C_12 = 2, C_21 = 1, we have θ_C = 2 · p(ω_2)/p(ω_1) = θ_2. Figure 1.17 shows the graph of the likelihood ratio ℓ(x) and the thresholds θ_1 and θ_2. Considering the generic threshold θ = C_12/C_21, it is observed that an increase of the cost on the class ω_1 leads to a reduction of the corresponding region ℛ_1. This implies that for equiprobable classes p(ω_1) = p(ω_2) (with threshold θ_1), we have C_12 = C_21 = 1, while with the threshold θ_2, where p(ω_1) > p(ω_2), the region ℛ_1 is reduced. This highlights the advantage of dealing with the decision problem, through the likelihood ratio, by means of the scalar value of the threshold θ, without direct knowledge of the regions that normally describe the N-dimensional feature space.
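A short sketch of the cost-sensitive likelihood-ratio test of Eq. (1.83) is given below; it is not the book's code, and the Gaussian class models, priors, and cost values are illustrative assumptions chosen only to show how the threshold θ_C is used.

```python
import numpy as np

def gauss(x, mu, sigma):
    """1-D Gaussian density, used here as the class-conditional likelihood p(x|w_i)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# two 1-D Gaussian class models with unequal priors (illustrative values)
mu1, mu2, sigma = 0.0, 2.0, 1.0
p1, p2 = 0.3, 0.7
C11, C12, C21, C22 = 0.0, 2.0, 1.0, 0.0        # costs of correct/incorrect decisions

theta_C = (C12 - C22) * p2 / ((C21 - C11) * p1)  # threshold of Eq. (1.83)

def decide(x):
    lr = gauss(x, mu1, sigma) / gauss(x, mu2, sigma)   # likelihood ratio l(x), Eq. (1.72)
    return "w1" if lr > theta_C else "w2"

for x in (-1.0, 0.5, 1.0, 2.5):
    print(x, decide(x))
```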


1.7.2.4 Bayes Decision Rule with Rejection
In the preceding paragraphs, we have seen that a Bayesian classifier, however good, can erroneously assign a pattern to a class. When the error turns out to be very expensive, it may be rational not to risk a wrong decision: it is useful not to take a decision at the moment and, if necessary, delay it to a later decision-making phase. In this way, the patterns that would potentially be classified incorrectly are grouped in a class ω_0 and classified as rejected, belonging to a region ℛ_0 in the feature domain. Subsequently, they can be analyzed with a manual or automatic ad hoc classification procedure. In classification applications with differentiated decision costs, the strategy is to define a compromise between acceptable error and rational rejection. This error–rejection compromise was initially formulated in [16], defining a general relationship between the probability of error and that of rejection. According to Chow's rule, a pattern x is rejected if the maximum posterior probability is lower than a threshold value t ∈ [0, 1]:

max_{i=1,...,K} p(ω_i|x) = p(ω_k|x) < t   (1.84)

Alternatively, a pattern x is accepted and assigned according to the Bayes rule to the class ω_k if

max_{i=1,...,K} p(ω_i|x) = p(ω_k|x) ≥ t   (1.85)

It is shown that the threshold t to be chosen to carry out the rejection must be t < (K − 1)/K, where K is the number of classes. In fact, if the classes are equiprobable, the minimum value reachable by max_i p(ω_i|x) is 1/K, because the following relation must be satisfied:

1 = Σ_{i=1}^{K} p(ω_i|x) ≤ K · max_{i=1,...,K} p(ω_i|x)   (1.86)

Figure 1.18 shows the rejection region ℛ_0 associated with a threshold t for the two Gaussian classes of Fig. 1.16. The patterns that fall into the regions ℛ_1 and ℛ_2 are regularly classified with the Bayes rule. It is observed that the value of the threshold t strongly influences the size of the rejection region. For a given threshold t, the probability of correct classification c(t) is given by (1.76), considering only the acceptance regions (ℛ_0 is excluded):

c(t) = Σ_{k=1}^{K} p(ω_k) ∫_{ℛ_k} p(x|ω_k) dx   (1.87)

The unconditional probability of rejection r(t), that is, the probability of a pattern falling into the region ℛ_0, is given by

r(t) = ∫_{ℛ_0} p(x) dx   (1.88)


Fig. 1.18 Chow's rule applied for classification with rejection for two classes with Gaussian distribution. The rejection threshold t defines the rejection region ℛ_0, the area where patterns with a high level of classification uncertainty fall

The value of the error e(t), associated with the probability of accepting to classify a pattern and classifying it incorrectly, is given by

e(t) = Σ_{k=1}^{K} ∫_{ℛ_k} [ 1 − max_i p(ω_i|x) ] p(x) dx = 1 − c(t) − r(t)   (1.89)

From this relation, it is evident that a given value of correct classification c(t) = 1 − r(t) − e(t) can be obtained by choosing to reduce the error e(t) and simultaneously increase the rejection rate r(t), that is, by harmonizing the error–rejection compromise, the two being inversely related to each other. If a cost C_ij is also considered for the assignment of a pattern to the rejection class ω_0 (normally lower than that of a wrong classification), the cost function is modified as follows:

C_ij = { 0  if i = j
         1  if i ≠ j        i = 0, ..., K;  j = 1, ..., K   (1.90)
         t  if i = 0 (rejection class ω_0) }

e−r e−c

(1.92)

where with c ≤ r it is guaranteed that t ∈ [0, 1], while if e = r we will back to the Bayes rule. In essence, Chow’s rejection rule attempts to reduce the error by rejecting border patterns between regions whose classification is uncertain.

48

1 Object Recognition

1.7.3 Other Decision Criteria The Bayes criteria described, based on the MAP decider or maximum likelihood ML, need to define the cost values Ci j and know the probabilities a priori p(ωi ). In applications where this information is not known, in the literature [17,18], the decision criterion Minimax and that of Neyman–Pearson are proposed. The Minimax criterion is used in applications where the recognition system must guarantee good behavior over a range of possible values rather than for a given priori probability value. In these cases, although the a priori probability is not known, its variability can be known for a given interval. The strategy used in these cases is to minimize the maximum value of the risk by varying the prior probability. The Neyman–Pearson criterion is used in applications where there is a need to limit the probability of error within a class instead of optimizing the overall conditional risk as with the Bayes criterion. For example, we want to fix a certain attention on the probability of error associated with a false alarm and minimize the probability of failure to alarm as required in radar applications. This criterion evaluates the probability of error 1 in classifying patterns of ω1 in the class ω2 and vice versa the probability of error 2 for patterns of the class ω2 attributed to the class ω1 . The strategy of this criterion is to minimize the error on the class ω1 by imposing to find the minimum of 1 = 2 p(x|ω1 )dx correlating it to the error  limiting it  below a value α, that is, 2 = 1 p(x|ω2 )dx < α. The criterion is set as a constrained optimization problem, whose solution is due to the Lagrange multipliers approach which minimizes the objective function: F = 1 + λ(2 − α) It is highlighted the absence of p(ωi ) and costs Ci j while the decision regions i are to be defined with the minimization procedure.

1.7.4 Parametric Bayes Classifier The Bayes decision rule (1.67) requires knowledge of all the conditional probabilities of the classes and a priori probabilities. Its functional and exact parameters of these density functions are rarely available. Once the nature of the observed patterns is known, it is possible to hypothesize a parametric model for the probability density functions and estimate the parameters of this model through sample patterns. Therefore, an approach used to estimate conditional probabilities p(x|ωi ) is based on a training set of patterns Pi = {x1 , . . . , xn i } xi j ∈ Rd associated with the class ωi . In the parametric context, we assume the form (for example, Gaussian) of the probability distribution of the classes and the unknown parameters θi that describe it. The estimation of the parameters θk , k = 1, . . . , n p (for example in the Gaussian form are θ1 = μ; θ2 = σ and p(x) = N (μ, σ )) can be done with known approaches of maximum likelihood or Bayesian estimation.

1.7 Statistical Method

49

1.7.5 Maximum Likelihood Estimation—MLE The parameters that characterize the hypothesized model (e.g., Gaussian) are assumed known (for example, mean and variance) but are to be determined (they represent the unknowns). The estimation of the parameters can be influenced by the choice of the training sets and an optimum result is obtained using a significant number of samples. With the M L E method, the goal is to estimate the parameters θˆi which maximizes the likelihood function p(x|ωi ) = p(x|θi ) defined using the training set Pi : θˆi = arg max [ p(P j |θ j )] = arg max [ p(x j1 , . . . , x jn i |θ j )] (1.93) θj

θj

If we assume that the patterns of the training set Pi = {x1 , . . . , xn i } form a sequence of variables random independent and identically distributed (iid),13 the likelihood function p(Pi |θi ) associated with class ωi can be expressed as follows: p(Pi |θi ) =

ni 

p(xki |θi )

(1.94)

k=1

This probability density is considered as an ordinary function of the variable θi and dependent on the n i pattern of the training set. The assumption of i.i.d. in real applications: it is not maintained and needs to choose training sets as much as possible in the conditions of independence of the observed patterns. In common practice, assuming this independence, the MLE method (1.93) is the best solution to estimate the parameters that describe the known model of the probability density function p(Pi |θi ). A mathematical simplification is adopted expressing in logarithmic terms the (1.93) and replacing the (1.94), we get      ni ni ˆθi = arg max log p(xki |θi ) = arg max log p(xki |θi ) θk

k=1

θk

(1.95)

k=1

The logarithmic function has the property of being monotonically increasing besides the evidence of expressing the (1.93) in terms of sums instead of products, thus simplifying the procedure of finding the maximum especially when the probability function model has exponential terms, as happens with the assumption of Gaussian distribution. Given the independence of training sets Pi = x1 , . . . , xn i of patterns associated with K classes ωi , i = 1, . . . , K , we will omit the index i-th which indicates the class in estimating the related parameters θi . In essence, the parameter estimation procedure is repeated independently for each class.

13 Implies that the patterns all have the same probability distribution and are all statistically independent.

50

1 Object Recognition

1.7.5.1 MLE Estimate for Gaussian Distribution with Mean Unknown According to the classical definition of the central limit theorem, the probability distribution of the sum (or mean) of the i.i.d variables with finite mean and variance approaches the Gaussian distribution. We denote by P = {x1 , . . . , xn } the training set of pattern d-dimensional of a generic class ω whose distribution is assumed to be Gaussian p(x) = N (μ, ) where μ is the vector unknown mean and  is the covariance matrix. Considering the generic pattern xk the MLE estimate of the mean vector μ for the (1.95), we have ˆ = arg max θˆ = μ θ

n  k=1 n 

log( p(xk |θ))

  1 T −1 exp − − μ)  (x − μ) (x k k (2π )d/2 ||1/2 2 θ k=1     n  1 1 T −1 log − − μ)  (x − μ) (x = arg max k k (2π )d/2 ||1/2 2 θ

= arg max

 log

1

(1.96)

k=1

The maximum likelihood value of the function for the sample patterns P is obtained differentiating with respect the parameter θ and setting to zero: ∂

n

k=1 log( p(xk |θ ))

∂θ

=

n 

 −1 (xk − μ) = 0

(1.97)

k=1

from which, by eliminating the factor , we get ˆ = μ

n 1 xk n

(1.98)

k=1

It is observed that the estimate of the mean (1.98) obtained with the MLE approach leads to the same result of the mean calculated in the traditional way with the average of the training set patterns.

1.7.5.2 MLE Estimate for Gaussian Distribution with μ and  Unknown The problem resolves as before, the only difference consists in the calculation of the gradient ∇θ instead of the derivative since we have two variables to estimate θˆ = ˆ T . For simplicity, let’s consider the one-dimensional case with ˆ ) (θˆ 1 , θˆ 2 )T = (μ, the training set of patterns P assumed with the normal distribution p(x) = N (μ, σ 2 ), with μ and σ to estimate. For the generic pattern xk calculating the gradient, we have ⎡ ⎤ n ∂  log( p(x |θ )) k ⎢ ∂θ1 k=1 ⎥  1 n  ⎢ ⎥  θ2 (x k − θ1 ) ⎢ ⎥ (1.99) ∇θ = ⎢ (xk −θ1 )2 1 ⎥= n ⎣ ∂  ⎦ k=1 − 2θ2 + 2θ22 log( p(xk |θ )) ∂θ2 k=1

1.7 Statistical Method

51

The maximum likelihood condition is obtained by setting the gradient function to zero (∇θ = 0) (1.99) obtaining n  1 (xk − θˆ1 ) = 0 θˆ2



k=1

n n   1 (xk − θˆ1 )2 + =0 2θˆ2 2θˆ 2 k=1

k=1

2

from which, with respect to θˆ1 and θˆ2 , we get the estimates of μ and σ 2 , respectively, as follows: θˆ1 = μˆ =

n 1 xk n

θˆ2 = σˆ 2 =

k=1

n 1 (xk − μ) ˆ 2 n

(1.100)

k=1

The expressions (1.100) MLE estimate of variance and mean correspond to the traditional variance and mean calculated on training set patterns. Similarly, it can be shown [17] that the MLE estimates, for a multivariate Gaussian distribution in d-dimensional, are the traditional mean vector μ and the covariance matrix , given by n n 1 1 ˆ = ˆ k − μ) ˆ T ˆ =  xk (xk − μ)(x (1.101) μ n n k=1

k=1

Although the maximum likelihood estimates correspond to the traditional calculation methods of the mean and covariance, the degree of reliability of these values with respect to the real ones remains to be verified. In other words, how much the hypothesized Gaussian distribution adapts to the training set of the selected pattern. In statistics, this is verified by evaluating whether the estimated parameter has a bias (distortion or deviation), that is, there is a difference between the expected value of an estimator and the real value of the parameter to be estimated. The distortion of the MLE estimator is verified if the expected value is different from the quantity it estimates. In this case, for the mean parameter, we have    n n 1 1 xk = E[xk ] = μ E[μ] ˆ =E n n k=1

(1.102)

k=1

from which it results that the estimated mean is not distorted (unbiased), while for the estimated variance with MLE, we have    n n−1 2 1 2 σ = σ 2 = (xk − μ) ˆ E[σˆ ] = E n n 2

(1.103)

k=1

from which it emerges that the variance is distorted (biased). It is shown that the magnitude of a distorted estimate is related to the number of samples considered,

52

1 Object Recognition

for n → ∞ asymptotically the bias is zero. A simple estimate unbiased for the covariance matrix is given by ˆU = 

1  ˆ k − μ) ˆ T (xk − μ)(x n−1 n

(1.104)

k=1

1.7.6 Estimation of the Distribution Parameters with the Bayes Theorem The starting conditions with the Bayesian approach are identical to those of the maximum likelihood, i.e., from the training set of pattern P = {x1 , . . . , xn } x j ∈ Rd associated with the generic class ω, we assume the form (for example, Gaussian) of the probability distribution and the unknown parameter vector θ describing it. With the Bayesian estimate (also known as Bayesian learning), θ is assumed as a random variable whose a priori probability distribution p(θ ) is known and intrinsically contained in the training set P. The goal is to derive the a posteriori probability distribution p(θ |x, P) from the training set of patterns of the class ω. Having said this, the formula of Bayes theorem (1.67) is rewritten as follows: p(ω|x, P) = p(θˆ |x, P) =

p(x|θˆ , P) p(ω) p(x)

(1.105)

where p(x|θˆ , P) is the parametric conditional probability density (likelihood) to estimate, derivable from training set P associated with the generic class ω; p(ω) is the a priori probability of the class of the known form, while p(x) can be considered as a normalization constant. In reality, the explicit probability density p(x) is not known, what is assumed to be known is the parametric form of this probability density of which we want to estimate the vector θ . In other words, the relationship between the density of p(x) on the training set P results through the vector θ which models the assumed form of the probability density to then affirm that the conditional density function p(x|θˆ ) is known. From the analysis of the training set of the patterns ˆ observed with the (1.105), we arrive at the posterior probability p(θ|P) with the hope of getting an estimate of the value of θ with the least uncertainty. By the definition of conditional probability, we have p(x, θ |P) = p(x|θ , P) p(θ |P) where it is evident that the probability p(x|θ , P) is independent of P since with the knowledge of θ , this probability is completely parametrically determined by the mathematical form of the probability distribution of x. It follows that we can write p(x, θ |P) = p(x|θ ) p(θ |P)

1.7 Statistical Method

53

and by the total probability theorem, we can calculate p(x|P) (very close to p(x) as much as possible) the conditional density function by integrating the joint probability density p(x|θ , P) on the variable θ :  p(x|P) = p(x|θ ) p(θ |P)dθ (1.106) where integration is extended over the entire parametric domain. With the (1.106), we have a relationship between the conditional probability of the class with the parametric conditional probability of the class (whose form is known) and the posterior probability p(θ|P) for the variable θ to estimate. With the Bayes theorem it is possible to express the posterior probability p(θ|P) as follows: p(θ |P) =

p(P|θ ) p(θ) p(P|θ) p(θ ) = p(P) p(P|θ) p(θ )dθ

(1.107)

Assuming that the patterns of the training set P form a sequence of independent and identically distributed (iid) random variables, the likelihood probability function p(P|θ) of the (1.107) can be calculated with the product of the conditional probability densities of the class ω: n  p(P|θ) = p(xk |θ) (1.108) k=1

1.7.6.1 Bayesian Estimation for Gaussian Distribution with Unknown Mean Let us now consider an example of a Bayesian learning application to estimate the mean of a one-dimensional normal distribution with known variance. Let us indicate with P = {x1 , . . . , xn } the training set of one-dimensional iid pattern of a generic class ω whose distribution is assumed to be Gaussian N (μ, σ 2 ) where μ is the unknown mean and σ 2 is the known variance. We, therefore, indicate with p(x|θ = μ) = N (x; μ, σ 2 ) the probability of resultant likelihood. We also assume that the a priori probability for the mean θ = μ has the normal distribution N (μ0 , σ02 ): − 1 2 (μ−μ0 )2 1 p(μ) = √ e 2σ0 2π σ0

(1.109)

Applicando la regola di Bayes (1.107) possiamo calcolare la probabilità a posteriori p(μ|P): p(μ|P ) =

n p(P |μ) p(μ) p0 (μ)  p(xk |μ) = p(P ) p(P ) k=1

n   − 12 (μ−μ0 )2 1 1 1 − 1 (x −μ)2 = √ e 2σ 2 k e 2σ0 √ p(P ) 2πσ0 2πσ k=1

(1.110)

54

1 Object Recognition

We observe that the posterior probability p(μ|P) depends on the a priori probability p(μ) and therefore from the training set of the selected patterns P. This dependence influences the Bayesian estimate, that is, the value of p(μ|P), observable with the increment n of the training set samples. The maximum of p(μ|P) is obtained by computing the partial derivative of the logarithm of the (1.110) with respect to μ that ∂ log p(μ|P) and equaling to zero, we have is, ∂μ   n  1 1 ∂ 2 2 − 2 (μ − μ0 ) + − 2 (xk − μ) = 0 ∂μ 2σ 2σ0 k=1

(1.111)

from which, after some algebraic considerations, we obtain μn =

n nσ02 1  σ2 μ + xk 0 σ 2 + nσ02 σ 2 + nσ02 n k=1 ' () * ' () * μ I nitial

(1.112)

Estimate M L E

It is highlighted  that for n → ∞, the estimate of μn tends to the estimate MLE (i.e., μ = k xk ) starting from the initial value μ0 . The standard deviation σn is calculated in the same way, given by 1 1 n = + 2 σn σ σ0

=⇒

σn2 =

σ02 σ 2 nσ02 + σ 2

(1.113)

from which it emerges that the posterior variance of μ, σn2 , tends to zero as well as 1/n for n → ∞. In other words, with the posterior probability p(μ|P) calculated with the (1.110), we get the best estimate μn of μ starting from the training set of n observed patterns, while σn2 represents the uncertainty of μ, i.e., its posterior variance. Figure 1.19 shows how Bayesian learning works, that is, as the number of samples in the training set increases, the p(μ|P) becomes more and more with the peak accentuated and narrowed toward the true value of the mean μ. The extension to the multivariate case [18] of the Bayesian estimate for Gaussian distribution with mean unknown μ and covariance matrix  known, is more complex, as is the calculation of the estimate of the mean and of the covariance matrix both not known for a normal distribution [17].

1.7.6.2 Bayesian Estimate of Conditional Density for Normal Distribution With the (1.110), we got the posterior density of the mean p(μ|P). Let us now propose to calculate the conditional density of the class p(x|P) = p(x|ω, P) remembering that for simplicity we had omitted the indication of the generic class but remained understood that the training set considered P was associated with a generic class ω. The density of the class p(x|P) is obtained by replacing in the (1.106) (considering

1.7 Statistical Method

55

p(μ|Ρ) 50

n=20 n=10

25 n=5 n=1

Iniziale 0

0.2

0.4

0.6

0.8

μ

Fig. 1.19 Bayesian learning of the mean of a Gaussian distribution with known variance starting from a training set of patterns

θ = μ) the posterior density p(μ|P) given by the (1.110) and assumed with normal distribution N (μn , σn2 ):  p(x|P ) =

p(x|μ) p(μ|P )dμ     1 1 1  μ − μn 2 1  x − μ 2 = exp − exp − dμ √ √ 2 σ 2 σn 2π σ 2π σn   1 (x − μn )2 1 f (σ, σn ) exp − = 2 2 2π σ σn 2 σ + σn 

where

 f (σ, σn ) =

(1.114)

 2   σn2 x + σ 2 μn 1 σ 2 + σn2 μ − dμ exp − 2 σ 2 σn2 σ 2 + σn2

We highlight that the density p(x|P), as a function of x, results with normal distribution: (1.115) p(x|P) ∼ N (μn , σ 2 + σn2 ) being proportional to the expression exp [−(1/2)(x − μn )2 /(σ 2 + σn2 )]. In conclusion, to get the conditional density of the class p(x|P) = p(x|ω, P), with a known parametric form described by the normal distribution, p(x|μ) ∼ N (μ, σ ), the parameters of the normal are replaced μ = μn and σ 2 = σn2 . In other words, the value of μn is considered as the mean true while the initial variance known σ , once the posterior density of the mean p(μ|P), is calculated, is increased by σn2 to account for the uncertainty on the significance of the training set due to the poor knowledge of the mean μ. This contrasts with the MLE approach which gets a point estimate of the parameters μˆ and σˆ 2 instead of directly estimating the class distribution p(x|ω, P).


1.7.7 Comparison Between Bayesian Learning and Maximum Likelihood Estimation

With the MLE approach, a point value of the parameter θ is estimated which maximizes the likelihood density p(P|θ). Therefore, with MLE we obtain an estimated value θ̂ without considering the parameter as a random variable. In other words, with reference to the Bayes equation (1.107), MLE treats the ratio p(θ)/p(P) = prior/evidence as a constant and does not take the a priori probability into account in the procedure for estimating θ. In contrast, Bayesian learning considers the parameter to be estimated θ as a random variable. Given the conditional density and the a priori probability, the Bayesian estimator obtains a probability distribution p(θ|P) associated with θ instead of a point value as happens with MLE. The goal is to select an expected value of θ assuming the smallest possible variance of the posterior density p(θ|P); if the variance is very large, the estimate of θ is considered poor. The Bayesian estimator incorporates the a priori information: if this is not significant, the posterior density is determined by the training set (data-driven estimator); if it is significant, the posterior density is determined by the combination of the prior density and the training set of patterns. If, moreover, the training set has a significant number of patterns, these dominate over the a priori information, making it less important. It follows that the two estimators are related when the number of patterns n of the training set is very high. Considering the Bayes equation (1.107), we observe that the denominator can be neglected as independent of θ, and we have

\[ p(\theta|P) \propto p(P|\theta)\,p(\theta) \tag{1.116} \]

where the likelihood density has a peak at the maximum θ = θ̂. With n very large, the likelihood density shrinks around its maximum value, and the integral that estimates the conditional density of the class with the Bayesian method (see Eq. 1.106) can be approximated as follows:

\[ p(x|P) = \int p(x|\theta)\,p(\theta|P)\,d\theta \;\cong\; \int p(x|\hat{\theta})\,p(\theta|P)\,d\theta = p(x|\hat{\theta}) \tag{1.117} \]

remembering that ∫ p(θ|P)dθ = 1. In essence, Bayesian learning, instead of finding a precise value of θ, computes an average of the density p(x|θ) over all values of θ, weighted by the posterior density of the parameters p(θ|P). In conclusion, the two estimators tend, approximately, to similar results when n is very large, while for small values of n the results are very different.


1.8 Bayesian Discriminant Functions

An effective way to represent a pattern classifier is based on discriminant functions. If we denote by g_i(x) a discriminant function (g_i : R^d → R) associated with the class ω_i, a classifier will assign a pattern x ∈ R^d to class ω_i if

\[ g_i(\mathbf{x}) > g_j(\mathbf{x}) \qquad \forall j \neq i \tag{1.118} \]

A classifier based on K discriminant functions, as many as there are classes, constitutes a computational model that selects, for a generic input pattern, the class whose discriminant function has the highest value (see Fig. 1.20). Considering discriminant functions based on the Bayes theory, a general form based on the minimum risk (see Eq. 1.77) is the following:

\[ g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x}) = -\sum_{j=1}^{K} C(\alpha_i|\omega_j)\,p(\omega_j|\mathbf{x}) \tag{1.119} \]

where the sign is motivated by the fact that the minimum conditional risk corresponds to the maximum discriminant function. In the case of the minimum-error (zero-one loss) criterion, the Bayesian discriminant function simplifies further to g_i(x) = p(ω_i|x). The choice of discriminant functions is not unique, since a generic function g_i(x) can be replaced by f(g_i(x)), where f(·) is a monotonically increasing function that does not affect the accuracy of the classification. We will see that such transformations are useful for simplifying expressions and calculations.


Fig. 1.20 Functional scheme of a statistical classifier. The computational model is of the bottom-up type, as shown by the arrows. The first level contains the features of the patterns, which are processed at the second level by the discriminant functions; the function with the highest value assigns the pattern to the class to which it belongs


The discriminant functions for classification with minimum error are

\[ g_i(\mathbf{x}) = p(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,p(\omega_i)}{\sum_{k=1}^{K} p(\mathbf{x}|\omega_k)\,p(\omega_k)} \tag{1.120} \]
\[ g_i(\mathbf{x}) = p(\mathbf{x}|\omega_i)\,p(\omega_i) \tag{1.121} \]
\[ g_i(\mathbf{x}) = \log p(\mathbf{x}|\omega_i) + \log p(\omega_i) \tag{1.122} \]

which produce the same classification results. As already described in Sect. 1.6, the discriminant functions partition the feature space into the K decision regions ℛ_i corresponding to the classes ω_i according to the following:

\[ \mathcal{R}_i = \{\mathbf{x}\,|\,g_i(\mathbf{x}) > g_j(\mathbf{x})\} \qquad \forall j \neq i \tag{1.123} \]

The decision boundaries that separate the regions correspond to the valleys between the discriminant functions, described by the equation g_i(x) = g_j(x). If we consider a classification with two classes ω_1 and ω_2, we have a single discriminant function g(x), which can be expressed as follows:

\[ g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) \tag{1.124} \]

In this case, the decision rule is

\[ \mathbf{x} \in \omega_1 \ \ \text{if}\ \ g(\mathbf{x}) > 0; \qquad \text{otherwise}\ \ \mathbf{x} \in \omega_2 \tag{1.125} \]

with the decision made according to the sign of g(x).

1.8.1 Classifier Based on Gaussian Probability Density

The Gaussian probability density function, also called normal distribution, has already been described in different contexts in this volume, considering its ability to model well the observations of various physical phenomena and its tractability in analytical terms. In the context of classification, it is widely used to model the observed measurements of the various classes, which are often subject to random noise. We also know from the central limit theorem that the distribution of the sum of a large number n of independent and identically distributed random variables tends to a normal distribution, independently of the distribution of the single random variables. A Bayesian classifier is based on the conditional probability density p(x|ω_i) and the a priori probability p(ω_i) of the classes. Let us now see how to obtain the discriminant functions of a Bayesian classifier by assuming classes with multivariate normal distribution (MND). The objective is to derive simple forms of the discriminant functions by exploiting some properties of the covariance matrix of the MNDs. A univariate normal density is completely described by the mean μ and the variance σ², abbreviated as p(x) ∼ N(μ, σ²). A multivariate normal density is


described by the mean vector μ and by the covariance matrix Σ, and in short form is indicated with p(x) ∼ N(μ, Σ) (see Eq. 1.101). For an arbitrary class ω_i with patterns described by d-dimensional vectors x = (x_1, …, x_d) with normal density, the mean vector is given by μ = (μ_1, …, μ_d) with μ_i = E[x_i], while the covariance matrix is Σ = E[(x − μ)(x − μ)^T], with dimensions d × d. We will see that the covariance matrix is fundamental to characterize the discriminant functions of a classifier based on the Gaussian model. We therefore recall some properties of Σ. It contains the covariance between each pair of features of a pattern x, represented by the elements outside the principal diagonal:

\[ \Sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)] \qquad i \neq j \tag{1.126} \]

while the diagonal components represent the variances of the features:

\[ \Sigma_{ii} = \sigma_i^2 = E[(x_i-\mu_i)^2] \qquad i = 1,\ldots,d \tag{1.127} \]

By definition, Σ is symmetric, that is, Σ_ij = Σ_ji, with d(d + 1)/2 free parameters, which with the d means μ_i become d(d + 3)/2. If the covariance elements are null (Σ_ij = 0), the features x_i and x_j are statistically independent. Let us now look at the distribution of the patterns x ∈ R^d in the feature space from the geometric point of view. Under the hypothesis of normal probability density, the patterns of a generic class ω_i tend to be grouped together in a cluster whose shape is described by the covariance matrix Σ and whose center of mass is defined by the mean vector μ_i. From the analysis of the eigenvectors and eigenvalues (see Sect. 2.10 Vol. II) of the covariance matrix, we know that the eigenvectors φ_i of Σ correspond to the principal axes of the hyperellipsoid, while the eigenvalues λ_i determine the lengths of the axes (see Fig. 1.21). An important characteristic of the multivariate normal density is the quadratic form that appears in the exponential of the normal function (1.96), which we rewrite here in the form:

\[ D_M^2 = (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \tag{1.128} \]

where D_M is known as the Mahalanobis distance between the mean vector μ and the pattern vector x. The locus of points with constant density value described by (1.128) is a hyperellipsoid of constant Mahalanobis distance.^14

14 If Σ = I, where I is the identity matrix, (1.128) becomes the Euclidean distance (norm 2). If Σ is diagonal, the resulting measure becomes the normalized Euclidean distance given by \(D(\mathbf{x},\boldsymbol{\mu}) = \sqrt{\sum_{i=1}^{d}(x_i-\mu_i)^2/\sigma_i^2}\). It should also be pointed out that the Mahalanobis distance can also be defined as a dissimilarity measure between two pattern vectors x and y with the same probability density function and with covariance matrix Σ, defined as \(D(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T\Sigma^{-1}(\mathbf{x}-\mathbf{y})}\).
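As an illustrative sketch (not from the text), the squared Mahalanobis distance of Eq. (1.128) can be computed with NumPy; the mean vector and covariance matrix below are arbitrary assumptions.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (Eq. 1.128): (x - mu)^T cov^{-1} (x - mu)."""
    diff = x - mu
    # solving the linear system avoids explicitly inverting the covariance matrix
    return float(diff @ np.linalg.solve(cov, diff))

mu = np.array([1.0, 2.0])                     # assumed class mean
cov = np.array([[2.0, 0.5], [0.5, 1.0]])      # assumed class covariance
x = np.array([2.0, 1.0])

print(mahalanobis_sq(x, mu, cov))             # hyperellipsoids are its level sets
print(mahalanobis_sq(x, mu, np.eye(2)))       # with cov = I it reduces to ||x - mu||^2
```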


Fig. 1.21 2D geometric representation in the feature domain of a Gaussian pattern distribution. We observe the grouping centered on the mean vector μ and the contour lines, which in the 2D domain are ellipses, representing the set of points with equal probability density of the Gaussian distribution. The orientation of the grouping is determined by the eigenvectors of the covariance matrix, while the eigenvalues determine the extension of the grouping

The dispersion of the class patterns centered on the mean vector is measurable by the volume of the hyperellipsoid in relation to the values of D_M and Σ. In this context, it may be useful to apply a linear transformation (see Chap. 2 Vol. II) to the patterns x in order to analyze the correlation level of the features and reduce their dimensionality, or to normalize the vectors x so as to have uncorrelated components with unit variance. The normalization of the features is obtained through the so-called whitening of the observations, that is, by means of a linear transformation (known as the whitening transform [19]) that yields uncorrelated features with unit variance.^15 With this transformation, the ellipsoidal distribution in the feature space becomes spherical (see Fig. 1.21), the covariance matrix after the transformation being equal to the identity matrix, Σ_y = I, and the Euclidean metric can be used instead of the Mahalanobis distance (Eq. 1.128).

15 The whitening transform is always possible and is again based on the eigendecomposition of the covariance matrix Σ = ΦΛΦ^T computed on the input patterns x, where Φ is the matrix of eigenvectors and Λ the diagonal matrix of eigenvalues. It can be shown that the whitening transformation is given by y = Λ^{-1/2} Φ^T x, which in fact is equivalent to first executing the orthogonal transform y = A^T x = Φ^T x and then normalizing the result with Λ^{-1/2}. In other words, with the first transformation we obtain the principal components, and with the normalization the distribution of the data is made symmetrical. The direct (whitening) transformation is y = A_w x = Λ^{-1/2} Φ^T x.
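A minimal NumPy sketch of this eigendecomposition-based whitening transform, assuming a set of sample patterns X (rows are patterns) generated here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# assumed correlated 2D patterns, rows = samples
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500)

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)                 # sample covariance Sigma
eigvals, eigvecs = np.linalg.eigh(cov)        # Sigma = Phi Lambda Phi^T

# whitening matrix A_w = Lambda^{-1/2} Phi^T
A_w = np.diag(eigvals ** -0.5) @ eigvecs.T
Y = (X - mu) @ A_w.T                          # y = A_w (x - mu)

print(np.cov(Y, rowvar=False).round(3))       # approximately the identity matrix
```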


1.8.2 Discriminant Functions for the Gaussian Density

Among the Bayesian discriminant functions g_i(x) described above for classification with minimum error, we consider (1.122), which under the hypothesis of multivariate conditional normal density p(x|ω_i) ∼ N(μ_i, Σ_i) is rewritten in the form:

\[ g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\log|\Sigma_i| - \underbrace{\frac{d}{2}\log 2\pi}_{\text{constant}} + \log p(\omega_i) \tag{1.129} \]

having replaced in the discriminant function (1.122), for the class ω_i, its conditional density of multivariate normal distribution p(x|ω_i), given by

\[ p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\,\exp\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\Big] \]

Equation (1.129) is strongly characterized by the covariance matrix Σ_i, for which different assumptions can be made.

1.8.2.1 Assumption: Σ_i = σ²I

With this hypothesis, the features (x_1, x_2, …, x_d) are statistically independent and have the same variance σ² with different means μ_i. The patterns are distributed in the feature space forming hyperspherical clusters of equal size centered in μ_i. In this case, the determinant and the inverse of Σ_i are, respectively, |Σ_i| = σ^{2d} and Σ_i^{-1} = (1/σ²)I (I denotes the identity matrix). Moreover, since the constant term in (1.129) and |Σ_i| are both independent of i, they can be ignored as irrelevant. It follows that a simplified form of the discriminant functions is obtained:

\[ g_i(\mathbf{x}) = -\frac{(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i)}{2\sigma^2} + \log p(\omega_i) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \log p(\omega_i) = -\frac{1}{2\sigma^2}\big[\mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i\big] + \log p(\omega_i) \tag{1.130} \]

From (1.130), it can be seen that the discriminant functions are characterized by the Euclidean distance between the patterns and the mean of each class (‖x − μ_i‖²), by the normalization term given by the variance (2σ²), and by the prior density (offset log p(ω_i)). There is actually no need to compute the distances explicitly. In fact, expanding the quadratic form (x − μ_i)^T(x − μ_i) in (1.130), it is evident that the quadratic term x^T x is identical for all i and can be eliminated. This yields the equivalent linear discriminant functions:

\[ g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0} \tag{1.131} \]

where

\[ \mathbf{w}_i = \frac{1}{\sigma^2}\boldsymbol{\mu}_i \qquad w_{i0} = -\frac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \log p(\omega_i) \tag{1.132} \]


Fig. 1.22 1D geometric representation for two classes in the feature space. If the covariance matrices for the two distributions are equal and with identical a priori density p(ω_i) = p(ω_j), the Bayesian separation surface in the 1D representation is the line passing through the intersection point of the two Gaussians p(x|ω_i). For d > 1, the separation surface is instead a hyperplane of (d − 1) dimensions, with the pattern clusters spherical in d dimensions

The term w_{i0} is called the threshold (or bias) of the i-th class. The Bayesian decision surfaces are hyperplanes defined by the equations:

\[ g_i(\mathbf{x}) = g_j(\mathbf{x}) \iff g_i(\mathbf{x}) - g_j(\mathbf{x}) = (\mathbf{w}_i-\mathbf{w}_j)^T\mathbf{x} + (w_{i0}-w_{j0}) = 0 \tag{1.133} \]

Considering (1.131) and (1.132), the hyperplane equation can be rewritten in the form:

\[ \mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0 \tag{1.134} \]

where

\[ \mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \tag{1.135} \]

\[ \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i+\boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2}\,\log\frac{p(\omega_i)}{p(\omega_j)}\,(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j) \tag{1.136} \]

Equation (1.134) describes the decision hyperplane separating class ω_i from ω_j; it is perpendicular to the line joining the centroids μ_i and μ_j. The point x_0, through which the hyperplane normal to the vector w passes, is determined by the values of p(ω_i) and p(ω_j). A special case occurs when p(ω_i) = p(ω_j) for each class. Figures 1.22 and 1.16 show the distributions in feature space for two classes, respectively, for equal a priori densities p(ω_i) = p(ω_j) and for an appreciable difference with p(ω_i) > p(ω_j). In the first case, from (1.136) we observe that the second addend becomes zero, the class separation point x_0 is at the midpoint between the vectors μ_i and μ_j, and the hyperplane perpendicularly bisects the line joining the two means.


The discriminant function becomes

\[ g_i(\mathbf{x}) = -\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 \tag{1.137} \]

thus obtaining a so-called minimum-distance classifier. Considering that (1.137) computes the Euclidean distance between x and the means μ_i, which in this context represent the prototypes of each class, the discriminant function is typical of a template matching classifier. In the second case, with p(ω_i) ≠ p(ω_j), the point x_0 moves away from the more probable class. As one can guess, if the variance is small (more tightly grouped patterns) compared to the distance between the means ‖μ_i − μ_j‖, the a priori densities will have less influence.
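The following sketch (illustrative only, with assumed means, variance, and priors) implements the linear discriminant of Eqs. (1.131)–(1.132) for the Σ_i = σ²I case; with equal priors it reduces to the minimum-distance classifier of Eq. (1.137).

```python
import numpy as np

def linear_discriminants(x, means, sigma2, priors):
    """g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 (Eqs. 1.131-1.132)."""
    scores = []
    for mu, p in zip(means, priors):
        w = mu / sigma2
        w0 = -mu @ mu / (2 * sigma2) + np.log(p)
        scores.append(w @ x + w0)
    return np.array(scores)

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # assumed class prototypes
priors = [0.5, 0.5]                                    # equal a priori probabilities
x = np.array([1.0, 1.2])

g = linear_discriminants(x, means, sigma2=1.0, priors=priors)
print("assigned class:", int(np.argmax(g)))            # same as argmin_i ||x - mu_i||^2
```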

1.8.2.2 Assumption: Σ_i = Σ (Diagonal)

This case includes identical but arbitrary covariance matrices for each class, and the features (x_1, x_2, …, x_d) are not necessarily independent. The geometric configuration of the pattern clusters forms, for each class, a hyperellipsoid of the same size centered in μ_i. The general discriminant function expressed by (1.129) in this case reduces to the following form:

\[ g_i(\mathbf{x}) = -\underbrace{\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)}_{\text{Mahalanobis distance, Eq. 1.128}} + \log p(\omega_i) \tag{1.138} \]

having eliminated the constant term and the term with |Σ|, both independent of i. If the a priori densities p(ω_i) were identical for all classes, their contribution could be ignored and (1.138) would reduce to the Mahalanobis distance term only. In essence, we have a classifier based on the following decision rule: a pattern x is assigned to the class whose centroid μ_i is at the minimum Mahalanobis distance. If p(ω_i) ≠ p(ω_j), the separation boundary moves in the direction of the lower prior probability. Note that the Mahalanobis distance becomes the Euclidean distance ‖x − μ‖² = (x − μ)^T(x − μ) if Σ = I. Expanding in (1.138) only the expression of the Mahalanobis distance and eliminating the terms independent of i (i.e., the quadratic term x^T Σ^{-1} x), linear discriminant functions are again obtained:

\[ g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0} \tag{1.139} \]

where

\[ \mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \log p(\omega_i) \tag{1.140} \]

The term w_{i0} is called the threshold (or bias) of the i-th class. The linear discriminant functions, also in this case, geometrically represent the separating hypersurfaces of the adjacent classes, defined by the equation:

\[ \mathbf{w}^T(\mathbf{x}-\mathbf{x}_0) = 0 \tag{1.141} \]

where

\[ \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \tag{1.142} \]

\[ \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i+\boldsymbol{\mu}_j) - \frac{\log[p(\omega_i)/p(\omega_j)]}{(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)^T\Sigma^{-1}(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i-\boldsymbol{\mu}_j) \tag{1.143} \]

It is observed that the vector w given by (1.142) is not, in general, in the direction of the vector (μ_i − μ_j). It follows that the separating hyperplane (of the regions ℛ_i and ℛ_j) is not perpendicular to the line joining the two means μ_i and μ_j. As in the previous case, the hyperplane always intersects the line joining the means at x_0, whose position depends on the values of the a priori probabilities. If the latter are different, the hyperplane moves toward the class with lower prior probability.
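A compact sketch of this shared-covariance linear discriminant (Eqs. 1.139–1.140), again with illustrative means, covariance, and priors assumed only for the example:

```python
import numpy as np

def lda_scores(x, means, cov, priors):
    """g_i(x) = mu_i^T Sigma^{-1} x - 0.5 mu_i^T Sigma^{-1} mu_i + log p(w_i)."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for mu, p in zip(means, priors):
        w = cov_inv @ mu                              # Eq. (1.140)
        w0 = -0.5 * mu @ cov_inv @ mu + np.log(p)
        scores.append(w @ x + w0)                     # Eq. (1.139)
    return np.array(scores)

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # assumed class means
cov = np.array([[1.0, 0.3], [0.3, 0.5]])              # shared covariance (assumed)
priors = [0.3, 0.7]

print(int(np.argmax(lda_scores(np.array([1.0, 0.4]), means, cov, priors))))
```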

1.8.2.3 Assumption: Σ_i Arbitrary (Non-diagonal)

In the general case of the multivariate normal distribution, the covariance matrices are different for each class and the features (x_1, x_2, …, x_d) are not necessarily independent. The general discriminant function expressed by (1.129) in this case takes the following form:

\[ g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{1}{2}\log|\Sigma_i| + \log p(\omega_i) \tag{1.144} \]

having only been able to eliminate the constant term (d/2) log 2π, while the other terms all depend on i. The discriminant functions (1.144) are quadratic and can be rewritten as follows:

\[ g_i(\mathbf{x}) = \underbrace{\mathbf{x}^T\mathbf{W}_i\mathbf{x}}_{\text{quadratic in }\mathbf{x}} + \underbrace{\mathbf{w}_i^T\mathbf{x}}_{\text{linear in }\mathbf{x}} + \underbrace{w_{i0}}_{\text{constant}} \tag{1.145} \]

where

\[ \mathbf{W}_i = -\frac{1}{2}\Sigma_i^{-1} \qquad \mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i \tag{1.146} \]

\[ w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\log|\Sigma_i| + \log p(\omega_i) \tag{1.147} \]
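Equations (1.145)–(1.147) translate directly into a small computational sketch of the quadratic discriminant; all numerical values below are illustrative assumptions.

```python
import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 (Eqs. 1.145-1.147)."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ W @ x + w @ x + w0

params = [  # (mean, covariance, prior) per class, assumed for the example
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 0.5),
]
x = np.array([1.5, 0.5])
g = [quadratic_discriminant(x, mu, cov, p) for mu, cov, p in params]
print("assigned class:", int(np.argmax(g)))
```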


Fig. 1.23 One-dimensional representation of decision regions for two classes with normal distribution with different variance. It is observed that the relative decision regions are not connected in this case with equal a priori probabilities


The term w_{i0} is called the threshold (or bias) of the i-th class. The decision surface between two classes is a quadric hypersurface.^16 These decision surfaces may not be connected, even in the one-dimensional case (see Fig. 1.23). Any such hypersurface can be generated by two Gaussian distributions. The separation surfaces become more complex when the number of classes is greater than 2, even with Gaussian distributions. In these cases, it is necessary to identify the pair of classes involved in that particular area of the feature space.

16 In geometry, a quadric surface is defined as a hypersurface of a d-dimensional space over the real (or complex) numbers represented by a second-order polynomial equation. The hypersurface can take various forms: hyperplane, hyperellipsoid, hyperspheroid, hypersphere (special case of the hyperspheroid), hyperparaboloid (elliptic or circular), hyperboloid (one or two sheets).

1.8.2.4 Conclusions

Let us briefly summarize the specific features of the Bayesian classifiers described. Under the hypothesis of Gaussian class distributions, in the most general case the Bayesian classifier is quadratic. If the hypothesized Gaussian classes all have the same covariance matrix, the Bayesian classifier is linear. It is also highlighted that a classifier based on the Mahalanobis distance is optimal in the Bayesian sense if the classes have a normal distribution, equal covariance matrices, and equal a priori probabilities. Finally, a classifier based on the Euclidean distance is optimal in the Bayesian sense if the classes have a normal distribution, equal covariance matrices proportional to the identity matrix, and equal a priori probabilities. Both the Euclidean and Mahalanobis distance classifiers are linear. In various applications, distance-based classifiers (Euclidean or Mahalanobis) are used that make implicit statistical assumptions. Such assumptions, for example those on the normality of the class distributions, are often not satisfied, leading to poor results. The strategy is to verify pragmatically whether these classifiers solve the problem.


In the parametric approach, where assumptions are made about class distributions, it is important to carefully extract the associated parameters through significant training sets and an interactive pre-analysis of sample data (histogram analysis, transformation to principal components, reduction of dimensionality, verification of significant features, number of significant classes, data normalization, noise attenuation, ...).

1.9 Mixtures of Gaussian—MoG

The Gaussian distribution has some limitations in modeling real-world datasets. Very complex probability densities can be modeled with a linear combination of several suitably weighted Gaussians. The applications that make use of these mixtures are manifold, in particular in the context of background modeling for a video streaming sequence in video surveillance applications. The problem we want to solve is, therefore, the following: given a set of patterns P = {x_i}_{i=1}^{n}, we want to find the probability distribution p(x) that generated this set. In the framework of Gaussian mixtures, this generating distribution of the dataset P consists of a set of K Gaussian distributions (see Fig. 1.24), each of them being

\[ p(\mathbf{x};\boldsymbol{\theta}_k) \sim N(\mathbf{x}_i|\boldsymbol{\mu}_k,\Sigma_k) \tag{1.148} \]

with μ_k and Σ_k, respectively, the mean and the covariance matrix of the k-th Gaussian distribution, defined as follows:

\[ N(\mathbf{x}_i|\boldsymbol{\mu}_k,\Sigma_k) = (2\pi)^{-d/2}|\Sigma_k|^{-1/2}\exp\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\Big] \tag{1.149} \]

that is, the probability of generating the observation x using the k-th model is provided by the Gaussian distribution having the parameter vector θ_k = (μ_k, Σ_k) (mean and covariance matrix, respectively).

Fig. 1.24 Example of probability density obtained as a mixture of 3 Gaussian distributions


We introduce in this model a further variable z_i associated with the observation x_i. The variable z_i indicates which of the K models generated the data x_i according to the a priori probability π_k = p(z_i = k). It is easy to think of the variable z_i as a K-dimensional binary vector having a 1 in the element corresponding to the Gaussian component that generated the corresponding data x_i:

\[ \mathbf{z}_i = [\underbrace{0\ 0\ \ldots\ 0\ 1\ 0\ \ldots\ 0\ 0}_{K\ \text{elements}}] \tag{1.150} \]

Since we do not know the corresponding z_i for each x_i,^17 these variables are called hidden variables. Our problem is now reduced to the search for the parameters μ_k and Σ_k of each of the K models and the respective a priori probabilities π_k which, if incorporated in the generative model, give it a high probability of generating the observed distribution of data. The density of the mixture is given by

\[ p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, N(\mathbf{x}|\boldsymbol{\mu}_k,\Sigma_k) \tag{1.151} \]

Each Gaussian density P_{ik} = p(x_i|z_i = k, μ_k, Σ_k) = N(x_i|μ_k, Σ_k) represents a component of the mixture having mean μ_k and covariance Σ_k. The parameters π_k are called the coefficients of the mixture and serve to weigh the corresponding Gaussian component in modeling the generic random variable x. If we integrate both sides of (1.151) with respect to x, observing that both p(x) and the individual Gaussian components are normalized, we obtain

\[ \sum_{k=1}^{K} \pi_k = 1. \tag{1.152} \]

Also note that both p(x) ≥ 0 and N(x|μ_k, Σ_k) ≥ 0, so that π_k ≥ 0 for each k. Combining these conditions with (1.152), we get

\[ 0 \le \pi_k \le 1 \tag{1.153} \]

and so the mixture coefficients are probabilities. We are interested in maximizing the likelihood L(θ) = p(P; θ) of generating the observed data with the parameters of the model θ = {μ_k, Σ_k, π_k}_{k=1}^{K}. This approach is called the ML (Maximum Likelihood) estimate, since it finds the parameters maximizing the likelihood function that generates the data. An alternative approach is given by the EM (Expectation–Maximization) algorithm, where the latter

17 If we knew them, we would group all x_i based on their z_i and model each grouping with a single Gaussian.


calculates θ by maximizing the posterior probability (MAP—Maximum A Posteriori estimate) and presents a simpler mathematical treatment. To sum up, our parameter vector can be estimated in the following two ways:

\[ \boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} P(P|\boldsymbol{\theta}) \qquad \boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta}|P) \tag{1.154} \]
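Before looking at the estimation procedures, a small sketch of Eq. (1.151): evaluating the density of a mixture whose parameters θ = {μ_k, Σ_k, π_k} are assumed known (the numerical values below are arbitrary).

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density of Eq. (1.149)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

def mixture_pdf(x, pis, mus, covs):
    """Mixture density of Eq. (1.151): sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, cov) for pi, mu, cov in zip(pis, mus, covs))

# assumed 3-component mixture in 1D (covariances are 1x1 matrices)
pis = [0.5, 0.3, 0.2]
mus = [np.array([0.0]), np.array([3.0]), np.array([6.0])]
covs = [np.eye(1), 0.5 * np.eye(1), 2.0 * np.eye(1)]

print(mixture_pdf(np.array([2.0]), pis, mus, covs))
```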

1.9.1 Parameters Estimation of the Gaussians Mixture with the Maximum Likelihood—ML

Our goal is, therefore, to find the parameters of the K Gaussian distributions and the coefficients π_k based on the dataset P we have. Calculating the mixture density on the entire dataset of statistically independent measurements, we have

\[ p(X|\boldsymbol{\theta}) = \prod_{i=1}^{n}\sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_i|z_i = k, \boldsymbol{\mu}_k, \Sigma_k) \tag{1.155} \]

To find the parameter vector θ, one way is the maximum likelihood estimate. Taking the logarithm of (1.155), we obtain

\[ L = \sum_{i=1}^{n}\log\sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_i|z_i = k, \boldsymbol{\mu}_k, \Sigma_k) \tag{1.156} \]

which we differentiate with respect to θ = {μ, Σ} and set to zero to find the maximum, that is,

\[ \frac{\partial L}{\partial\boldsymbol{\theta}_k} = \sum_{i=1}^{n}\frac{\pi_k}{\sum_{j=1}^{K}\pi_j P_{ij}}\,\frac{\partial P_{ik}}{\partial\boldsymbol{\theta}_k} = \sum_{i=1}^{n}\underbrace{\frac{\pi_k P_{ik}}{\sum_{j=1}^{K}\pi_j P_{ij}}}_{r_{ik}}\,\frac{\partial\log P_{ik}}{\partial\boldsymbol{\theta}_k} = \sum_{i=1}^{n} r_{ik}\,\frac{\partial\log P_{ik}}{\partial\boldsymbol{\theta}_k} \tag{1.157} \]

in which we used the identity ∂p/∂θ = p × ∂ log p/∂θ and defined r_ik as the responsibility, i.e., the variable representing how likely the i-th point is modeled (or explained) by the k-th Gaussian, namely

\[ r_{ik} = \frac{p(\mathbf{x}_i, z_i=k|\boldsymbol{\mu}_k,\Sigma_k)}{p(\mathbf{x}_i|\boldsymbol{\mu}_k,\Sigma_k)} = p(z_i=k|\mathbf{x}_i,\boldsymbol{\mu}_k,\Sigma_k) \tag{1.158} \]

which are a posteriori probabilities of class membership, with \(\sum_{k=1}^{K} r_{ik} = 1\). Now we have to calculate the derivative with respect to π_k. Take the objective function L and add a Lagrange multiplier λ which enforces the constraint that the prior probabilities must sum to 1:

\[ \tilde{L} = L + \lambda\Big(1 - \sum_{k=1}^{K}\pi_k\Big) \tag{1.159} \]


Therefore, we differentiate with respect to π_k and λ and set the derivatives to zero as before, obtaining

\[ \frac{\partial\tilde{L}}{\partial\pi_k} = \sum_{i=1}^{n}\frac{P_{ik}}{\sum_{j=1}^{K}\pi_j P_{ij}} - \lambda = 0 \;\Longleftrightarrow\; \sum_{i=1}^{n} r_{ik} - \lambda\pi_k = 0 \tag{1.160} \]

\[ \frac{\partial\tilde{L}}{\partial\lambda} = 1 - \sum_{k=1}^{K}\pi_k = 0 \tag{1.161} \]

and considering that \(\sum_{k=1}^{K}\sum_{i=1}^{n} r_{ik} - \lambda\sum_{k}\pi_k = n - \lambda = 0\), we get λ = n, so the prior probability for the k-th class is given by

\[ \pi_k = \frac{1}{n}\sum_{i=1}^{n} r_{ik} \tag{1.162} \]

We now find the mean and covariance from the objective function L as follows:

\[ \frac{\partial L}{\partial\boldsymbol{\mu}_k} = \sum_{i=1}^{n} r_{ik}\,\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k) \tag{1.163} \]

\[ \frac{\partial L}{\partial\Sigma_k^{-1}} = \sum_{i=1}^{n} r_{ik}\big[\Sigma_k - (\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\big] \tag{1.164} \]

Setting (1.163) and (1.164) to zero, we get

\[ \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{n} r_{ik}\,\mathbf{x}_i}{\sum_{i=1}^{n} r_{ik}} \qquad \Sigma_k = \frac{\sum_{i=1}^{n} r_{ik}\,(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T}{\sum_{i=1}^{n} r_{ik}} \tag{1.165} \]

Note that the denominators \(\sum_{i=1}^{n} r_{ik} = n\pi_k\) represent the total number of points assigned to the k-th class. Furthermore, the mean in (1.165) is similar to that obtained with the k-means method, except that the responsibilities r_ik are in this case soft (i.e., 0 ≤ r_ik ≤ 1).

1.9.2 Parameters Estimation of the Gaussians Mixture with Expectation–Maximization—EM

A smart way to find maximum likelihood estimates for models with latent variables is the Expectation–Maximization (EM) algorithm [20]. Instead of finding the maximum likelihood estimate (ML) of the observed data p(P; θ), we will try to maximize the likelihood of the joint distribution of P and Z = {z_i}_{i=1}^{n}, p(P, Z; θ). In this regard, we prefer to maximize the logarithm of the likelihood, that is, l_c(θ) = log p(P, Z; θ),


a quantity known as the complete log-likelihood. Since we cannot observe the values of the random variables z_i, we have to work with the expected values of the quantity l_c(θ) with respect to some distribution Q(Z). The logarithm of the complete likelihood function is defined as follows:

\[
\begin{aligned}
l_c(\boldsymbol{\theta}) &= \log p(P,Z;\boldsymbol{\theta}) = \log\prod_{i=1}^{n} p(\mathbf{x}_i,\mathbf{z}_i;\boldsymbol{\theta}) \\
&= \log\prod_{i=1}^{n}\prod_{k=1}^{K}\big[p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta})\,p(z_{ik}=1)\big]^{z_{ik}} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) + z_{ik}\log\pi_k.
\end{aligned}
\tag{1.166}
\]

Since we have assumed that each of the models is a Gaussian, the quantity p(x_i|z_ik = 1; θ) represents the conditional probability of generating x_i given the k-th model:

\[ p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\Big[-\frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k)\Big] \tag{1.167} \]

Considering the expected value with respect to Q(Z), we have

\[ \langle l_c(\boldsymbol{\theta})\rangle_{Q(Z)} = \sum_{i=1}^{n}\sum_{k=1}^{K} \langle z_{ik}\rangle\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) + \langle z_{ik}\rangle\log\pi_k. \tag{1.168} \]

1.9.2.1 The M Step

The “M” step considers the expected value of the l_c(θ) function defined in (1.168) and maximizes it with respect to the parameters estimated in the other step (the “E”, expectation, step), which are therefore π_k, μ_k, and Σ_k. Differentiating Eq. (1.168) with respect to μ_k and setting the derivative to zero, we get

\[ \frac{\partial\langle l_c(\boldsymbol{\theta})\rangle_{Q(Z)}}{\partial\boldsymbol{\mu}_k} = \sum_{i=1}^{n}\langle z_{ik}\rangle\frac{\partial}{\partial\boldsymbol{\mu}_k}\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) = 0 \tag{1.169} \]

We can now calculate ∂ log p(x_i|z_ik = 1; θ)/∂μ_k using (1.167) as follows:

\[
\begin{aligned}
\frac{\partial}{\partial\boldsymbol{\mu}_k}\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) &= \frac{\partial}{\partial\boldsymbol{\mu}_k}\log\Big\{\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\Big[-\frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k)\Big]\Big\} \\
&= -\frac{1}{2}\frac{\partial}{\partial\boldsymbol{\mu}_k}(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k) = (\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}
\end{aligned}
\tag{1.170}
\]

where the last equality derives from the relation ∂(x^T A x)/∂x = x^T(A + A^T). By replacing the result of (1.170) in (1.169), we get

\[ \sum_{i=1}^{n}\langle z_{ik}\rangle(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1} = 0 \tag{1.171} \]


which gives us the following update equation:

\[ \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle\,\mathbf{x}_i}{\sum_{i=1}^{n}\langle z_{ik}\rangle} \tag{1.172} \]

Let us now calculate the estimate for the covariance matrix. Differentiating Eq. (1.168) with respect to Σ_k^{-1}, we have

\[ \frac{\partial\langle l_c(\boldsymbol{\theta})\rangle_{Q(Z)}}{\partial\Sigma_k^{-1}} = \sum_{i=1}^{n}\langle z_{ik}\rangle\frac{\partial}{\partial\Sigma_k^{-1}}\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) = 0 \tag{1.173} \]

We can calculate ∂ log p(x_i|z_ik = 1; θ)/∂Σ_k^{-1} using (1.167) as follows:

\[
\begin{aligned}
\frac{\partial}{\partial\Sigma_k^{-1}}\log p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta}) &= \frac{\partial}{\partial\Sigma_k^{-1}}\log\Big\{\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\Big[-\frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k)\Big]\Big\} \\
&= \frac{\partial}{\partial\Sigma_k^{-1}}\Big[\frac{1}{2}\log|\Sigma_k^{-1}| - \frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Sigma_k^{-1}(\mathbf{x}_i-\boldsymbol{\mu}_k)\Big] \\
&= \frac{1}{2}\Sigma_k - \frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T
\end{aligned}
\tag{1.174}
\]

where the last equality is obtained from the following relationships:

\[ \frac{\partial}{\partial P}\log|P| = (P^{-1})^T \qquad\text{and}\qquad \frac{\partial}{\partial A}\,\mathbf{x}^T A\,\mathbf{x} = \mathbf{x}\mathbf{x}^T \]

Replacing this result of (1.174) in (1.173), we get

\[ \sum_{i=1}^{n}\langle z_{ik}\rangle\Big[\frac{1}{2}\Sigma_k - \frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\Big] = 0 \tag{1.175} \]

which gives us the update equation for the covariance matrix of the k-th component of the mixture:

\[ \Sigma_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle\,(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T}{\sum_{i=1}^{n}\langle z_{ik}\rangle} \tag{1.176} \]

Now we need to find the update equation of the prior probability π_k for the k-th component of the mixture. This means maximizing the expected value of the log-likelihood ⟨l_c⟩ (Eq. 1.168) subject to the constraint that Σ_k π_k = 1. To do this, we introduce the Lagrange multiplier λ, augmenting (Eq. 1.168) as follows:

\[ L(\boldsymbol{\theta}) = \langle l_c(\boldsymbol{\theta})\rangle_{Q(Z)} - \lambda\Big(\sum_{k=1}^{K}\pi_k - 1\Big) \tag{1.177} \]


By differentiating this expression with respect to π_k, we get

\[ \frac{\partial}{\partial\pi_k}\langle l_c(\boldsymbol{\theta})\rangle_{Q(Z)} - \lambda = 0 \qquad 1 \le k \le K. \tag{1.178} \]

Using (1.168), we have

\[ \frac{1}{\pi_k}\sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda = 0 \quad\text{or equivalently}\quad \sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda\pi_k = 0, \qquad 1 \le k \le K \tag{1.179} \]

Now, summing Eq. (1.179) over all K models, we have

\[ \sum_{k=1}^{K}\sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda\sum_{k=1}^{K}\pi_k = 0 \tag{1.180} \]

and since \(\sum_{k=1}^{K}\pi_k = 1\), we have

\[ \lambda = \sum_{k=1}^{K}\sum_{i=1}^{n}\langle z_{ik}\rangle = n \tag{1.181} \]

Replacing this result in Eq. (1.179), we obtain the following update formula:

\[ \pi_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle}{n} \tag{1.182} \]

which preserves the constraint \(\sum_{k=1}^{K}\pi_k = 1\).

1.9.2.2 The E Step

Now that we have derived the parameter update formulas that maximize the expected value of the complete log-likelihood log p(P, Z; θ), we must also make sure that we are maximizing the incomplete version log p(P; θ) (which is actually the quantity we really want to maximize). As mentioned above, we are guaranteed to maximize the incomplete version of the log-likelihood only when the expected values are taken with respect to the posterior distribution of Z, i.e., p(Z|P; θ). Thus, each of the expected values ⟨z_ik⟩ appearing in the update equations derived in the previous paragraph should be calculated as follows:


\[
\begin{aligned}
\langle z_{ik}\rangle_{p(Z|P;\boldsymbol{\theta})} &= 1\cdot p(z_{ik}=1|\mathbf{x}_i;\boldsymbol{\theta}) + 0\cdot p(z_{ik}=0|\mathbf{x}_i;\boldsymbol{\theta}) = p(z_{ik}=1|\mathbf{x}_i;\boldsymbol{\theta}) \\
&= \frac{p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta})\,p(z_{ik}=1)}{\sum_{j=1}^{K} p(\mathbf{x}_i|z_{ij}=1;\boldsymbol{\theta})\,p(z_{ij}=1)}
= \frac{p(\mathbf{x}_i|z_{ik}=1;\boldsymbol{\theta})\,\pi_k}{\sum_{j=1}^{K} p(\mathbf{x}_i|z_{ij}=1;\boldsymbol{\theta})\,\pi_j}
\end{aligned}
\tag{1.183}
\]

1.9.3 EM Theory

Since it is difficult to analytically maximize the quantity log p(P; θ), we choose instead to maximize the expected complete log-likelihood ⟨log p(P, Z; θ)⟩_{Q(Z)}, using a theorem known as Jensen's inequality. Our goal is, therefore, to maximize ⟨log p(P, Z; θ)⟩_{Q(Z)} with the hope of also maximizing its incomplete version log p(P; θ) (which in fact represents the quantity we are interested in maximizing). Before going into this justification, we introduce Jensen's inequality in the next paragraph.

1.9.3.1 Jensen's Inequality

If f(x) is a convex function (Fig. 1.25) defined on an interval (a, b), then

\[ \forall x_1, x_2 \in (a,b),\ \forall\lambda\in[0,1]:\quad \lambda f(x_1) + (1-\lambda)f(x_2) \ge f[\lambda x_1 + (1-\lambda)x_2] \tag{1.184} \]

If instead f(x) is a concave function, then

\[ \lambda f(x_1) + (1-\lambda)f(x_2) \le f[\lambda x_1 + (1-\lambda)x_2] \tag{1.185} \]

That is, if we evaluate the function at a point between x_1 and x_2, say x* = λx_1 + (1−λ)x_2, then the value f(x*) lies below the chord joining f(x_1) and f(x_2) (in the convex case, and vice versa if concave). We are interested

Fig. 1.25 Representation of a convex function for Jensen’s inequality


in evaluating the logarithm, which is actually a concave function and for which we will consider the latter inequality (1.185). We rewrite log p(P; θ) as follows:

\[ \log p(P;\boldsymbol{\theta}) = \log\int p(P,Z;\boldsymbol{\theta})\,dZ \tag{1.186} \]

Now let us multiply and divide by an arbitrary distribution Q(Z) in order to find a lower bound of log p(P; θ), and use the result of Jensen's inequality to continue from Eq. (1.186):

\[
\begin{aligned}
\log\int p(P,Z;\boldsymbol{\theta})\,dZ &= \log\int Q(Z)\,\frac{p(P,Z;\boldsymbol{\theta})}{Q(Z)}\,dZ \\
&\underset{\text{Jensen}}{\ge} \int Q(Z)\log\frac{p(P,Z;\boldsymbol{\theta})}{Q(Z)}\,dZ \\
&= \underbrace{\int Q(Z)\log p(P,Z;\boldsymbol{\theta})\,dZ}_{\text{expected log-likelihood}}\;\underbrace{-\int Q(Z)\log Q(Z)\,dZ}_{\text{Entropy of }Q(Z)} \\
&= \langle\log p(P,Z;\boldsymbol{\theta})\rangle_{Q(Z)} + H[Q(Z)] = F(Q,\boldsymbol{\theta})
\end{aligned}
\tag{1.187}
\]

We thus obtain the following lower bound of the function log p(P; θ):

\[ \log p(P;\boldsymbol{\theta}) \ge F(Q,\boldsymbol{\theta}) \tag{1.188} \]

Since Q(Z) is an arbitrary distribution, it is independent of θ, and therefore, to maximize the functional F(Q, θ), it is sufficient to maximize ⟨log p(P, Z; θ)⟩_{Q(Z)} (hence the “M” step). Even though we found a lower bound F(Q, θ) for the log p(P; θ) function, this does not imply that at every step an improvement in the maximization of F translates into an improvement of the maximum of log p(P; θ). If, instead, we set Q(Z) = p(Z|P; θ) in (1.187), we can observe that the lower bound becomes tight, that is, an equality, as follows:

\[
\begin{aligned}
\int Q(Z)\log\frac{p(P,Z;\boldsymbol{\theta})}{Q(Z)}\,dZ &= \int p(Z|P;\boldsymbol{\theta})\log\frac{p(P,Z;\boldsymbol{\theta})}{p(Z|P;\boldsymbol{\theta})}\,dZ \\
&= \int p(Z|P;\boldsymbol{\theta})\log\frac{p(Z|P;\boldsymbol{\theta})\,p(P;\boldsymbol{\theta})}{p(Z|P;\boldsymbol{\theta})}\,dZ \\
&= \int p(Z|P;\boldsymbol{\theta})\log p(P;\boldsymbol{\theta})\,dZ \\
&= \log p(P;\boldsymbol{\theta})\int p(Z|P;\boldsymbol{\theta})\,dZ = \log p(P;\boldsymbol{\theta})
\end{aligned}
\tag{1.189}
\]

This means that when we calculate the expected value of the complete log-likelihood ⟨log p(P, Z; θ)⟩_{Q(Z)}, it should be taken with respect to the true posterior probability p(Z|P; θ) of the hidden variables (hence the “E” step).


Assume a model having: observable (or visible) variables x; unobservable (hidden) variables y; and the corresponding vector of parameters θ. Our goal is to maximize the log-likelihood with respect to the variable θ containing the parameters:

\[ L(\boldsymbol{\theta}) = \log p(\mathbf{x}|\boldsymbol{\theta}) = \log\int p(\mathbf{x},\mathbf{y}|\boldsymbol{\theta})\,d\mathbf{y} \tag{1.190} \]

Now let us multiply and divide by an arbitrary distribution q(y) defined on the latent variables y. We can then take advantage of Jensen's inequality, since we have a convex weighted combination, with weights q(y), of a function of the latent variables (the f(y) indicated in (1.191)). Any distribution q(y) on the hidden variables can be used to obtain a lower bound of the log-likelihood function:

\[ L(\boldsymbol{\theta}) = \log\int q(\mathbf{y})\,\frac{p(\mathbf{x},\mathbf{y}|\boldsymbol{\theta})}{q(\mathbf{y})}\,d\mathbf{y} \;\ge\; \int q(\mathbf{y})\log\frac{p(\mathbf{x},\mathbf{y}|\boldsymbol{\theta})}{q(\mathbf{y})}\,d\mathbf{y} = F(q,\boldsymbol{\theta}) \tag{1.191} \]

This lower bound is called Jensen's inequality and derives from the fact that the logarithm function is concave.^18 In the EM algorithm, we alternately optimize F(q, θ) with respect to q(y) and θ. It can be proved that this mode of operation never decreases L(θ). In summary, the EM algorithm alternates between the following two steps:

1. Step E optimizes F(Q, θ) with respect to the distribution of the hidden variables, keeping the parameters fixed:

\[ Q^{(k)}(\mathbf{z}) = \arg\max_{Q(\mathbf{z})} F(Q(\mathbf{z}),\boldsymbol{\theta}^{(k-1)}). \tag{1.192} \]

2. Step M maximizes F(Q, θ) with respect to the parameters, keeping the distribution of the hidden variables fixed:

\[ \boldsymbol{\theta}^{(k)} = \arg\max_{\boldsymbol{\theta}} F(Q^{(k)}(\mathbf{z}),\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}}\int Q^{(k)}(\mathbf{z})\log p(\mathbf{x},\mathbf{z}|\boldsymbol{\theta})\,d\mathbf{z} \tag{1.193} \]

where the second equality derives from the fact that the entropy of q(z) does not depend directly on θ. The intuition at the basis of the EM algorithm can be summarized as follows: Step E finds the values of the hidden variables according to their posterior probabilities; Step M learns the model as if the hidden variables were not hidden. The EM algorithm is very useful in many contexts, since in many models, once the hidden variables are no longer hidden, learning becomes very simple (as in the case of Gaussian

18 The logarithm of the average is greater than the average of the logarithms.


mixtures). Furthermore, the algorithm breaks down the complex learning problem into a sequence of simpler learning problems. The pseudo-code of the EM algorithm for Gaussian mixtures is reported in Algorithm 3.

Algorithm 3 EM algorithm for Gaussian mixtures
 1: Initialize ⟨z_ik⟩, π_k, μ_k and Σ_k for 1 ≤ k ≤ K
 2: repeat
 3:   for i = 1, 2, …, n do
 4:     for k = 1, 2, …, K do
 5:       p(x_i | z_ik = 1; θ) = (2π)^(−d/2) |Σ_k|^(−1/2) exp[ −½ (x_i − μ_k)^T Σ_k^(−1) (x_i − μ_k) ]
 6:       ⟨z_ik⟩ = p(x_i | z_ik = 1; θ) π_k / Σ_{j=1}^{K} p(x_i | z_ij = 1; θ) π_j
 7:     end for
 8:   end for
 9:   for k = 1, 2, …, K do
10:     Σ_k = Σ_{i=1}^{n} ⟨z_ik⟩ (x_i − μ_k)(x_i − μ_k)^T / Σ_{i=1}^{n} ⟨z_ik⟩
11:     μ_k = Σ_{i=1}^{n} ⟨z_ik⟩ x_i / Σ_{i=1}^{n} ⟨z_ik⟩
12:     π_k = Σ_{i=1}^{n} ⟨z_ik⟩ / n
13:   end for
14: until convergence of parameters
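For readers who prefer runnable code, the following NumPy sketch mirrors Algorithm 3 with full covariances; the initialization, number of components, iteration count, and synthetic data are simplistic assumptions made only for illustration.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture, following Algorithm 3 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # crude initialization: random means from the data, identity covariances, uniform priors
    mus = X[rng.choice(n, K, replace=False)]
    covs = np.array([np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities <z_ik> (Eq. 1.183)
        R = np.empty((n, K))
        for k in range(K):
            diff = X - mus[k]
            cov_inv = np.linalg.inv(covs[k])
            norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(covs[k]) ** -0.5
            R[:, k] = pis[k] * norm * np.exp(-0.5 * np.sum(diff @ cov_inv * diff, axis=1))
        R /= R.sum(axis=1, keepdims=True)
        # M step: update parameters (Eqs. 1.172, 1.176, 1.182)
        Nk = R.sum(axis=0)
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / n
    return pis, mus, covs

# usage sketch on synthetic data drawn from two assumed Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 0.5, (200, 2))])
pis, mus, covs = em_gmm(X, K=2)
print(pis.round(2), mus.round(2))
```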

1.9.4 Nonparametric Classifiers

In the preceding paragraphs, we have either explicitly assumed the conditional probability densities p(x|ω_i) and the prior probability density p(x), or we have assumed them known in parametric form (for example, Gaussian) and estimated (for example, with the maximum likelihood (ML) method), with a supervised approach, their characteristic parameters. Moreover, these densities have been considered unimodal, whereas in real applications they are generally multimodal. The extension to Gaussian mixtures is possible, even if it is necessary to determine the number of components and to hope that the algorithm (for example, EM) that estimates the relative parameters converges toward a global optimum. With nonparametric methods, no assumption is made on the knowledge of the density functions of the various classes, which can take arbitrary forms. These methods can be divided into two categories: those based on Density Estimation (DE) and those that explicitly use the features of the patterns, considering the training sets significant for the classification. In Sect. 1.6.5, some of these have been described (for example, the k-Nearest-Neighbor algorithm). In this section, we will describe the simple method based on the Histogram and the more general form of density estimation together with the Parzen Window.

1.9.4.1 Estimation of Probability Density

The simplest way to obtain a nonparametric DE is the histogram. The histogram method partitions the pattern space into several distinct containers (bins) of width Δ_i and approximates the density p_i(x) at the center of each bin with the fraction of the n_i patterns of the training set P = {x_1, …, x_N} that fall into the corresponding i-th bin:

\[ p_i(x) = \frac{n_i}{N\,\Delta_i} \tag{1.194} \]

where the density is constant over the whole width of the bin, which is normally chosen equal for all bins, Δ_i = Δ (see Fig. 1.26). With (1.194), the objective is to model the normalized density p(x) from the N observed patterns P.
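A small NumPy sketch of Eq. (1.194), with a synthetic sample and a bin width chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# assumed 1D training set drawn from a mixture of two Gaussians
P = np.concatenate([rng.normal(0.3, 0.05, 300), rng.normal(0.7, 0.1, 700)])

delta = 0.04                                   # bin width (free parameter)
edges = np.arange(0.0, 1.0 + delta, delta)
counts, _ = np.histogram(P, bins=edges)
density = counts / (len(P) * delta)            # Eq. (1.194): n_i / (N * delta)

print(density.round(2))                        # piecewise-constant estimate of p(x)
```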

Fig. 1.26 The histogram density as a function of the bin width Δ. With large values of Δ, the resulting density is very smooth and the correct form of the distribution p(x), in this case consisting of a mixture of two Gaussians, is lost. For very small values of Δ, we see very pronounced and isolated peaks that do not reproduce the true distribution p(x)


From the figure, it can be observed that the approximation of p(x) is attributable to a mixture of Gaussians. The approximation of p(x) is characterized by Δ. For very large values, the density is too flat and the bimodal configuration of p(x) is lost, while for smaller values of Δ a good approximation of p(x) is obtained, recovering its bimodal structure. The histogram method for estimating p(x), although the estimate is very simple to calculate from the training set as the pattern sequence is observed, has some limitations:

(a) Discontinuity of the estimated density, due to the discontinuity of the bins rather than to an intrinsic property of the density.
(b) Scaling problem of the number of bins with d-dimensional patterns: we would have M^d bins, with M bins per dimension.

Normally the histogram is used for a fast qualitative display (up to 3 dimensions) of the pattern distribution. A general formulation of DE is obtained from probability theory. Consider pattern samples x ∈ R^d with associated density p(x). Let ℛ be a bounded region in the feature domain; the probability P that a pattern x falls in the region ℛ is given by

\[ P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \tag{1.195} \]

P can be considered in some way an approximation of p(x), that is, an average of p(x) over the region ℛ. Equation (1.195) is useful for estimating the density p(x). Now suppose we have a training set P of N independent (i.i.d.) pattern samples x_i, i = 1, …, N, drawn from the distribution p(x). The probability that k of these N samples fall in the region ℛ is given by the binomial distribution:

\[ P(k) = \binom{N}{k} P^k (1-P)^{N-k} \tag{1.196} \]

From the properties of the binomial distribution, the mean and the variance of the ratio k/N (considered as a random variable) are given by

\[ E\Big[\frac{k}{N}\Big] = P \qquad \mathrm{Var}\Big[\frac{k}{N}\Big] = E\Big[\Big(\frac{k}{N}-P\Big)^2\Big] = \frac{P(1-P)}{N} \tag{1.197} \]

As N grows, the distribution becomes more and more peaked, with small variance (Var(k/N) → 0), and we can expect a good estimate of the probability P to be obtained from the observed fraction of samples that fall into ℛ:

\[ P \cong \frac{k}{N} \tag{1.198} \]


At this point, if we assume that the region ℛ is very small and p(x) is continuous and does not vary appreciably within it (i.e., it is approximately constant), we can write

\[ \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \cong p(\mathbf{x})\int_{\mathcal{R}} d\mathbf{x}' = p(\mathbf{x})\,V \tag{1.199} \]

where x is a pattern inside ℛ and V is the volume enclosed by the region ℛ. By virtue of (1.195), (1.198), and the last equation, combining the results we obtain

\[ P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \cong p(\mathbf{x})\,V \quad\text{and}\quad P \cong \frac{k}{N} \;\Longrightarrow\; p(\mathbf{x}) \cong \frac{k/N}{V} \tag{1.200} \]

In essence, this last result assumes that the two approximations are identical. Furthermore, the estimate of the density p(x) becomes more and more accurate as the number of samples N increases and the volume V simultaneously shrinks. Note, however, that this leads to the following contradiction:

1. Reducing the volume implies making the region ℛ sufficiently small (so that the density is approximately constant within it), but with the risk that no samples fall inside it.
2. Alternatively, a sufficiently large ℛ would include enough samples k to produce a pronounced binomial peak (see Fig. 1.27).

If instead the volume is fixed (and consequently ℛ) and we increase the number of samples of the training set, then the ratio k/N converges as desired. But this only produces an estimate of the spatial average of the density:

\[ \frac{P}{V} = \frac{\int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}'}{\int_{\mathcal{R}} d\mathbf{x}'} \tag{1.201} \]

In reality, we cannot have a very small V, considering that the number of samples N is always limited. It follows that we must accept that the density estimate is a spatial average with nonzero variance. Let us now see whether these limitations can be avoided when an unlimited number of samples is available. To evaluate p(x) at x, let us consider a sequence of regions ℛ_1, ℛ_2, … containing the pattern x, with ℛ_1 used with 1 sample, ℛ_2 with 2 samples, and so on. Let V_n be the volume of ℛ_n, let k_n be the number of samples falling in ℛ_n, and let p_n(x) be the n-th estimate of p(x); we have

\[ p_n(\mathbf{x}) = \frac{k_n/N}{V_n} \tag{1.202} \]

Fig. 1.27 Convergence of the probability density estimate. The curves are the binomials associated with the number N of samples. As N grows, the associated binomial has a peak at the correct value P = 0.7. For N → ∞, the curve tends to the delta function

If we want p_n(x) to converge to p(x), the following three conditions are needed:

\[ \lim_{N\to\infty} V_n = 0 \tag{1.203} \]
\[ \lim_{N\to\infty} k_n = \infty \tag{1.204} \]
\[ \lim_{N\to\infty} k_n/N = 0 \tag{1.205} \]

Condition (1.203) ensures that the spatial average P/V converges to p(x). Condition (1.204) essentially ensures that the ratio of frequencies k/N converges to the probability P, with the binomial distribution sufficiently peaked. Condition (1.205) is required for the convergence of p_n(x) given by (1.202). There are two ways to obtain regions that satisfy the three conditions indicated above (Eqs. (1.203), (1.204), and (1.205)):

1. Parzen windows shrink an initial region ℛ_n by specifying the volume V_n as a function of N, for example V_n = 1/√N, and it can be shown that p_n(x) converges to p(x) for N → ∞. In other words, the region, and therefore the volume used to make the estimate, is fixed without directly considering the number of samples included.
2. k_n-nearest neighbors specify k_n as a function of N, for example k_n = √N, and then increase the volume V_n until the associated region ℛ_n contains the k_n samples nearest to x.

Figure 1.28 shows a graphical representation of the two methods. The two sequences represent random variables that generally converge and allow us to estimate the probability density at a given point in the circular region.


Fig. 1.28 Two methods for estimating density, that of Parzen windows (a) and that of knn-nearest neighbors (b). The two sequences represent random variables that generally converge to estimate the probability density at a given point in the circular (or square) region. The Parzen method starts with a large initial value of the region which decreases as n increases, while the knn method specifies a number of samples kn and the region V increases until the predefined samples are included near the point under consideration x

1.9.4.2 Parzen Window

Density estimation with Parzen windows (also known as Kernel Density Estimation—KDE) is based on windows of variable size, assuming that the region ℛ_n that includes k_n samples is a hypercube with side length h_n centered in x. With reference to (1.202), the goal is to determine the number of samples k_n in the hypercube once its volume is fixed. The volume V_n of this hypercube is given by

\[ V_n = h_n^d \tag{1.206} \]

where d indicates the dimensionality of the hypercube. To find the number of samples k_n that fall within the region ℛ_n, the window function ϕ(u) (also called kernel function) is defined as

\[ \varphi(\mathbf{u}) = \begin{cases} 1 & \text{if } |u_j| \le 1/2 \quad \forall j = 1,\ldots,d \\ 0 & \text{otherwise} \end{cases} \tag{1.207} \]

This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window. Note that the following expression has a unit value exactly when the sample falls inside the window:

\[ \varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big) = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ falls into the hypercube of volume } V_n \text{ with side } h_n \text{ centered on } \mathbf{x} \\ 0 & \text{otherwise} \end{cases} \tag{1.208} \]

It follows that the total number of samples inside the hypercube is given by

\[ k_n = \sum_{i=1}^{N}\varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big) \tag{1.209} \]

1 Object Recognition

(a) p(x)

x=0.5

p(0.5)= 1 7

2

p(x)

x=1

p(1)= 1 7

2

p(x)

3

x=2

2

p(x)

3 p(2)= 1 7

3

7

Σ φ( x ) i=1

1 3

5 7

0.53

6

i

7

9

Σ φ( x ) i=1

1 3

5 7

13

6

i

7

5

1 3

23

6

7

i

(b) 10

x

= 1 21

9

Σ φ( x ) i=1

= 1 21

5

0 10

x

= 2 21

hn=0.005 0

5

1

0.5

1

0.5

1

hn=0.08

0 0

9

0.5

10

x

5

hn=0.2

x=3 x=4 x=5 x=6 x=7 x=8 x=9 x=10

3 21 2 21 1 21

0 0 2

3

5

6

7

9

10

x

Fig. 1.29 One-dimensional example of density estimation with Parzen windows. a The training set consists of 7 samples P = {2, 3, 5, 6, 7, 9, 10} and the window has width h_n = 3. The estimate is calculated with (1.210) starting from x = 0.5; subsequently the window is centered on each sample, finally obtaining p(x) as the sum of 7 rectangular functions, each of height 1/(N h_n^d) = 1/(7 · 3¹) = 1/21. b Analogy between the density estimation carried out with the histogram and with the rectangular (hypercubic in the d-dimensional case) Parzen windows, where strong discontinuities of p(x) are observed with very small h_n (as happens for small values of Δ for the bins), while a very smooth shape of p(x) is obtained for large values of h_n, in analogy with the histogram for high values of Δ

By replacing (1.209) in Eq. (1.202), we get the KDE density estimate:

\[ p_n(\mathbf{x}) = \frac{k_n/N}{V_n} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big) \tag{1.210} \]

The kernel function ϕ, in this case called the Parzen window, tells us how to weigh all the samples in ℛ_n to determine the density p_n(x) at a particular point x. The density estimate is obtained as the average of the kernel functions evaluated on x and x_i. In other words, each sample x_i contributes to the estimate of the density in relation to its distance from x (see Fig. 1.29a). It is also observed that the Parzen window is analogous to the histogram, with the exception that the bin locations are determined by the samples (see Fig. 1.29b). Now consider a more general form of the kernel function instead of the hypercube. One can think of the kernel function as an interpolator placed on the various samples x_i of the training set P instead of considering only the position x. This means that the kernel function ϕ must satisfy the conditions of a density function, that is, be nonnegative and integrate to 1:

\[ \varphi(\mathbf{x}) \ge 0 \qquad \int\varphi(\mathbf{u})\,d\mathbf{u} = 1 \tag{1.211} \]


For the hypercube previously considered, with volume V_n = h_n^d, it follows that the density p_n(x) satisfies the conditions indicated by (1.211):

\[ \int p_n(\mathbf{x})\,d\mathbf{x} = \int\frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big)d\mathbf{x} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\underbrace{\int\varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big)d\mathbf{x}}_{\text{hypercube volume }V_n} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}V_n = 1 \tag{1.212} \]

If instead we consider an interpolating kernel function that satisfies the density conditions (1.211), integrating by substitution with u = (x − x_i)/h_n, for which dx = h_n^d du, we get:

\[ \int p_n(\mathbf{x})\,d\mathbf{x} = \int\frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\varphi\Big(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\Big)d\mathbf{x} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\,h_n^d\int\varphi(\mathbf{u})\,d\mathbf{u} = \frac{1}{N}\sum_{i=1}^{N}\int\varphi(\mathbf{u})\,d\mathbf{u} = 1 \tag{1.213} \]

Parzen windows based on the hypercube have several drawbacks. In essence, they produce a very discontinuous density estimate, and the contribution of each sample x_i is not weighted in relation to its distance from the point x at which the estimate is calculated. For this reason, the Parzen window is normally replaced with a kernel function with a smoothing feature. In this way, not only are the samples falling in the window counted, but their contribution is weighted by the interpolating function. With a number of samples N → ∞ and choosing an appropriate window size, it can be shown that the estimate converges toward the true density, p_n(x) → p(x). The most popular choice is the Gaussian kernel:

\[ \varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2} \tag{1.214} \]

considering a one-dimensional density with zero mean and unit variance. The resulting density, according to KDE (1.210), is given by

\[ p_\varphi(x) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{h_n\sqrt{2\pi}}\exp\Big[-\frac{1}{2h_n^2}(x-x_i)^2\Big] \tag{1.215} \]

The Gaussian Parzen window eliminates the discontinuity problem of the rectangular window. The samples of the training set P that are closest to the estimation point x have higher weight, thus obtaining a smoothed density p_ϕ(x) (see Fig. 1.30). It can be observed how the shape of the estimated density is modeled by the Gaussian kernel functions located on the observed samples.
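A minimal sketch of the Gaussian-kernel estimate of Eq. (1.215), using the same seven samples shown in Fig. 1.29 and an arbitrary window width h chosen only for illustration:

```python
import numpy as np

def parzen_gaussian(x, samples, h):
    """KDE with Gaussian kernel, Eq. (1.215)."""
    x = np.atleast_1d(x)[:, None]                  # evaluation points as a column
    k = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)                          # average of the N kernels

P = np.array([2, 3, 5, 6, 7, 9, 10], dtype=float)  # training set of Fig. 1.29
xs = np.linspace(0, 12, 7)
print(parzen_gaussian(xs, P, h=1.0).round(3))      # smoothed estimate of p(x)
```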

Fig. 1.30 Use of Parzen windows with Gaussian kernels centered on each sample of the training set for density estimation. In the one-dimensional example, the form of the density p_ϕ(x) is given by the sum of the 7 Gaussians, each centered on a sample and scaled by a factor 1/7


We will now analyze the influence of the window width (also called the smoothing parameter) h_n on the final density estimate p_ϕ(x). A large value tends to produce a smoother density, altering the structure of the samples; on the contrary, small values of h_n tend to produce a very peaked density function that is complex to interpret. To have a quantitative measure of the influence of h_n, we consider the function δ_n(x) as follows:

\[ \delta_n(\mathbf{x}) = \frac{1}{V_n}\varphi\Big(\frac{\mathbf{x}}{h_n}\Big) = \frac{1}{h_n^d}\varphi\Big(\frac{\mathbf{x}}{h_n}\Big) \tag{1.216} \]

where h_n affects the horizontal scale (width) while the volume h_n^d affects the vertical scale (amplitude). The function δ_n(x) also satisfies the conditions of a density function; in fact, integrating by substitution with u = x/h_n, we get

\[ \int\delta_n(\mathbf{x})\,d\mathbf{x} = \int\frac{1}{h_n^d}\varphi\Big(\frac{\mathbf{x}}{h_n}\Big)d\mathbf{x} = \frac{1}{h_n^d}\int\varphi(\mathbf{u})\,h_n^d\,d\mathbf{u} = \int\varphi(\mathbf{u})\,d\mathbf{u} = 1 \tag{1.217} \]

Rewriting the density as an average value, we have

\[ p_n(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N}\delta_n(\mathbf{x}-\mathbf{x}_i) \tag{1.218} \]

The effect of the parameter h_n (i.e., of the volume V_n) on the function δ_n(x), and consequently on the density p_n(x), is as follows:

(a) When h_n tends toward large values, opposing effects are observed on the function δ_n(x): on one side a reduction of the vertical scale factor (amplitude) and on the other an increase of the horizontal scale factor (width). In this case, we get a poor (very smooth) resolution of the density p_n(x), considering that it will be the sum of N broad functions δ_n centered on the samples (analogous to a convolution process).
(b) When h_n tends toward small values, δ_n(x) becomes very peaked and p_n(x) results in the sum of N peaked pulses, with high resolution and with an estimate


Fig. 1.31 Parzen window estimates associated to a univariate normal density for different values of the parameters h 1 and N , respectively, window width and number of samples. It is observed that for very large N , the influence of the window width is negligible

affected by a lot of statistical variability (especially in the presence of noisy samples). The theory suggests that for an unlimited number N of samples, with the volume V_n tending to zero, the density p_n(x) converges toward the unknown density p(x). In reality, having a limited number of samples, the best that can be done is to find a compromise between the choice of h_n and the limited number of samples. Considering the training set of samples P = (x_1, . . . , x_N) as random variables on which the density p_n(x) depends, for any value of x it can be shown that, if p_n(x) has an estimate of the mean p̂_n(x) and an estimate of the variance σ̂_n²(x), it is proved [18] that

lim_{n→∞} p̂_n(x) = p(x)        lim_{n→∞} σ̂_n²(x) = 0    (1.219)

Let us now consider a training set of i.i.d. samples deriving from a normal distribution p(x) → N(0, 1). If a Parzen Gaussian window given by the (1.214) is used, setting h_n = h_1/√N where h_1 is a free parameter, the resulting estimate of the density is given by the (1.215), which is the average of the normal densities centered in the samples x_i. Figure 1.31 shows the estimate of the true density p(x) = N(0, 1) using a Parzen window with Gaussians as the free parameter h_1 = 1, 0.4, 0.1 varies and the number of samples N = 1, 10, 100, ∞. It is observed that in the approximation


process, starting with a single sample (N = 1), the estimate is just the Gaussian window centered on it and, as h_1 varies, the mean and variance of p_n(x) differ from the true density. As the samples grow to infinity, p_n(x) tends to converge to the true density p(x) of the training set regardless of the free parameter h_1. It should be noted that in the approximation process, for values of h_1 = 0.1, i.e., with a small window width, the contribution of the individual samples remains distinguishable, highlighting their possible noise. From the examples, it can be seen that it is strategic to find the best value of h_n, possibly adapting it to the various training sets.
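To fix ideas, the following is a minimal one-dimensional sketch of the Parzen estimate with a Gaussian kernel; the function name, the choice h_1 = 0.5, and the use of N = 100 samples are illustrative assumptions, not prescriptions of the text.

import numpy as np

def parzen_gaussian_density(x_grid, samples, h):
    # Parzen window density estimate with a Gaussian kernel (1D).
    # x_grid: points where the density is evaluated; samples: x_1,...,x_N; h: window width h_n
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    diffs = (np.asarray(x_grid)[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    # average of the N Gaussians centered on the samples, scaled by 1/(N h)
    return kernels.sum(axis=1) / (n * h)

# Example: estimate of N(0,1) from 100 samples, with h_n = h_1 / sqrt(N)
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100)
h1 = 0.5
p_hat = parzen_gaussian_density(np.linspace(-3, 3, 200), samples, h1 / np.sqrt(samples.size))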

1.9.4.3 Classifier with Parzen Windows A Parzen window-based classifier basically works as follows: (a) The conditional densities p(x|ωi) are estimated for each class (Eq. (1.215)) and the test samples are classified according to the maximum corresponding posterior probability (Eq. (1.70)). Eventually, the a priori probabilities can also be considered, in particular when they are very different. (b) The decision regions for this type of classifier depend greatly on the choice of the kernel function used. A 2D binary classifier based on Parzen windows is characterized by the choice of the width h_n of the window, which influences the decision regions. In the learning phase, one can choose small values of h_n, keeping the classification errors to a minimum but obtaining very complex, high-resolution regions; in the test phase (the final, important phase of the classification) this can lead to problems known as generalization error. For large values of h_n, the classification on the training set is not perfect but simpler decision regions are obtained. This last solution tends to minimize the generalization error in the test phase with new samples. In these cases, a cross-validation19 approach can be used, since there is no robust theory for the choice of the exact width of the window. In conclusion, a Parzen window-based classifier is a good method applicable to samples drawn from any distribution, and in theory it converges to the true density as the number of samples tends to infinity. The negative aspect of this classifier

19 Cross-validation is a statistical technique that can be used in the presence of an acceptable number of observed samples (the training set). In essence, it is a statistical method to validate a predictive model. Given a sample of data, it is divided into subsets, some of which are used for the construction of the model (the training sets) and others to be compared with the predictions of the model (the validation set). By averaging the quality of the predictions over the various validation sets, we have a measure of the accuracy of the predictions. In the context of classification, the training set consists of samples whose class is known in advance, ensuring that this set is significant and complete, i.e., with a sufficient number of representative samples of all classes. For the verification of the recognition method, a validation set is used, also consisting of samples whose class is known, to check the generalization of the results. It consists of a set of samples different from those of the training set.


concerns the limited number of samples available in concrete applications, which makes the choice of h_n difficult. It also requires a high computational complexity: the classification of a single sample requires the evaluation of a function that potentially depends on all the samples, and the number of samples required grows exponentially with the dimension of the feature space.
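As an illustration only, a Parzen-based classifier of the kind described in (a) can be sketched as follows; the function name, the use of a single width h for all classes, and the default equal priors are assumptions made for the example.

import numpy as np

def parzen_classify(x, class_samples, h, priors=None):
    # Assign x to the class with maximum estimated posterior p(x|omega_i) * P(omega_i).
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.size
    scores = []
    for i, S in enumerate(class_samples):          # one array of samples per class
        S = np.asarray(S, dtype=float).reshape(len(S), d)
        u = (x[None, :] - S) / h
        # multivariate Gaussian kernel of width h (Eq. (1.215) with a Gaussian window)
        k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / ((2.0 * np.pi) ** (d / 2) * h ** d)
        p_cond = k.mean()                          # estimate of p(x | omega_i)
        prior = 1.0 / len(class_samples) if priors is None else priors[i]
        scores.append(p_cond * prior)
    return int(np.argmax(scores))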

1.10 Method Based on Neural Networks The Neural Network—NN has seen an explosion of interest over the years, and is successfully applied in an extraordinary range of sectors, such as finance, medicine, engineering, geology, and physics. Neural networks are applicable in practically every situation in which a relationship exists between the predictor (independent, input) variables and the predicted (dependent, output) variables, even when this relationship is very complex and not easy to define in terms of correlation or similarity between the various classes. In particular, they are also used for the classification problem, which aims to determine which of a defined number of classes a given sample belongs to.

1.10.1 Biological Motivation Studies on neural networks are inspired by the attempt to understand the functioning mechanisms of the human brain and to create models that mimic this functionality. This has been possible over the years with the advancement of knowledge in neurophysiology, which has allowed various physicists, physiologists, and mathematicians to create simplified mathematical models, exploited to solve problems with new computational models called neurocomputing. Biological motivation is seen as a source of inspiration for developing neural networks by imitating the functionality of the brain regardless of its actual model of functioning. The nervous system plays the fundamental role of intermediary between the external environment and the sensory organs to guarantee the appropriate responses between external stimuli and internal sensory states. This interaction occurs through the receptors of the sense organs, which, excited by the external environment (light energy, ...), transmit the signals to other nerve cells where they are in turn processed, producing informational patterns useful to the executing organs (effectors, an organ or cell that acts in response to a stimulus). The neuron is a nerve cell, which is the basic functional building block of the nervous system, capable of receiving, processing, storing, and transmitting information. Ramón y Cajal (1911) introduced the idea of neurons as elements of the structure of the human brain. The response times of neurons are 5-6 orders of magnitude slower than the gates of silicon circuits. The propagation of the signals in a silicon chip takes a few nanoseconds (10^-9 s), while the neural activity propagates with times of the order of milliseconds (10^-3 s). However, the human brain is made up of about 100

Fig. 1.32 Biological structure of the neuron: soma (cell body with nucleus), dendrites, axon with myelin sheaths and nodes of Ranvier, and synaptic terminals, with the direction of impulse propagation indicated

billion (10^12) nerve cells, also called neurons, interconnected with each other through up to 1 trillion (10^18) special structures called synapses or connections. The number of synapses per neuron can range from 2,000 to 10,000. In this way, the brain is in fact a massively parallel, efficient, complex, and nonlinear computational structure. Each neuron constitutes an elementary process unit. The computational power of the brain depends above all on the high degree of interconnection of neurons, their hierarchical organization, and the multiple activities of the neurons themselves. This organizational capacity of the neurons constitutes a computational model that is able to solve complex problems such as object recognition, perception, and motor control with a speed considerably higher than that achievable with traditional large-scale computing systems. It is in fact well known how spontaneously a person is able to perform visual recognition, for example of a particular object among many other unknown ones, requiring only a few milliseconds. Since birth, the brain has the ability to acquire information about objects, with the construction of its own rules that, in other words, constitute knowledge and experience. The latter is built up over the years together with the development of the complex neural structure that occurs particularly in the first years after birth. The growth mechanism of the neural structure involves the creation of new connections (synapses) between neurons and the modification of existing synapses. The dynamics of development of the synapses is about 1.8 million per second (from the first 2 months after birth to 2-3 years of age; it is then reduced on average by half in adulthood). The structure of a neuron is schematized in Fig. 1.32. It consists of three main components: the cellular body or soma (the central body of the neuron that includes the genetic heritage and performs cellular functions), the axon (filiform nerve fiber), and the dendrites. A synaptic connection is made by the axons, which constitute the transmission lines of the output electrochemical signals of neurons, whose signal reception structure (input) consists of the dendrites (the name derives from their similarity to a tree structure), which have different ramifications. Therefore, a neuron can be seen as an elementary unit that receives electrochemical impulses from different dendrites; once processed in the soma, several electrochemical impulses are transmitted to other neurons through the axon. The end of the latter branches, forming terminal


fibers from which the signals are transmitted to the dendrites of other neurons. The transmission between the axon and the dendrites of other neurons does not occur through a direct connection: there is a space between the two cells called the synaptic fissure (or cleft), or simply synapse. A synapse is a junction between two neurons. A synapse is configured as a mushroom-shaped protrusion, called the synaptic node or knob, that extends from the axon toward the surface of the dendrite. The space between the synaptic node and the dendritic surface is precisely the synaptic fissure, through which the excited neuron propagates the signal by emitting fluids called neurotransmitters. These come into contact with the dendritic structure (consisting of post-synaptic receptors), causing the exchange of electrically charged ions (entering and leaving the dendritic structure) and thus modifying the electrical charge of the dendrite. In essence, an electrical signal propagates along the axon, a chemical transmission takes place in the synaptic cleft, and then an electrical signal propagates in the dendritic structure. The body of the neuron receiving the signals from its dendrites processes them by adding them together and triggers an excitatory response (increasing the discharge frequency of the signals) or an inhibitory one (decreasing the discharge frequency) in the post-synaptic neuron. Each post-synaptic neuron accumulates signals from other neurons, which add up to determine its excitation level. If the excitation level of the neuron reaches a threshold limit, the neuron itself produces a signal, guaranteeing the further propagation of the information toward other neurons, which repeat the process. During the propagation of each signal, the synaptic permeability, as well as the thresholds, are slightly adapted in relation to the signal intensity; for example, the activation (firing) threshold is lowered if the transfer is frequent, or increased if the neuron has not been stimulated for a long time. This represents the plasticity of the neuron, that is, the ability to adapt to stimulations and stresses by reorganizing the nerve cells. Synaptic plasticity leads to the continuous remodeling of the synapses (removal or addition) and is the basis of the learning of the brain's abilities, in particular during the development period of a living organism (in the early years of a child, new synaptic connections are formed at a rate of about one million per second), during adult life (when plasticity is reduced), and also in the phases of functional recovery after any injuries.

1.10.2 Mathematical Model of the Neural Network The simplified functional scheme of the biological neural network, described in the previous paragraph, is sufficient to formulate a model of an artificial neural network (ANN) from the mathematical point of view. An ANN can be made using electronic components or it can be simulated in software on traditional digital computers. An ANN is also called a neuro-computer, connectionist network, parallel distributed processor (PDP), associative network, etc.


Fig. 1.33 a Mathematical model of the neuron: the inputs x_1, . . . , x_n, weighted by the synaptic weights w_1, . . . , w_n and combined with the bias (threshold θ), are summed into the excitation level ξ, and the activation function σ produces the output y = σ(ξ). b Geometric interpretation: the weight vector w is orthogonal to the separating hyperplane g(x) = 0, with g(x) > 0 and g(x) < 0 on the two sides

2. Piecewise linear:

σ(ξ) = 1 if ξ > 1;   ξ if 0 ≤ ξ ≤ 1;   0 if ξ < 0    (1.224)

3. Standard sigmoid or logistic:

σ(ξ) = 1/(1 + e^{-ξ})    (1.225)

4. Hyperbolic tangent:

σ(ξ) = (1 − e^{-ξ})/(1 + e^{-ξ}) = tanh(ξ/2)    (1.226)

All activation functions assume values between 0 and 1 (except in some cases where the interval can be defined between −1 and 1, as is the case for the hyperbolic tangent activation function). As we shall see later, when we analyze the learning methods of a neural network, these activation functions do not properly model the functionality of a neuron and, above all, of a more complex neural network. In fact, synapses modeled with simple weights are only a rough approximation of the functionality of biological neurons, which form a complex nonlinear dynamic system. In particular, the nonlinear sigmoid and hyperbolic tangent activation functions are inadequate when they saturate around the extremes of the interval, 0 or 1 (or −1 and +1), where the gradient tends to vanish. In these regions, the activation function produces almost no change in the output signal of the neuron, with the consequent blocking of the update of the weights. Other aspects concern the interval of the activation functions with non-zero-centered output, as is the case for the sigmoid activation function, and the slow convergence. The hyperbolic tangent function, although with zero-centered output, also presents the saturation problem. Despite these limitations, sigmoid and hyperbolic tangent functions were frequently used in machine learning applications. In recent years, with the great diffusion of deep learning, new activation functions have become very popular: ReLU, Leaky ReLU, Parametric ReLU, and ELU. These functions are simple and overcome the limitations of the previous ones. The description of these new activation functions is reported in Sect. 2.13 on Deep Learning.
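The classical and the more recent activation functions mentioned above can be summarized in a few lines of code; this is only an illustrative sketch, and the parameter values chosen for alpha are conventional defaults, not values prescribed by the text.

import numpy as np

def logistic(xi):
    return 1.0 / (1.0 + np.exp(-xi))            # standard sigmoid, Eq. (1.225), output in (0, 1)

def hyperbolic_tangent(xi):
    return np.tanh(xi / 2.0)                    # equals (1 - e^-xi)/(1 + e^-xi), Eq. (1.226)

def relu(xi):
    return np.maximum(0.0, xi)                  # no saturation for xi > 0

def leaky_relu(xi, alpha=0.01):
    return np.where(xi > 0, xi, alpha * xi)     # small, nonzero gradient for xi < 0

def elu(xi, alpha=1.0):
    return np.where(xi > 0, xi, alpha * (np.exp(xi) - 1.0))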


1.10.3 Perceptron for Classification The neuron model of McCulloch and Pitts (MP) does not learn; the weights and thresholds are analytically determined, and it operates with binary and discrete input and output signal values. The first successful neuro-computational model was the perceptron devised by Rosenblatt, based on the neuron model defined by MP. The objective of the perceptron is to classify a set of input patterns (stimuli) x = (x_1, x_2, . . . , x_N) into two classes ω_1 and ω_2. In geometric terms, the classification is characterized by the hyperplane that divides the space of the input patterns. This hyperplane is determined by the linear combination of the weights w = (w_1, w_2, . . . , w_N) of the perceptron with the features of the pattern x. According to the (1.222), the hyperplane of separation of the two decision regions is given by

w^T x = 0     or     Σ_{i=1}^N w_i x_i + w_0 = 0    (1.227)

where the vector of the input signals x and that of the weights w are augmented, respectively, with x_0 = 1 and w_0 = −θ to include the bias20 of the excitation level. Figure 1.33b shows the geometrical interpretation of the perceptron used to determine the hyperplane of separation, in this case a straight line, to classify two-dimensional patterns in two classes. In essence, the N input stimuli to the neuron are interpreted as the coordinates of an N-dimensional pattern projected in the Euclidean space, and the synaptic weights w_0, w_1, . . . , w_N, including the bias, are seen as the coefficients of the hyperplane equation, which we denote with g(x). A generic pattern x is classified as follows:

g(x) = Σ_{i=1}^N w_i x_i + w_0    with    g(x) > 0 ⟹ x ∈ ω_1,   g(x) < 0 ⟹ x ∈ ω_2,   g(x) = 0 ⟹ x on the hyperplane    (1.228)

this context, the bias is seen as a constant that makes the perceptron more flexible. It has a function analogous to the constant b of a linear function y = ax + b that representing a line geometrically allows to position the line not necessarily passing from the origin (0, 0). In the context of the perceptron, it allows a more flexible displacement of the line to adapt the prediction with the optimal data.

94

1 Object Recognition

vector form: w T x = w0 x0

(1.229)

where signal and weight vectors do not include bias. The (1.229) in this form is useful in observing that, if w and w0 x0 are constant, this implies that the projection xw of the vector x on w is constant since21 : xw =

w0 x0

w

(1.230)

We also observe (see Fig. 1.33b) that w determines the orientation of the decision plan (1.228) to its orthogonal (in the space of augmented patterns) for the (1.227), while the bias w0 determines the location of the decision surface. The (1.228) also informs us that in 2D space, all the x patterns on the separation line of the two classes have the same projection xw . It follows that, a generic pattern x can be classified, in essence, by evaluating if the excitation level of the neuron is greater or less than the threshold value, as follows:  0 x0 =⇒ x ∈ ω1 > w w

(1.231) xw w0 x0 < w =⇒ x ∈ ω2

21 By

definition, the scalar or inner product between two vectors x and w belonging to a vector space R N is a symmetric bilinear form that associates these vectors to a scalar in the real number field R, indicated in analytic geometry with: = w · x = (w, x) =

N 

wi xi

i=1

In matrix notation, considering the product among matrices, where w and x are seen as matrices N × 1, the formal scalar product is written wT x =

N 

wi xi

i=1

The (convex) angle θ between the two vectors in any Euclidean space is given by θ = arccos

wT x |w||x|

from which a useful geometric interpretation can be derived, namely to find the orthogonal projection of one vector on the other (without calculating the angle θ), for example, considering that xw = |x| cos θ is the length of the orthogonal projection of x over w (or vice versa calculate wx ), this projection is obtained considering that wT x = |w| · |x| cos θ = |w| · xw , from which we have xw =

wT x |w|

1.10 Method Based on Neural Networks Fig. 1.35 Geometrical configuration a of two classes ω1 and ω2 separable linearly and b configuration of nonlinear classes not separable with the perceptron

95

(a)

(b)

Ω

1.10.3.1 Learning with the Perceptron Recall that the objective of the perceptron is to classify a set of patterns by receiving as input stimuli the features N -dimension of a generic vector pattern x and assign it correctly to one of the two classes. To do this, it is necessary to know the hyperplane of separation given by the (1.228) or to know the weight vector w. This can be done in two ways: 1. Direct calculation. Possible in the case of simple problems such as the creation of logic circuits (AND, OR, ...). 2. Calculation of synaptic weights through an iterative process. To emulate biological learning from experience, synaptic weights are adjusted to reduce the error between the output value generated by the appropriately stimulated perceptron and the correct output defined by the pattern samples (training set). The latter is the most interesting aspect if one wants to use the perceptron as a neural model for supervised classification. In this case, the single perceptron is trained (learning phase) offering as input stimuli the features xi of the sample patterns and the values y j of the belonging classes in order to calculate the synaptic weights describing a classifier with the linear separation surface between the two classes. Recall that the single perceptron can classify only linearly separable patterns (see Fig. 1.35a). It is shown that the perceptron performs the learning phase by minimizing a tuning cost function, the current value at time t of the y(t) response of the neuron, and the desired value d(t), adjusting appropriately synaptic weights during the various iterations until converging to the optimal results. Let P = {(x1 , d1 ), (x2 , d2 ), . . . , (x M , d M )} the training set consisting of M pairs of samples pattern xk = (xk0 , xk1 , . . . , xk N ), k = 1, . . . , M (augmented vectors with xi0 = 1) and the corresponding desired classes dk = {0, 1} selected by the expert. Let w = (w0 , w1 , . . . , w N ) be the augmented weight vector. Considering that xi0 = 1, w0 corresponds to the bias that will be learned instead of the constant bias θ . We will denote by w(t) the value of the weight vector at time t during the iterative process of perceptron learning. The convergence algorithm of the perceptron learning phase consists of the following phases:

96

1 Object Recognition

1. I nitially at time t = 0, the weights are initialized with small random values (or w(t) = 0). 2. For each adaptation step t = 1, 2, 3, . . ., a pair is presented (xk , dk ) of training set P. 3. Activation. The perceptron is activated by providing the vector of the features xk . Calculated its current answer yk (t) is compared to the desired output dk and then calculated the error dk − yk (t) as follows: yk (t) = σ [w(t) · xk ] = σ [w0 (t) + w1 (t)xk1 +, · · · , +w N (t)xk N ]

(1.232)

4. Adaptation of synaptic weights: w j (t + 1) = w j (t) + η[dk − yk (t)]xk j

∀j

0≤ j ≤N

(1.233)

where 0 < η ≤ 1 is the parameter that controls the degree of learning (known as learning rate). Note that the expression [dk − yk (t)] in the (1.233) indicates the discrepancy between the actual perceptron response yk (t) calculated for the input pattern xk and the desired associated output dk of this pattern. Essentially the error of the t-th output of the perceptron is determined with respect to the k-th pattern of training. Considering that dk , y(t) ∈ {0, 1} follows that (y(t) − dk ) ∈ {−1, 0, 1}. Therefore, if this error is zero, the relative weights are not changed. Alternatively, this discrepancy can take the value 1 or −1 because only binary values are considered in output. In other words, the perceptron-based classification process is optimized by iteratively adjusting the synaptic weights that minimize the error. The training parameter η very small (η ≈ 0), implies a very limited modification of the current weights that remain almost unchanged with respect to the values reached with the previous adaptations. With high values (η ≈ 1) the synaptic weights are significantly modified resulting in a high influence of the current training pattern together with the error dk − yk (t), as shown in the (1.233). In the latter case, the adaptation process is very fast. Normally if you have a good knowledge of the training set, you tend to use a fast adaptation with high values of η.

1.10.3.2 Comparison Between Statistical Classifier and Perceptron While in the statistical approach, the cost function J is evaluated starting from the statistical information, in the neural approach, it is not necessary to know the statistical information of the pattern vectors. The perceptron with single neuron operates under the conditions that the objects to be classified are linearly separable. When class distributions are Gaussian, the Bayes classifier is reduced to a linear classifier like the perceptron. When the classes are not separable (see Fig. 1.35b), the learning algorithm of the perceptron oscillates continuously for patterns that fall into the overlapping area. The statistical approach attempts to solve problems even when it has to classify patterns belonging to the overlapping zone. The Bayes classifier, assuming

1.10 Method Based on Neural Networks

97

the distribution of the Gaussian classes, controls the possible overlap of the class distributions with the statistical parameters of the covariance matrix. The learning algorithm of the perceptron, not depending on the statistical parameters of the classes, is effective when it has to classify patterns whose features are dependent on nonlinear physical phenomena and the distributions are strongly different from the Gaussian ones as assumed in the statistical approach. The perceptron learning approach is adaptive and very simple to implement, requiring only memory space for synaptic weights and thresholds.

1.10.3.3 Batch Perceptron Algorithm The perceptron described above is based on the adaptation process to find the hyperplane of separation of two classes. We now describe the perceptron convergence algorithm that calculates the synaptic vector w based on the cost function J (w). The approach considers a function that allows the application of the gradient search. Using the previous terminology, the cost function of the perceptron is defined as follows:  (−w T x) (1.234) J(w) = x∈M

where M is the set of patterns x misclassified from the perceptron using the weight vector w [18]. If all the samples were correctly classified, the set M would be empty and consequently the cost function J (w) is zero. The effectiveness of this cost function is due to its differentiation with respect to the weight vector w. In fact, differentiating the J (w) (Eq. 1.234) with respect to w, we get the gradient vector:  ∇J(w) = (−x) (1.235) x∈M

where the gradient operator results ∇=

- ∂ ∂ ∂ .T , ,..., ∂w1 ∂w2 ∂w N

(1.236)

An analytical solution to the problem would be in solving with respect to w the equation ∇J(w) = 0 but difficult to implement. To minimize the cost function, J (w) is used instead the iterative algorithm called gradient descent. In essence, this approach seeks the minimum point of J (w) that corresponds to the point where the gradient becomes zero.22

22 It should be noted that the optimization approach based on the gradient descent guarantees to find the local minimum of a function. It can also be used to search for a global minimum, randomly choosing a new starting point once a local minimum has been found, and repeating the operation many times. In general, if the number of minimums of the function is limited and the number of attempts is very high, there is a good chance of converging toward the global minimum.

98 Fig. 1.36 Perceptron learning model based on the gradient descent

1 Object Recognition

J(w)

Δ Δ Δ - J(wt)=0 wt

w

This purpose is achieved through an iterative process starting from an initial configuration of the weight vector w, after which it is modified appropriately in the direction in which the gradient decreases more quickly (see Fig. 1.36). In fact, if at the iteration t-th we have a value of the weight vector w(t) (at the first iteration the choice in the domain of w can be random), and we calculate with the (1.235) for the vector w the gradient ∇J(w(t)), then we update the weight vector by moving a small distance in the direction in which J decreases more quickly, i.e., in the opposite direction of the gradient vector (−∇J(w(t))), we have the following update rule of weights: w(t + 1) = w(t) − η∇J(w(t)) (1.237) where 0 < η ≤ 1 is still the parameter that controls the degree of learning (learning rate) that defines the entity of the modification of the weight vector. We recall the criticality highlighted earlier in the choice of η for convergence. The perceptron batch update rule based on the gradient descent, considering the (1.235), has the following form:  x (1.238) w(t + 1) = w(t) + η x∈M

The denomination batch rule is derived from the fact that the adaptation of the weights to the t-th iteration occurs with the sum, weighted with η, of all the samples pattern M misclassified. From the geometrical point of view, this perceptron rule represents the sum of the algebraic distances between the hyperplane given by the weight vector w and the sample patterns M of the training set for which one has a classification error. From the (1.238), we can derive the adaptation rule (also called on-line) of the perceptron based on a single sample misclassified xM given by w(t + 1) = w(t) + η · xM

(1.239)

1.10 Method Based on Neural Networks

(a)

99

(b)

w(t+1)=w(t)+ηxM

(c)

w(t+1)

w(t+1)

xM w(t)

ηxM

xM

w(t)

xt

w(t)

xM

xt

Fig. 1.37 Geometric interpretation of the perceptron learning model based on the gradient descent. a In the example, a pattern xM misclassified from the current weight w(t) is on the wrong side of the dividing line (that is, w(t)T xM < 0), with the addition to the current vector of ηxM , the weight vector moves the decision line in the appropriate direction to have a correct classification of the pattern xM . b Effect of the learning parameter η; large values, after adapting to the new weight w(t + 1), can misclassify a previous pattern, denoted by x(t), which was instead correctly classified. c Conversely, small values of η are likely to still leave misclassified xM

From Fig. 1.37a, we can observe the effect of weight adaptation considering the single sample xM misclassified by the weight vector w(t) as it turns out w(t)T xM ≤ 0.23 Applying the rule (1.239) to the weight vector w(t), that is, adding to this η·xM , we obtain the displacement of the decision hyperplane (remember that w is perpendicular to it) in the correct direction with respect to the misclassified pattern. It is useful to recall in this context the role of η. If it is too large (see Fig. 1.37b), a previous pattern xt correctly classified with w(t), after adaptation to the weight w(t), would now be classified incorrectly. If it is too small (see Fig. 1.37c), the pattern xM after the adaptation to the weight w(t + 1) would still not be classified correctly. Applying the rule to the single sample, once the training set samples are augmented and normalized, these are processed individually in sequence and if at the iteration t for the j-th sample we have w T x j < 0, i.e., the sample is misclassified we perform the adaptation of the weight vector with the (1.239), otherwise we leave the weight vector unchanged and go to the next (j +1)-th sample. For a constant value of η predefined, if the classes are linearly separable, the perceptron converges to a correct solution both with the batch rule (1.238) and with the single sample (1.239). The iterative process of adaptation can be blocked to

this context, the pattern vectors of the training set P , besides being augmented (x0 = 1), are also nor mali zed, that is, all the patterns belonging to the class ω2 are placed with their negative vector: ∀ x j ∈ ω2 x j = −x j

23 In

It follows that a sample is classified incorrectly if: wT x j =

N  k=0

wk j xk j < 0.

100

1 Object Recognition

limit processing times or for infinite oscillations in the case of nonlinearly separable classes. The arrest can take place by predicting a maximum number of iterations or by imposing a minimum threshold to the cost function J(w(t)), while being aware of not being sure of the quality of the generalization. In addition, with reference to the Robbins–Monro algorithm [22], convergence can be analyzed by imposing an adaptation also for the learning rate η(t) starting from an initial value and then decreasing over time in relation a η(t) = η0 /t, where η0 is a constant and t the current iteration. This type of classifier can also be tested for nonlinearly separable classes although convergence toward an optimal solution is not ensured with the procedure of adaptation of the weights that oscillates in an attempt to minimize the error despite using the trick to also update the parameter of learning.

1.10.4 Linear Discriminant Functions and Learning In this section, we will describe some classifiers based on discriminating functions whose characteristic parameters are directly estimated by the training sets of the observed patterns. In the Sects. 1.6.1 and 1.6.2, we have already introduced similar linear and generalized discriminant functions. The discriminating functions will be derived in a nonparametric way, i.e., no assumption will be made on the knowledge of density both in the analytic and parametric form as done in the Bayesian context (see Sect. 1.8). In essence, we will continue the approach, previously used for the per ceptr on, based on minimizing a cost function but using alternative criteria to that of the perceptron: Minimum squared error (MSE), Widrow–Hoff gradient descent, Ho–Kashyap Method. With these approaches, particular attention is given to convergence both in computational terms and to the different ways of minimizing the cost function based on the gradient-descent.

1.10.4.1 Algorithm of Minimum Square Error—MSE The perceptron rule finds the weight vector w and classify the patterns xi satisfying the inequality w T xi > 0 and considering only the misclassified samples that do not satisfy this inequality to update the weights and converge toward the minimum error. The MSE algorithm instead tries to find a solution to the following set of equations: w T xi = bi

(1.240)

where xi = (xi0 , xi1 , . . . , xid ), i = 1, . . . , N are the augmented vectors of the training set (TS), w = (w0 , w1 , . . . , wd ) is the augmented weight vector to be found as a solution to the system of linear equations (1.240), and bi , i = 1, . . . , N are the arbitrary specified positive constants (also called margins). In essence, we have converted the problem of finding the solution to a set of linear inequalities with the more classical problem of finding the solution to a system of

1.10 Method Based on Neural Networks

101

Fig. 1.38 Geometric interpretation of the MSE algorithm: calculates the normalized distance with respect to the weight vector of the pattern vectors from the class separation hyperplane

g( x

)= 0

g(x) wT xj /||w|| xj xk

w

linear equations. Moreover, with MSE algorithm, all TS patterns are considered simultaneously, not just misclassified patterns. From the geometrical point of view, the MSE algorithm with w T xi = bi proposes to calculate for each sample xi the distance bi from the hyperplane, normalized with respect to |w| (see Fig. 1.38). The matrix compact form of the (1.240) is given by ⎛ ⎞⎛ ⎞ ⎛ ⎞ b1 x10 x11 · · · x1N w0 ⎜ x20 x21 · · · x2N ⎟ ⎜ w1 ⎟ ⎜ b2 ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟ (1.241) ⎜ .. .. . . . ⎟ ⎜ . ⎟ = ⎜ . ⎟ ⇐⇒ Xw = b ⎝ . . .. ⎠ ⎝ .. ⎠ ⎝ .. ⎠ . '

xN0 xN1 · · · xN N wN () * ' () * N ×(d+1)

(d+1)×1

bN ' () * N ×1

The goal is now to solve the system of linear equations (1.241). If the number of equations N is equal to the number of unknowns, i.e., the number of augmented features d + 1, we have the exact formal solution: w = X−1 b

(1.242)

where it needs X to be non-singular. Normally, X results in a rectangular array with many more rows (samples) than the columns. It follows that when the number of equations exceeds the number of unknowns (N  d + 1), the unknown vector w is overdetermined, and an exact solution cannot be found. We can, however, look for a weight vector w which minimizes some error function ε between the model Xw and the desired vector b: ε = Xw − b

(1.243)

One approach is to try to minimize the module of the error vector, but this corresponds to minimizing the sum function of the squared error: N  (w T xi − bi )2 JMSE (w) = Xw − b = 2

(1.244)

i=1

The minimization of the (1.244) can be solved by analytically calculating the gradient and setting it to zero, differently from what is done with the perceptron. From the

102

1 Object Recognition

gradient calculation, we get  d dJMSE = (w T xi − bi )2 dw dw N

∇ JMSE (w) =

i=1

=

N 

2(w T xi − bi )

i=1

=

N 

d (w T xi − bi ) dw

(1.245)

2(w T xi − bi )xi

i=1

= 2XT (Xw − b) from which setting equal to zero, we have 2XT (Xw − b) = 0

=⇒

XT Xw = XT b

(1.246)

In this way instead of solving the system Xw = b, we set it to solve the Eq. (1.246) with the advantage that XT X is a square matrix of (d + 1) × (d + 1) and is often non-singular. Under these conditions, we can solve the (1.246) only with respect to w obtaining the solution sought MSE: w = (XT X)−1 XT b = X† b where

(1.247)

X† = (XT X)−1 XT

is known as the pseudo-inverse matrix of X. We observe the following property: X† X = ((XT X)−1 XT )X = (XT X)−1 (XT X) = I where I is the identity matrix and the matrix X† is inverse on the left (the inverse on the right in general is XX† = I). Furthermore, it is observed that if X is square and nonsingular, the pseudo-inverse coincides with the normal inverse matrix X† = X−1 . Like all regression problems, the solution can be conditioned by the uncertainty of the initial data which then propagates on the error committed on the final result. If the training data are very correlated, the XT X matrix could become almost singular and therefore not admit its inverse preventing the use of the (1.247). This type of ill-conditioning can be approached with the linear regularization method, also known as ridge regression. The ridge estimator is defined in this way [23]: (1.248) wλ = ((XT X) + λId)−1 XT b where λ (0 < λ < 1) is a nonnegative constant called shrinkage parameter that controls the contraction level of the identity matrix. The choice of λ is made on the basis of the correlation level of the features, i.e., the existing multicollinearity, trying

1.10 Method Based on Neural Networks

103

to guarantee an appropriate balance between the variance and the distortion of the estimator. For λ = 0, the ridge regression (1.248) coincides with the pseudo-inverse solution. Normally, the proper choice of λ is found through a cross-validation approach. A graphical exploration that represents the components of w in relation to the values of λ is useful when analyzing the curves (traces of the ridge regressions) that tend to stabilize for acceptable values of λ. The MSE solution also depends on the initial value of the margin vector b which conditions the expected results of w∗ . The arbitrary choice of positive values of b can give an MSE solution with a discriminant function that separates linearly separable (even if not guaranteed) and non-separable classes. For b = 1, the MSE solution becomes identical to the Fischer linear discriminant solution. If the number of samples tends to infinity, the MSE solution approximates the discriminating function of Bayes g(x) = p(ω1 /x) − p(ω2 /x).

1.10.4.2 Widrow–Hoff Learning The sum function of the squared error JMSE (w), Eq. (1.244), can be minimized using the gradient descent procedure. On the one hand, there is the advantage of avoiding the singularity conditions of the matrix XT X and on the other, we avoid working with large matrices. Assuming an arbitrary initial value of w(1) and considering the gradient equation (1.245), the weight vector update rule results w(t + 1) = w(t) − η(t)XT (Xw(t) − b)

(1.249)

It is shown that if η(t) = η(1)/t, with arbitrary positive value of η(1), this rule generates the weight vector w(t) which converges to the solution MSE with weightw such that XT (Xw − b) = 0. Although the memory required for this update rule has been reduced considering the dimensions (d + 1) × (d + 1) of the XT X matrix with respect to the matrix X† of (d + 1) × N , using the Widrow–Hoff procedure (or Least Mean Squared rule-LMS) has a further memory reduction, considering single samples sequentially: w(t + 1) = w(t) + η(t)[bk − w(t)T xk ]xk

(1.250)

The pseudo-code of the Widrow–Hoff procedure is reported in Algorithm 4. With the perceptron rule, convergence is always guaranteed, only if the classes are linearly separable. The MSE method guarantees convergence but may not find the separation hyperplane if the classes are linearly separable (see Fig. 1.39), because it only works in minimizing squares of the distances of samples from the hyperplane associated with margin b.

104

1 Object Recognition

Algorithm 4 Widrow–Hoff algorithm 1: 2: 3: 4: 5: 6:

Initialization: w, b, threshold θ = 0.02, t ← 0, η(t) = 0.1 do t ← (t + 1) mod N w ← w + η(t)[bt − w T xt ]xt until |η(t)(bt − w T xt )xt | < θ Final results w end

Fig. 1.39 The MSE algorithm minimizes the sum of the squares of the sample pattern distances with respect to the class separation hyperplane and may not converge even if this exists as shown in the example which instead is always found by the perceptron for linearly separable classes

ω

LMS Perceptron

1.10.4.3 Ho–Kashyap Algorithm The main limitation of the MSE method is given by the non-guarantee of finding the hyperplane of separation in the case of linearly separable classes. In fact, with the MSE method, we have imposed the minimization Xw − b 2 choosing the arbitrary and constant margin vector b. Whether or not MSE converges to find the hyperplane of separation depends precisely on how the margin vector is chosen. In the hypothesis that the two classes are linearly separable, there must exist two vectors w∗ and b∗ such that Xw∗ = b∗ > 0, where we can assume that the samples are normalized (i.e., x ← (−x) ∀x ∈ ω2 ). If we arbitrarily choose b, in the MSE method, we would have no guarantee of finding the optimal solution w∗ . Therefore, if b were known, the MSE method would be used with the pseudoinverse matrix for the calculation of the weight vector w = X† b. Since b∗ is not known, the strategy is to find both w and b. This is feasible using an alternative learning algorithm for linear discriminant functions, known as the Ho–Kashyap method which is based on the JMSE functional to be minimized with respect to w and b, given by (1.251) JMSE (w, b) = Xw − b 2 In essence, this algorithm is implemented in three steps: 1. Find the optimal value of b through the gradient descent. 2. Calculate the weight vector w with the MSE solution. 3. Repeat the previous steps until convergence.

1.10 Method Based on Neural Networks

105

To make the first step, the gradient ∇b JMSE is calculated of the functional MSE (1.244) with respect to the margin vector b, given by ∇b JMSE (w, b) = −2(Xw − b)

(1.252)

which suggests a possible update rule for b. Since b is subject to the constraint b > 0, we start from this condition and following the gradient descent, we prevent from reducing any component of the vector b to negative values. In other words, the gradient descent does not move in any direction but is always forced in a learner’s way to move in the direction that b remains positive. This is achieved through the following rule of adaptation of the margin vector (t is the iteration index): b(t + 1) = b(t) − η(t)∇b JMSE (w, b) = b(t) + 2η(t)(Xw − b)

(1.253)

and setting to zero all the positive components of ∇b JMSE or equivalently, keeping the positive components in the second term of the last expression. Choosing the first option, the adaptation rule (1.253) for b is given by b(t + 1) = b(t) − η(t)

. 1 ∇b JMSE (w(t), b(t)) − ∇b JMSE (w(t), b(t)) (1.254) 2

where | • | indicates a vector to which we apply the absolute value to all of its components. Remember that η indicates the learning parameter. Summing up, the equations used for the Ho–Kashyap algorithm are (1.252), for the calculation of the gradient ∇b JMSE (w, b), the (1.254) that is the adaptation rule to find the margin vector b fixed the weight vector w, and the (1.247) to minimize the gradient ∇w JMSE (w, b) with respect to the weight vector w that we rewrite w(t) = X† b(t)

(1.255)

where X† = (XT X)−1 XT is the pseudo-inverse matrix of X. Remember that with the solution MSE, the gradient with respect to the weight vector is zeroed, that is     ∇a JMSE = 2XT (Xw − b) = 2XT X X† b − b = 0 At this point, we can get the Ho–Kashyap algorithm with the following adaptation equations for the iterative calculation of both margin and weight vectors: b(t + 1) = b(t) + 2ηε + (t)

;

w(t) = X† b(t) t = 1, 2, . . .

adaptation equations of Ho–Kashyap

(1.256)

where ε+ indicates the positive part of the error vector: ε + (t) =

. 1ε(t) + |ε(t)| 2

(1.257)

106

1 Object Recognition

and the error vector remembering the (1.252) results   1 ε(t) = Xw(t) − b(t) = − ∇b JMSE w(t), b(t) 2

(1.258)

The complete algorithm of Ho–Kashyap is reported in Algorithm 5. Algorithm 5 Ho–Kashyap algorithm 1: Initialization: w, b > 0, 0 < η(.) < 1, t = 0, thresholds bmin , tmax 2: do t ← (t + 1) mod N 3: ε ← Xw − b 4: ε+ ← 1/2(ε + abs(ε)) 5: b ← a + 2η(t)ε+ 6: w ← X† b(t) 7: i f abs(ε) ≤ bmin then w and b are the solution found exit 8: until t = tmax 9: no solution reached 10: end

If the two classes are linearly separable, the Ho–Kashyap algorithm always produces a solution reaching the condition ε(t) = 0 and freezes (otherwise it continues the iteration if some components of the error vector are positive). In the case of nonseparable classes, it occurs that ε(t) will have only negative components proving the condition of non-separable classes. It is not possible to know after how many iterations this condition of non-separability is encountered. The pseudo-inverse matrix is calculated only once depending only on the samples of the training set. Considering the high number of iterations required to limit the computational load, the algorithm can be terminated by defining a maximum number of iterations or by setting a minimum threshold for the error vector.

1.10.4.4 MSE Algorithm Extension for Multiple Classes A multiclass classifier based on linear discriminant functions is known as linear machine. Data K classes to be separated K linear discriminant functions (LDF) are required: (1.259) gk (x) = wkT x + wk0 k = 1, . . . , K A x pattern is assigned to the class ωk if gk (x) ≥ g j (x)

∀ j = k

(1.260)

With the (1.260), the feature space is partitioned into K regions (see Fig. 1.40). The discriminant function k-th with the largest value gk (x) assigns the pattern x under consideration to the region k . In the case of equality, we can consider the unclassified pattern (it is considered to be on the separation hyperplane).

1.10 Method Based on Neural Networks

(a)

107

(b)

?

?

(c)

(d)

Fig. 1.40 Classifier based on the MSE algorithm in the multiclass context. a and b show the ambiguous regions when used binary classifier to separate the 3 classes. c and d instead show the correct classification of a multiclasse MSE classifier that uses a number of discriminant functions gk (x) up to the maximum number of classes

A multiclass classifier based on linear discriminated functions can be realized in different ways considering also the binary classifiers described above (perceptron, ...). One approach would be to use K − 1 discriminant functions, each of which separates a ωk class from all remaining classes (see Fig. 1.40a). This approach has ambiguous (undefined) areas in the feature space. A second approach uses K (K −1)/2 discriminant functions g jk (x) each of which separates two classes ωk , ω j with respect to the others. Also in this case, we would have ambiguous (undefined) areas in the feature space (see Fig. 1.40b). These problems are avoidable by defining K LDF functions given by the (1.259), i.e., with the linear machine approach (see Fig. 1.40c). If k and  j are contiguous regions, their decision boundary is represented by a portion of the hyperplane H jk defined as follows: g j (x) = gk (x)

−→

(w j − wk )T x + (w j0 − wk0 ) = 0

(1.261)

108

1 Object Recognition

From the (1.261), it follows that the difference of the vectors weight w j − wk is normal to the hyperplane H jk and that the distance of a pattern w from H jk is given by (w j − wk )/ w j − wk

It follows that with the linear machine, the difference of the vectors is important and not the vectors themselves. Furthermore not all K (K − 1)/2 region pairs must be contiguous and a lower number of separation hyperplanes may be required (see Fig. 1.40d). A multiclass classifier can be implemented as a direct extension of the MSE approach used for two classes based on the pseudo-inverse matrix. In this case, the matrix N × (d + 1) of the training set X = {X1 , . . . , X K } can be organized partitioning the lines so that it contains the patterns ordered by the K classes, that is, all the samples associated to a class ωk are contained in the submatrix Xk . Likewise the weight matrix is constructed W = [w1 , w2 , . . . , w K ] of size (d + 1) × K . Finally, the margin matrix B = [B1 , B2 , . . . , B N ] of size N × K partitioned in submatrix B j (like X) whose elements are zero except those in the j-th column that are set to 1. In essence, the problem is set as K MSE solutions in the generalized form: XW = B (1.262) The objective function is given by J (A) =

K 

Xwi − bi 2

(1.263)

i=1

where J (A) is minimized using the pseudo-inverse matrix:  −1 XB W = X† B = X T X

(1.264)

1.10.4.5 Summary A binary classifier based on the perceptron always finds the hyperplane of separation of the two classes only if these are linearly separable otherwise it oscillates without ever converging. Convergence can be controlled by adequately updating the learning parameter but there is no guarantee on the convergence point. A binary classifier that uses the MSE method converges for classes that can be separated and cannot be separated linearly, but in some cases, it may not find the hyperplane of separation for separable linear classes. The solution with the pseudo-inverse matrix is used if the sample matrix XT X is non-singular and not too large. Alternatively, the Widrow– Hoff algorithm can be used. In other paragraphs, we will describe how to develop a multiclass classifier based on multilayer perceptrons able to classify nonlinearly separable patterns.

1.11 Neural Networks

109


110

1 Object Recognition

Input Layer

Output Layer

Target

Hidden Layer 1/2(z

1/2(z

tk xi

wji

zk yj wkj netj

zK yNh

1

bias

function

tK

xd

bias

1/2(zk-tk

netk

J(w)

Σ Objective

1/2(zK-tK

1

Fig. 1.41 Notations and symbols used to represent a MLP neural network with three layers and its extension for the calculation of the objective function

by the various applications. In theory, the solution of finding suitable nonlinear functions could be a solution but we know how complex the appropriate choice of such functions is. A multilayered neural network that has the ability to learn from a training set regardless of the linearity or nonlinearity of the data can be the solution to the problems. A neural network created with MultiLayer Perceptrons—MLP is a feedforward network of simple process units with at least one hidden layer of neurons. The process units are similar to the perceptron, except for the threshold function placed by a nonlinear differentiable which guarantees the calculation of the gradient. The feedforward network (acyclic) defines the type of architecture that defines the way in which neurons are organized by layer s and how neurons are connected from one layer to another. In particular, all the neurons of a layer are connected with all the neurons of the next layer, that is, with only forward connections (feedforward). Figure 1.41 shows a 3-layer feedforward MLP network: d-dimension input layer (input neurons), intermediate layer of hidden neurons; and the layer of output neurons. The input and output neurons, respectively, represent the r eceptor s and the e f f ector s and with their connections the channels through which the respective signals and information are propagated are in fact created. In the mathematical model of the neural network, these channels are known as paths. The propagation of signals and the process of information through these paths of a neural network is achieved by modifying the state of neurons along these paths. The states of all neurons realize the overall state of the neural network and the synaptic weights associated with all connections give the neural network configuration. Each path in a f eed f or war d MLP

1.11 Neural Networks

111

network leads from the input layer to the output layer through individual neurons contained in each layer. The ability of an NN-Neural Network to process information depends on the inter-connectivity and the states of neurons that change and the synaptic weights that are updated through an adaptation process that represents the learning activity of the network starting from the samples of the training set. This last aspect, i.e., the network update mode is controlled by the equations or rules that determine the dynamics and functionality over time of the NN. The computational dynamics specifies the initial state of an NN and the update rule over time, once the configuration and topology of the network itself has been defined. An NN f eed f or war d is characterized by a time-independent data flow (static system) where the output of each neuron depends only on the current input in the manner specified by the activation function. The adaptation dynamic specifies the initial configuration of the network and the method of updating weights over time. Normally, the initial state of synaptic weights is assigned with random values. The goal of the adaptation is to achieve a network configuration such that the synaptic weights realize the desired function from the input data (training pattern) provided. This type of adaptation is called supervised learning. In other words, it is the expert that provides the network with the input samples and the desired output values and with the learning occurs how much the network response agrees with the desired target value known a priori. A supervised feedforward NN is normally used as a function approximator. This is done with different learning models, for example, the backpr opagation which we will describe later.

1.11.2 Multilayer Neural Network for Classification

Let us now see in detail how a supervised MLP network can be used for the classification in K classes of d-dimensional patterns. With reference to Fig. 1.41, we describe the various components of an MLP network following the flow of data from the input layer to the output layer.

(a) Input layer. With supervised learning, each sample pattern x = (x_1, ..., x_d) is presented to the network input layer.

(b) Intermediate layer. The j-th neuron of the middle (hidden) layer calculates the activation value net_j obtained from the inner product between the input vector x and the vector of synaptic weights coming from the first layer of the network:

net_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}     (1.265)

where the pattern and weight vectors are augmented to include the fictitious input component x_0 = 1.


(c) Activation function for hidden neurons. The j-th neuron of the intermediate layer emits an output signal y_j through the nonlinear activation function σ, given by

y_j = σ(net_j) = { 1 if net_j ≥ 0;  −1 if net_j < 0 }     (1.266)

(d) Output layer. Each output neuron k calculates the activation value net_k obtained with the inner product between the vector y (the output of the hidden neurons) and the vector w_k of the synaptic weights from the intermediate layer:

net_k = \sum_{j=1}^{N_h} w_{kj} y_j + w_{k0}     (1.267)

where N_h is the number of neurons in the intermediate layer. In this case, the weight vector w_k is augmented by considering a neuron bias which produces a constant output y_0 = 1.

(e) Activation function for output neurons. The k-th neuron of the output layer emits an output signal z_k through the nonlinear activation function σ, given by

z_k = σ(net_k) = { 1 if net_k ≥ 0;  −1 if net_k < 0 }     (1.268)

The output z_k of each output neuron can be considered as a direct function of an input pattern x through the feedforward operations of the network. Furthermore, we can associate the entire feedforward process with a discriminant function g_k(x) capable of separating a class (of the K classes) represented by the k-th output neuron. This discriminant function is obtained by combining the last 4 equations as follows:

g_k(x) = z_k = σ\left( \sum_{j=1}^{N_h} w_{kj} \, σ\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)     (1.269)

where the outer expression σ(•) represents the activation of the k-th output neuron, while the internal expression σ(•) represents the activation net_j of the j-th hidden neuron given by the (1.265). The activation function σ(net) must be continuous and differentiable. It can also be different in different layers or even different for each neuron. The (1.269) represents a category of discriminant functions that can be implemented by a three-layer MLP network starting from the samples of the training set {x_1, x_2, ..., x_N} belonging to K classes. The goal now is to find the network learning paradigm to obtain the synaptic weights w_{kj} and w_{ji} that describe the functions g_k(x) for all K classes.
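To make the feedforward computation of (1.269) concrete, the following is a minimal NumPy sketch (not code from the book); the names W1, W2, b1, b2 and the choice of a logistic sigmoid instead of a threshold are illustrative assumptions.

import numpy as np

def sigmoid(a):
    # Logistic activation; the text also uses threshold or tanh-type functions.
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """Feedforward pass of a 3-layer MLP (Eqs. 1.265-1.269).
    x: input pattern of dimension d
    W1: (Nh, d) input->hidden weights, b1: (Nh,) hidden biases (w_j0)
    W2: (K, Nh) hidden->output weights, b2: (K,) output biases (w_k0)
    Returns (y, z): hidden outputs and output responses g_k(x)."""
    net_j = W1 @ x + b1          # hidden activations (1.265)
    y = sigmoid(net_j)           # hidden outputs, differentiable sigma
    net_k = W2 @ y + b2          # output activations (1.267)
    z = sigmoid(net_k)           # discriminant values (1.269)
    return y, z

# Tiny usage example with random weights (purely illustrative).
rng = np.random.default_rng(0)
d, Nh, K = 4, 5, 3
W1, b1 = rng.normal(size=(Nh, d)), np.zeros(Nh)
W2, b2 = rng.normal(size=(K, Nh)), np.zeros(K)
y, z = mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
print(z)   # one value per class; the largest plays the role of the winning g_k(x)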


1.11.3 Backpropagation Algorithm

It is shown that an MLP network with three layers, an adequate number of nodes per layer, and appropriate nonlinear activation functions is sufficient to generate discriminant functions capable of separating, in a supervised context, also classes that are not linearly separable. The backpropagation algorithm is one of the simplest and most general methods for supervised learning in an MLP network. The theory, on the one hand, demonstrates that it is possible to implement any continuous function from the training set through an MLP network; on the other hand, from a practical point of view, it does not give explicit indications on the network configuration in terms of the number of layers and of neurons needed. The network has two operating modes:

Feedforward or testing, which consists of presenting a pattern to the input layer; the information processed by each neuron propagates forward through the network, producing a result in the output neurons.

Supervised learning, which consists of presenting a pattern in input and modifying (adapting) the synaptic weights of the network to produce a result very close to the desired one (the target value).

1.11.3.1 Supervised Learning

Learning with the backpropagation involves initially executing, for a network to be trained, the feedforward phase, calculating and memorizing the outputs of all the neurons (of all the layers). The values of the output neurons z_k are compared to the desired values t_k, and an error function (objective function) is used to evaluate this difference for each output neuron and for each training set sample. The evaluation of the overall error of the feedforward phase is a scalar value that depends on the current values of all the weights W of the network, which during the learning phase must be adequately updated in order to minimize the error function. This is achieved by evaluating with the SSE method (the Sum of the Squared Errors) the error of all the output neurons for each training set sample, and minimizing the following objective function J(W) with respect to the weights of the network:

J(W) = \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{1}{2} (t_{nk} − z_{nk})^2 = \sum_{n=1}^{N} \frac{1}{2} \| t_n − z_n \|^2     (1.270)

where N indicates the number of samples in the training set, K is the number of neurons in the output layer (coinciding with the number of classes), and the factor of 1/2 is included to cancel the contribution of the exponent with the differentiation, as we will see later. The backpropagation learning rule is based on the gradient descent. Once the weights are initialized with random values, their adaptation at the t-th iteration occurs in the direction that will reduce the error:


w(t + 1) = w(t) + Δw = w(t) − η \frac{\partial J(W)}{\partial w}     (1.271)

where η is the learning parameter that establishes the extent of the weight change. The (1.271) ensures that the objective function (1.270), which is never negative, is progressively minimized. The learning rule guarantees that the adaptation process converges once all input samples of the training set have been presented. Now let us look at the essential steps of supervised learning based on the backpropagation. The data of the problem are the samples x_k of the training set, the output of the MLP network, and the desired target values t_k. The unknowns are the weights of all the layers, to be updated with the (1.271), for which we should determine the adaptation Δw with the gradient descent:

Δw = −η \frac{\partial J(W)}{\partial w}     (1.272)

for each net weight (weights are updated in the opposite direction to the gradient). For simplicity, we will consider the objective function (1.270) for a single sample (N = 1):

J(W) = \frac{1}{2} \sum_{k=1}^{K} (t_k − z_k)^2 = \frac{1}{2} \| t − z \|^2     (1.273)

The essential steps of the backpropagation algorithm are:

1. Feedforward calculation. The sample (x_1, ..., x_d) is presented to the input layer of the MLP network. For each j-th neuron of the hidden layer, the activation value net_j is calculated with the (1.265), and the output y_j of this neuron is calculated and stored with the activation function σ, i.e., Eq. (1.266). Similarly, the activation value net_k is calculated and stored with the (1.267), and the output z_k of the output neuron with the activation function σ, Eq. (1.268). For each neuron, the values of the derivatives of the activation functions σ′ are stored (see Fig. 1.42).

2. Backpropagation in the output layer. At this point, we begin to compute the first set of partial derivatives ∂J/∂w_{kj} of the error with respect to the weights w_{kj} between hidden and output neurons. This is done with the rule of derivation of composite functions (chain rule), since the error does not depend exclusively on w_{kj}. By applying this rule, we obtain

\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}}     (1.274)

Let us now calculate each partial derivative of the three terms of the (1.274) separately.

Fig. 1.42 Reverse path to the feedforward one, shown in Fig. 1.41, during the learning phase for the backward propagation of the error: the backpropagation error δ_k^o of the k-th output neuron and the backpropagation error δ_j^h associated with the j-th hidden neuron

The first term, considering the (1.273), results in:

\frac{\partial J}{\partial z_k} = \frac{\partial}{\partial z_k} \left[ \frac{1}{2} \sum_{m=1}^{K} (z_m − t_m)^2 \right] = (z_k − t_k)     (1.275)

The second term, considering the activation value of the k-th output neuron given by the (1.267) and the corresponding output signal z_k given by its nonlinear activation function σ, Eq. (1.268), results in:

\frac{\partial z_k}{\partial net_k} = \frac{\partial}{\partial net_k} σ(net_k) = σ′(net_k)     (1.276)

The activation function is generally nonlinear, and commonly the sigmoid function^24 given by the (1.225) is chosen, which, substituted in the (1.276), gives:

\frac{\partial z_k}{\partial net_k} = \frac{\partial}{\partial net_k} \left[ \frac{1}{1 + \exp(−net_k)} \right] = \frac{\exp(−net_k)}{(1 + \exp(−net_k))^2} = (1 − z_k) z_k     (1.277)

^24 The sigmoid or S-shaped curve function is often used as a transfer function in neural networks considering its nonlinearity and easy differentiability. In fact, its derivative is given by

\frac{dσ(x)}{dx} = \frac{d}{dx} \left[ \frac{1}{1 + \exp(−x)} \right] = σ(x)(1 − σ(x))

and is easily implementable.


The third term, considering the activation value net_k given by the (1.267), results in:

\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial}{\partial w_{kj}} \left[ \sum_{n=1}^{N_h} w_{kn} y_n \right] = y_j     (1.278)

From the (1.278), we observe that only one element in the sum net_k (that is, of the inner product between the output vector y of the hidden neurons and the weight vector w_k of the output neuron) depends on w_{kj}. Combining the results obtained for the three terms, (1.275), (1.277), and (1.278), we get

\frac{\partial J}{\partial w_{kj}} = \underbrace{(z_k − t_k)(1 − z_k) z_k}_{δ_k} \, y_j = δ_k y_j     (1.279)

where we assume y_j = 1 for the bias weight, i.e., for j = 0. The expression indicated with δ_k defines the error of backpropagation. It is highlighted that in the (1.279) the weight w_{kj} is the variable entity, while its input y_j is a constant.

3. Backpropagation in the hidden layer. The partial derivatives of the objective function J(W) with respect to the weights w_{ji} between input and hidden neurons must now be calculated. Applying the chain rule again gives

\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}     (1.280)

In this case, the first term ∂J/∂y_j of the (1.280) cannot be determined directly, because we do not have a desired value t_j to compare with the output y_j of a hidden neuron. The error signal must instead be recursively inherited from the error signal of the neurons to which this hidden neuron is connected. For the MLP in question, the derivative of the error function must consider the backpropagation of the error of all the output neurons. In the case of an MLP with more hidden layers, reference would be made to the neurons of the next layer. Thus, the derivative of the error on the output y_j of the j-th hidden neuron is obtained by considering the errors propagated backward by the output neurons:

\frac{\partial J}{\partial y_j} = \sum_{n=1}^{K} \frac{\partial J}{\partial z_n} \frac{\partial z_n}{\partial net_n^o} \frac{\partial net_n^o}{\partial y_j}     (1.281)

The first two terms in the summation of the (1.281) have already been calculated, respectively, with the (1.275) and (1.277) in the previous step, and their product corresponds to the backpropagation error δ_k associated with the k-th output neuron:

\frac{\partial J}{\partial z_n} \frac{\partial z_n}{\partial net_n^o} = (z_n − t_n)(1 − z_n) z_n = δ_n^o     (1.282)


where the propagation error is explicitly reported with the superscript "o" to indicate the association with the output neuron. The third term in the summation of the (1.281) is given by

\frac{\partial net_n^o}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \sum_{s=1}^{N_h} w_{ns} y_s \right] = w_{nj}^o     (1.283)

From the (1.283), it is observed that only one element in the sum net_n (that is, of the inner product between the output vector y of the hidden neurons and the weight vector w_n^o of the output neuron) depends on y_j. Combining the results of the derivatives (1.282) and (1.283), the derivative of the error on the output y_j of the j-th hidden neuron, given by the (1.281), becomes

\frac{\partial J}{\partial y_j} = \sum_{n=1}^{K} \frac{\partial J}{\partial z_n} \frac{\partial z_n}{\partial net_n^o} \frac{\partial net_n^o}{\partial y_j} = \sum_{n=1}^{K} \underbrace{(z_n − t_n)(1 − z_n) z_n}_{δ_n^o} w_{nj} = \sum_{n=1}^{K} δ_n^o w_{nj}     (1.284)

From the (1.284), we highlight how the error propagates backward on the j-th hidden neuron, accumulating the error signals coming backward from all the K output neurons to which it is connected (see Fig. 1.42). Moreover, this backpropagation error is weighted by the connection strength of the hidden neuron with all the output neurons. Returning to the (1.280), the second term ∂y_j/∂net_j and the third term ∂net_j/∂w_{ji} are calculated in a similar way to those of the output layer, given by Eqs. (1.277) and (1.278), which in this case are

\frac{\partial y_j}{\partial net_j} = (1 − y_j) y_j     (1.285)

\frac{\partial net_j}{\partial w_{ji}} = x_i     (1.286)

The final result of the partial derivatives ∂J/∂w_{ji} of the objective function, with respect to the weights of the hidden neurons, is obtained by combining the results of the single derivatives (1.284) and of the last two equations, as follows:

\frac{\partial J}{\partial w_{ji}} = \underbrace{\left[ \sum_{n=1}^{K} δ_n^o w_{nj} \right] (1 − y_j) y_j}_{δ_j^h} \, x_i = δ_j^h x_i     (1.287)


where δ_j^h indicates the backpropagated error related to the j-th hidden neuron. Recall that for the bias weight the associated input value is x_i = 1.

4. Weights update. Once all the partial derivatives are calculated, all the weights of the MLP network are updated in the direction of the negative gradient with the (1.271), considering the (1.279) and the (1.287). For the weights w_{kj} of the hidden → output connections, we have

w_{kj}(t + 1) = w_{kj}(t) − η \frac{\partial J}{\partial w_{kj}} = w_{kj}(t) − η δ_k^o y_j,     k = 1, ..., K;  j = 0, 1, ..., N_h     (1.288)

remembering that for j = 0 (the bias weight) we assume y_j = 1. For the weights w_{ji} of the input → hidden connections, we have

w_{ji}(t + 1) = w_{ji}(t) − η \frac{\partial J}{\partial w_{ji}} = w_{ji}(t) − η δ_j^h x_i,     i = 0, 1, ..., d;  j = 1, ..., N_h     (1.289)

Let us now analyze the weight update equations (1.288) and (1.289), based on the gradients (1.279) and (1.287), and see how they affect the network learning process. The gradient descent procedure is conditioned by the initial values of the weights; it is normal to set the initial weight values randomly. The update amount for the k-th output neuron is proportional to (z_k − t_k); it follows that no update occurs when the output of the neuron and the desired value coincide. The sigmoid activation function is always positive and controls the output of the neurons. According to the (1.279), y_j and (z_k − t_k) concur, based on their own sign, to adequately modify (decrease or increase) the weight value. Note also that if a pattern presented to the network produces no signal in a hidden neuron (y_j = 0), no update is applied to the corresponding weights.
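As an illustration, here is a compact NumPy sketch of one gradient-descent step for a single sample, implementing the deltas of (1.279) and (1.287) and the updates (1.288)-(1.289); it is a sketch under the assumption of logistic sigmoid units (so that σ′ = z(1 − z)), and the names W1, W2, b1, b2 are illustrative, not from the book.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.1):
    """One supervised learning step for a single sample (x, t)."""
    # Feedforward (step 1)
    y = sigmoid(W1 @ x + b1)                 # hidden outputs
    z = sigmoid(W2 @ y + b2)                 # network outputs
    # Backpropagation in the output layer (step 2): delta_k = (z_k - t_k)(1 - z_k) z_k
    delta_o = (z - t) * (1.0 - z) * z
    # Backpropagation in the hidden layer (step 3):
    # delta_j = [sum_k delta_k w_kj] (1 - y_j) y_j
    delta_h = (W2.T @ delta_o) * (1.0 - y) * y
    # Weight update (step 4): move against the gradient, Eqs. (1.288)-(1.289)
    W2 -= eta * np.outer(delta_o, y);  b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x);  b1 -= eta * delta_h
    return W1, b1, W2, b2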

1.11.4 Learning Mode with Backpropagation

The learning methods concern how to present the samples of the training set and how to update the weights. The three most common methods are:

1. Online. Each sample is presented only once and the weights are updated after the presentation of the sample (see Algorithm 6).
2. Stochastic. The samples are randomly chosen from the training set and the weights are updated after the presentation of each sample (see Algorithm 7).
3. Batch Backpropagation. Also called off-line: the weights are updated after the presentation of all the samples of the training set. The variations of the weights for each sample are stored, and the update of the weights takes place only when all the samples have been presented once. In fact, the objective function of batch learning is the (1.270), and its derivative is the sum of the derivatives for each sample:

\frac{\partial}{\partial w} J(W) = \sum_{n=1}^{N} \frac{\partial}{\partial w} \left[ \frac{1}{2} \sum_{k=1}^{K} (t_{nk} − z_{nk})^2 \right]     (1.290)

where the partial derivatives of the expression [•] have been calculated previously and are those related to the objective function of the single sample (see Algorithm 8).

Algorithm 6 Online Backpropagation algorithm
1: Initialize: w, N_h, η, convergence criterion θ, n ← 0
2: do n ← n + 1
3:    x_n ← choose a pattern sequentially
4:    w_{ji} ← w_{ji} − η δ_j^h x_i ;  w_{kj} ← w_{kj} − η δ_k^o y_j
5: until ‖∇J(w)‖ < θ
6: return w
7: end

Algorithm 7 Stochastic Backpropagation algorithm
1: Initialize: w, N_h, η, convergence criterion θ, n ← 0
2: do n ← n + 1
3:    x_n ← choose a pattern randomly
4:    w_{kj} ← w_{kj} − η δ_k^o y_j ;  w_{ji} ← w_{ji} − η δ_j^h x_i
5: until ‖∇J(w)‖ < θ
6: return w
7: end

Algorithm 8 Batch Backpropagation algorithm
1: Initialize: w, N_h, η, convergence criterion θ, epoch ← 0
2: do epoch ← epoch + 1
3:    m ← 0; Δw_{ji} ← 0; Δw_{kj} ← 0
4:    do m ← m + 1
5:       x_m ← select a sample
6:       Δw_{ji} ← Δw_{ji} − η δ_j^h x_i ;  Δw_{kj} ← Δw_{kj} − η δ_k^o y_j
7:    until m = n
8:    w_{ji} ← w_{ji} + Δw_{ji} ;  w_{kj} ← w_{kj} + Δw_{kj}
9: until ‖∇J(w)‖ < θ
10: return w
11: end

From the experimental analysis, the stochastic method is faster than the batch method, even if the latter fully uses the direction of the gradient descent to converge. Online training is used when the number of samples is very large, but it is sensitive to the order in which the samples of the training set are presented.
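To make the distinction explicit, here is a small Python sketch that contrasts the per-sample (online/stochastic) update with the accumulated (batch) update; the function grad, which returns the gradient of the single-sample objective, is a hypothetical placeholder standing for the backpropagation computation described above.

def train_online(samples, weights, grad, eta=0.1):
    # Online/stochastic mode: the weights are corrected after every single sample.
    for x, t in samples:
        g = grad(weights, x, t)
        weights = [w - eta * gw for w, gw in zip(weights, g)]
    return weights

def train_batch(samples, weights, grad, eta=0.1):
    # Batch (off-line) mode: gradients are accumulated over the whole training set
    # (Eq. 1.290) and the weights are updated once per epoch.
    acc = None
    for x, t in samples:
        g = grad(weights, x, t)
        acc = g if acc is None else [a + gw for a, gw in zip(acc, g)]
    return [w - eta * a for w, a in zip(weights, acc)]

# Toy demonstration with a single scalar weight and J(w) = 0.5 * (w*x - t)^2
samples = [(1.0, 2.0), (2.0, 3.0)]
grad = lambda w, x, t: [(w[0] * x - t) * x]
print(train_online(samples, [0.0], grad))   # two small corrections
print(train_batch(samples, [0.0], grad))    # one accumulated correction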

1.11.5 Generalization of the MLP Network

Fig. 1.43 Learning curves (mean squared error as a function of the training epochs) related to the three operational contexts of the MLP network: training, validation, and testing, with the point of early stopping marked on the validation curve

An MLP network is able to approximate any nonlinear function if the training set of samples (input data/desired output data) presented is adequate. Let us now see what the level of generalization of the network is, that is, the ability to recognize a pattern not presented in the training phase and not very different from the sample patterns. The learning dynamics of the network is such that at the beginning the error on the samples is very high, and it then decreases, tending asymptotically to a value that depends on: the Bayesian error of the samples, the size of the training set, the network configuration (number of neurons and layers), and the initial values of the weights. A graphical representation of the learning dynamics (see Fig. 1.43) is obtained by reporting on the ordinate how the error varies with respect to the number of epochs performed. From the learning curve obtained, one can decide the level of training and when to stop it. Normally, the learning is stopped when the required error is reached or when an asymptotic value is reached. A situation of saturation in learning can occur, in the sense that the training data are approximated excessively (for example, when the samples are presented too many times), generating the phenomenon of overfitting (in this context, overtraining), with the consequent loss of the generalization of the network when one then enters the test context. A strategy to control the adequacy of the level of learning achieved is to use test samples, different from the training samples, and validate the generalization behavior of the network. On the basis of the results obtained, it is also possible to reconfigure the network in terms of number of nodes. A strategy that allows an appropriate configuration of the network (and avoids the problem of overfitting) is that of having a third set of samples, called the validation set. The dynamics of learning is analyzed by comparing the curve obtained from the training set with the one related to the


validation set (see Fig. 1.43). From the comparison of the curves, one can decide to stop the learning at the local minimum reached on the validation curve.

1.11.6 Heuristics to Improve Backpropagation

We have seen that a neural network like the MLP is based on mathematical/computational foundations inspired by biological neural networks. The backpropagation algorithm used for supervised learning is set up as an error minimization problem associated with the training set patterns. In these conditions, it is shown that the convergence of the backpropagation is possible, both in probabilistic and in deterministic terms. Nevertheless, it is useful in real applications to introduce heuristics aimed at optimizing the implementation of an MLP network, in particular for the aspects of classification and pattern recognition.

1.11.6.1 Dynamic Learning Improvement: Momentum

We know that the learning factor η controls the behavior of the backpropagation algorithm: for small values there is a slow convergence, which ensures better effectiveness, while for large values the network can become unstable. Another aspect to consider concerns the typical local minimum problem of the gradient descent method (see Fig. 1.44), which implies that a sub-optimal solution may be reached. In these contexts, it is useful to modify the weight adaptation rule. One solution derived from physics is the use of momentum: objects in motion tend to stay in motion unless external actions intervene to change this situation. In our context, we should dynamically vary the learning factor as a function of the variation of the previous partial derivatives (the dynamics of the system in this case is altered by the gradient of the error function). In particular, this factor should be increased where the variation of the partial derivatives is almost constant, and decreased where the value of the partial derivative undergoes a change. This results in modifying the weight adaptation rule by including some fraction of the previously applied weight changes (the idea is to link the updating of the current

Fig. 1.44 The problem of the backpropagation algorithm which, based on the gradient descent, finds a local minimum in minimizing the error function


weights taking into account the past iterations). Let Δw(t) = w(t) − w(t − 1) be the variation of the weights at the t-th iteration; the adaptation rule of the weights (for example, considering the (1.289)) is modified as follows:

w(t + 1) = w(t) + (1 − α) \left[ −η \frac{\partial J}{\partial w} \right] + α Δw(t − 1)     (1.291)

where α (also called momentum) is a positive number with values between 0 and 1, and the expression [•] is the variation of the weights associated with the gradient descent, as expected for the backpropagation rule. In essence, the α parameter determines the amount of influence of the previous iterations over the current one. The momentum introduces a sort of damping on the dynamics of adaptation of the weights, avoiding oscillations in the irregular areas of the surface of the error function by averaging the components of the gradient with opposite sign, and speeding up the convergence in the flat areas. This helps to prevent the search process from being blocked on a local minimum. For α = 0, we have the plain gradient descent rule; for α = 1, the gradient descent is ignored and the weights change by the same amount as in the previous iteration. Normally, α = 0.9 is used.
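A minimal Python sketch of the momentum-modified update of (1.291); the argument grad_J_w, standing for the current value of ∂J/∂w, and the default parameter values are illustrative assumptions.

def momentum_update(w, prev_dw, grad_J_w, eta=0.1, alpha=0.9):
    """One weight update with momentum (Eq. 1.291).
    w: current weight (scalar or NumPy array)
    prev_dw: previous weight variation, Delta w(t-1)
    grad_J_w: current gradient dJ/dw
    Returns (new_w, dw), where dw is reused at the next iteration."""
    dw = (1.0 - alpha) * (-eta * grad_J_w) + alpha * prev_dw
    return w + dw, dw

# Usage: keep the last variation between iterations.
w, dw = 0.5, 0.0
w, dw = momentum_update(w, dw, grad_J_w=0.3)
print(w, dw)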

1.11.6.2 Properties of the Activation Function

The backpropagation algorithm accepts any type of activation function σ(•) as long as it is differentiable. Nevertheless, it is convenient to choose functions with the following properties:

(a) Nonlinear, to ensure nonlinear decision boundaries.
(b) Saturated, i.e., with a minimum and a maximum output value. This allows the weights and the activation potentials to be kept in a limited range, as well as the training time.
(c) Continuous and differentiable, that is, σ(•) and σ′(•) are defined in the whole input range. The derivative is important to derive the weight adaptation rule. Backpropagation can accept piecewise linear activation functions, even if this adds complexity and few benefits.
(d) Monotonic, to avoid the introduction of additional local minima.
(e) Linear for small values of net, so that the network can implement linear models according to the type of data.
(f) Antisymmetric, to guarantee a faster learning phase. An antisymmetric function (σ(−x) = −σ(x)) is the hyperbolic tangent (1.226).

The activation function that satisfies all the properties described above is the following sigmoid function:

σ(net) = a · tanh(b · net) = a \frac{e^{b · net} − e^{−b · net}}{e^{b · net} + e^{−b · net}}     (1.292)


with the following optimal values for a = 1.716 and b = 2/3, and the linear interval −1 < net < 1.
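For reference, a small Python sketch of the antisymmetric sigmoid of (1.292) and of its derivative, with the suggested constants a = 1.716 and b = 2/3 (the function names are illustrative).

import math

A, B = 1.716, 2.0 / 3.0

def sigma(net):
    # sigma(net) = a * tanh(b * net), Eq. (1.292)
    return A * math.tanh(B * net)

def sigma_prime(net):
    # derivative: a * b * (1 - tanh^2(b * net))
    return A * B * (1.0 - math.tanh(B * net) ** 2)

print(sigma(0.5), sigma_prime(0.5))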

1.11.6.3 Preprocessing of Input Data

The training set data must be adequately processed, ensuring that the computed mean is zero, or in any case small compared to the variance. In essence, the input data must be normalized to homogenize the variability range. With the normalization, the data can be transformed (x_n = (x − μ_x)/σ_x) so as to have zero mean and unit variance. This data normalization is not required for online backpropagation, where the entire training set is not processed simultaneously. Avoid presenting patterns with highly correlated or redundant features to the network; in these cases, it is convenient to apply a transform to the principal components to verify the level of their correlation. When the network is used for classification, the desired target value that identifies a class must be compatible with the range of definition of the activation function. For any finite value of net, the output σ(net) never reaches the saturation values of the sigmoid function (±1.716); if these were used as target values, the error could never be cancelled, the algorithm would never converge, and the weights would tend toward infinity. One solution is to use target vectors of the type t = (−1, −1, 1)^T, where 1 indicates the class and −1 indicates non-membership (in the example, the target vector represents class ω_3 in a 3-class classification problem).

1.11.6.4 Network Configuration: Neurons and Weights

The number of neurons in the input and output layers is imposed by the dimensionality of the problem: the size of the pattern features and the number of classes. The number of hidden neurons N_h instead characterizes the potential of the network itself. For very small values of N_h, the network may be insufficient to learn complex decision functions (or to approximate generic nonlinear functions). For very large N_h, we could face the problem of overfitting the training set and lose the generalization of the network when new patterns are presented. Although in the literature there are several suggestions for choosing N_h, the problem remains unresolved. Normally, each MLP network is configured with N_h neurons after several tests for a given application. For example, one starts with a small number and then gradually increases it or, on the contrary, one starts with a large number which is then decreased. Another rule can be to correlate N_h with the number N of samples of the training set, choosing, for example, N_h = N/10. The initial value of the weights must be different from zero, otherwise the learning process does not start. A correct approach is to initialize them with small random values. Those relating to the H-O (Hidden-Output) connections must be larger than those of the I-H (Input-Hidden) connections, since they must carry back the propagation error. Very small values for the H-O weights involve a very small variation of the weights


in the hidden layer, with the consequent slowing of the learning process. According to the interval of definition of the sigmoid function, the heuristic used for the choice of the weights of the I-H layer is that of a uniform random distribution in the interval [−1/√d, 1/√d], while for the H-O layer the interval is [−1/√N_h, 1/√N_h], where d indicates the dimension of the input samples. The backpropagation algorithm is applicable to MLPs with more than one hidden layer. The increase in the number of hidden layers does not improve the power of approximating any function: a 3-layer MLP is sufficient. From the experimental analysis, for some applications, it was observed that the configuration of an MLP with more than 3 layers presents a faster learning phase with the use of a smaller total number of hidden neurons. However, there is a greater predisposition of the network to the problem of the local minimum.

1.11.6.5 Learning Speed, Stop Criterion, and Weight Decay

Backpropagation theory does not establish an exact criterion to stop the learning phase of the MLP. We have also seen the problem of overfitting when the training of the network is not properly stopped. One approach may be to monitor the error function J(W) during the training with the validation set. The method of early stopping can be used to prevent overfitting by monitoring the mean square error (MSE) of the MLP on the validation set during the learning phase; in essence, the training is stopped at the minimum of the curve of the validation set (see Fig. 1.43). In general, like all algorithms based on the gradient descent, the backpropagation depends on the learning parameter η. This essentially indicates the speed of learning, which for small values (starting from 0.1) ensures convergence (that is, a minimum of the error function J(W) is reached), although it does not guarantee the generalization of the network. A possible heuristic is to dynamically adapt η in relation to the current value of the function J during the gradient descent. If J fluctuates, η may be too large and should be decreased. If instead J decreases very slowly, η is too small and should be increased. Another heuristic that can avoid the problem of overfitting is to keep the weight values small. There is no theoretical motivation to justify that weight decay should always lead to an improvement in network performance (in fact, there may be sporadic cases where it leads to performance degradation), although experimentally it is observed that in most cases it is useful. Weight reduction is performed after each update as follows: w^{new} = w^{old}(1 − ε), with ε a small positive constant.

It follows that we can increase the shift s by j − k (to the right), according to the (1.323), which has the effect of aligning the character x[k] with the discordant character in the text T (see Fig. 1.67b). 3. k > j: this is the configuration where the discordant character T[s + j] is present in the pattern x, but to the right of the position j of the discordant character in x, resulting in j − k < 0, which would imply a negative shift to the left; therefore, the increment of s proposed by the (1.323) is ignored, and we can only increase s by 1 character to the right (see Fig. 1.67c).

Fig. 1.67 The three configurations of the discordant character heuristic: a the discordant character does not occur in the pattern (k = 0); b k < j, the pattern is shifted by j − k characters to align x[k] with the discordant character in the text; c k > j, so j − k is negative. In the example of c, j = 6 and k = 8, and the heuristic proposing a negative shift is ignored

In the Boyer–Moore algorithm, the heuristic of the discordant character is realized with the function of the last occurrence λ : {σ_1, σ_2, ..., σ_|V|} → {0, 1, ..., m}, given as follows:

λ[σ_i] = max{k : 1 ≤ k ≤ m and x[k] = σ_i}   if σ_i ∈ x
λ[σ_i] = 0                                   otherwise     (1.325)

where σ_i is the i-th symbol of the alphabet V. The function of the last occurrence defines λ[σ_i] as the pointer to the rightmost position (i.e., of the last occurrence) in x where the character σ_i appears, for all the characters of the alphabet V. The pointer is zero if σ_i does not appear in x. The pseudo-code that implements the last occurrence function is given below (Algorithm 16).

Fig. 1.68 The different configurations of the good suffix heuristic (string with a green background). a k does not exist. In the pattern x, no prefix occurs that is also a suffix of T[s + j + 1 : s + m]; a pattern shift equal to its length m is proposed. b k does not exist, but a prefix α of x occurs (in the example α = "CA") which is also a suffix of T[s + j + 1 : s + m], indicated with β; a pattern shift is proposed to match the prefix α with the text suffix β (increment of s by 7 characters). c k exists. In the pattern, a substring occurs (in the example "ACA", orange colored) coinciding with the suffix that occurs in T[s + j + 1 : s + m] and satisfying the condition x[k] ≠ x[j]. As in (b), a shift of the pattern is proposed to align the substring found in x with the suffix of the text indicated above (in the example, the increment is 3 characters)

1.19.2.2 Good Suffix Heuristics

The strategy of this heuristic is to analyze, as soon as the discordant character is found, the presence of identical substrings between text and pattern. Figure 1.68a shows a possible configuration while this heuristic operates as soon as a discordant character is detected. We observe the presence of the suffix substring in the pattern that coincides with the identical substring in the text (in the figure they have the same color). Having found the identical substring between text and pattern, the heuristic of the good suffix proposes to verify whether in the pattern there exist other substrings identical to the suffix and, in the affirmative case, it suggests to move the pattern to the right to align the substring found in the pattern with that of the good suffix in the text (see figure). In formal terms, using the previous meaning of the symbols, we have the discordant character between text T[s + j] and pattern x[j], with s the current shift and j the position in x of the discordant character. The suffix x[j + 1 : m] is the same as the substring T[s + j + 1 : s + m] in the text. We want to find, if it exists in x, a copy of the suffix at the rightmost position k < j, such that

x[k + 1 : k + m − j] = T[s + j + 1 : s + m]   and   x[k] ≠ x[j]     (1.326)


Algorithm 16 Pseudo-code of the last occurrence function: Last_Occurrence(x, V)
1: Input: pattern x, m length of x, Alphabet V
2: Output: λ
3: for ∀σ ∈ V do
4:    λ[σ] ← 0
5: end for
6: for j ← 1 to m do
7:    λ[x[j]] ← j
8: end for
9: return λ
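As a quick illustration, a Python sketch equivalent to Algorithm 16 (the function name and the 1-based positions stored in the table are kept to mirror the notation of (1.325); 0 marks a symbol absent from the pattern).

def last_occurrence(pattern, alphabet):
    """Rightmost position (1-based, 0 if absent) of each alphabet symbol in the pattern."""
    lam = {symbol: 0 for symbol in alphabet}
    for j, ch in enumerate(pattern, start=1):
        lam[ch] = j        # later (more to the right) occurrences overwrite earlier ones
    return lam

print(last_occurrence("GGTAAGGA", "ACGT"))
# {'A': 8, 'C': 0, 'G': 7, 'T': 3}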

The first expression indicates the existence of a copy of the good suffix starting from the position k + 1, while the second one indicates that this copy is preceded by a character different from the one that caused the discordance, i.e., x[j]. Having met the (1.326), the heuristic suggests updating the shift s as follows:

s ← s + (j − k),     j = 1, ..., m,   k = 1, ..., m     (1.327)

and moving the pattern x to the new position s + 1. Comparing the characters of x from position k to k + m − j is useless. Figure 1.68 schematizes the functionality of the good suffix heuristic for the 3 possible configurations as the index k varies.

1. k = 0: since there is no copy in x of the suffix of T[s + j + 1 : s + m], move x by m characters (see Fig. 1.68a).
2. k = 0 but there is a prefix α of x: the prefix α exists and, if it also coincides with a suffix β of T[s + j + 1 : s + m], the heuristic suggests a shift of m − |α| characters, so as to match the prefix α with the suffix β of T[s + j + 1 : s + m] (see Fig. 1.68b).
3. k exists: since there is a copy in x of the good suffix (starting from position k + 1), preceded by a character different from the one that caused the discordance, move x by the minimum number of characters needed to align this copy found in x with the coincident suffix of T. In other words, there is another occurrence of T[s + j + 1 : s + m] in x, that is, x[k + 1 : k + m − j] with x[k] ≠ x[j] (see Fig. 1.68c). This situation is attributable to the previous case when this copy in x is just a prefix of x. In fact, as shown in Fig. 1.68b, that copy (the substring α = x[1 : |α|] = "CA") matches a prefix of x preceded by the null character ε (imagined in the position k = 0, i.e., x[0]), which satisfies the condition x[k] ≠ x[j], i.e., ε ≠ x[j]. This prefix coincides with the suffix T[s + m − |α| + 1 : s + m], and the proposed shift always consists in aligning the prefix of the pattern with the suffix of the text.


In the Boyer–Moore algorithm, the good suffix heuristic is realized with the good suffix function γ[j], which defines, once the discordant character x[j] ≠ T[s + j] has been found in position j, j < m, the minimum amount of increment of the shift s, given as follows:

γ[j] = m − max{k : 0 ≤ k < m and x[j + 1 : m] ∼ x[1 : k] with x[k] ≠ x[j]}     (1.328)

where the symbol ∼ indicates a relationship of similarity between two strings.^35 In this context, we have that x[j + 1 : m] ⊃ x[1 : k] or x[1 : k] ⊃ x[j + 1 : m]. The function γ[j] with the (1.328) determines a minimum value by which to increase the shift s without causing any character in the good suffix T[s + j + 1 : s + m] to be discordant with the proposed new pattern alignment (see also the implication in Note 35). The pseudo-code that implements the good suffix function is given in Algorithm 17.

Algorithm 17 Pseudo-code of the good suffix algorithm: Good_Suffix(x, m)
1: Input: pattern x, m length of x
2: Output: γ
3: π ← Func_Prefix(x)
4: x′ ← Reverse(x)
5: π′ ← Func_Prefix(x′)
6: for j ← 0 to m do
7:    γ[j] ← m − π[m]
8: end for
9: for k ← 1 to m do
10:    j ← m − π′[k]
11:    if (γ[j] > (k − π′[k])) then
12:       γ[j] ← k − π′[k]
13:    end if
14: end for
15: return γ

^35 Let α and β be two strings; we define a similarity relation α ∼ β (read: α is similar to β) with the meaning that α ⊃ β or β ⊃ α (where we recall that the symbol ⊃ has the meaning of suffix). It follows that, if two strings are similar, we can align them with their identical characters further to the right, and no pair of aligned characters will be discordant. The similarity relation ∼ is symmetric, that is, α ∼ β if and only if β ∼ α. It is also shown that the following implication holds: α ⊃ β and y ⊃ β ⟹ α ∼ y.


From the pseudo-code, we observe the presence of the prefix function π applied to the pattern x and to its reverse, indicated with x′. This function is used in the preprocessing of the Knuth–Morris–Pratt string-matching algorithm and is formalized as follows: given a pattern x[1 : m], the prefix function for x is the function π : {1, 2, ..., m} → {0, 1, 2, ..., m − 1} such that

π[q] = max{k : k < q and x[1 : k] ⊃ x[1 : q]}     (1.329)

In essence, the (1.329) indicates that π[q] is the length of the longest prefix of the pattern x that is also a suffix of x[1 : q]. Returning to the good suffix algorithm (Algorithm 17), the first for-loop calculates the vector γ as the difference between the length of the pattern x and the values returned by the prefix function π. With the second for-loop, having already initialized the vector γ, the latter is updated with the values of π′ for any shifts smaller than those calculated with π in the initialization. The pseudo-code of the prefix function π given by the (1.329) is given in Algorithm 18.

Algorithm 18 Pseudo-code of the prefix function algorithm: Func_Prefix(x, m)
1: Input: pattern x, m length of x
2: Output: π
3: π[1] ← 0
4: k ← 0
5: for i ← 2 to m do
6:    while k > 0 and x[k + 1] ≠ x[i] do
7:       k ← π[k]
8:    end while
9:    if (x[k + 1] = x[i]) then
10:       k ← k + 1
11:    end if
12:    π[i] ← k
13: end for
14: return π

We are now able to present the Boyer–Moore algorithm, having defined the two preprocessing functions, that of the discordant character (Last_Occurrence) and that of the good suffix (Good_Suffix). The pseudo-code is given in Algorithm 19. Figure 1.69 shows a simple example of the Boyer–Moore algorithm that finds the first occurrence of the pattern x[1:9] = "CTAGCGGCT" in the text T[1:28] = "CTTATAGCTGATCGCGGCCTAGCGGCTAA" after 6 steps, having previously


Algorithm 19 Pseudo-code of Boyer–Moore’s string-matching algorithm 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14:

Input: Text T , pattern x, n length of T and m length of x, Alphabet V Output: Find the positions s of the occurrences of x in T λ ← Last_Occurr ence(x, V ) γ ← Good_Su f f i x(x, m) s←0 while s ≤ n − m do j ←m while j > 0

and

x[ j] = T [s + j] do

j ← j −1 if j = 0 then print “pattern x appears in position", s s ← s + γ [0] else s ← s + max(γ [ j], j − λ[T [s + j]])

15:

end if

16:

end while

17: end while

pre-calculated the tables of the two heuristics, appropriate for the alphabet V = {A, C, G, T} and the pattern x considered. From the analysis of the Boyer–Moore algorithm, we can observe the similarity with the simple algorithm Algorithm 15, from which it differs substantially in the comparison between pattern and text, which occurs from right to left, and in the use of the two heuristics to evaluate shifts of the pattern by more than 1 character. In fact, while in the simple string-matching algorithm the shift s is always of 1 character, with the Boyer–Moore algorithm, when the discordant character is determined, the instruction associated with line 14 is executed to increase s by a quantity that corresponds to the maximum of the values suggested by the two heuristic functions. The computational complexity [47,48] of the Boyer–Moore algorithm is O(nm). In particular, the part related to the preprocessing, due to the Last_Occurrence function, is O(m + |V|) and that of the Good_Suffix function is O(m), while the part due to the search phase is O((n − m + 1)m). The string comparisons saved in the search phase depend very much on the heuristics, which learn useful information about the internal structure of the pattern or text. To operate in linear time, the two heuristics are implemented through tables that pre-calculate, for the entire alphabet, the rightmost positions of occurrence in x of each symbol and, for the pattern, the rightmost positions of occurrence of its suffixes.

Fig. 1.69 Complete example of the Boyer–Moore algorithm. In step 1, we have character discordance for j = 6 and a match between the suffix "CT" (of the good suffix T[7:9] = "GCT") and the prefix α = x[1:2] = "CT"; between the two heuristics, the good suffix (Gs = 7) wins over that of the discordant character (Lo = 3), moving the pattern by 7 characters as shown in the figure. In step 2, the discordant-character heuristic wins, having found the rightmost discordant character in the pattern, and proposes a shift of Lo = 2, greater than that of the good suffix heuristic, Gs = 1. In steps 3 and 4, both heuristics suggest moving by 1 character. Step 5 instead chooses the good suffix heuristic, Gs = 7 (configuration identical to step 1), while the other heuristic, which proposes a negative shift Lo = −1, is ignored. In step 6, we have the first occurrence of the pattern in the text

From experimental results, the Boyer–Moore algorithm performs well for large lengths of the pattern x and with large alphabets V. To further optimize the computational complexity, several variants of the Boyer–Moore algorithm have been developed [47,49,50].
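As a compact illustration, the following Python sketch searches the text from right to left within each alignment and shifts using only the discordant-character (last occurrence) heuristic; it is not the full Algorithm 19, since the good-suffix table is omitted, so the shifts can be smaller than those of the complete Boyer–Moore algorithm.

def bm_bad_character_search(text, pattern):
    """Report 0-based shifts s at which pattern occurs in text, using only
    the discordant-character heuristic of Boyer-Moore."""
    n, m = len(text), len(pattern)
    lam = {}
    for j, ch in enumerate(pattern, start=1):   # 1-based rightmost occurrence, Eq. (1.325)
        lam[ch] = j
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                               # scan the pattern from right to left
        if j == 0:
            print("pattern occurs with shift", s)
            s += 1                               # Algorithm 19 would advance by gamma[0]
        else:
            # shift proposed by the discordant character, never less than 1;
            # the complete algorithm takes the max with the good-suffix shift
            s += max(1, j - lam.get(text[s + j - 1], 0))

bm_bad_character_search("CTTATAGCTGATCGCGGCCTAGCGGCTAA", "CTAGCGGCT")
# prints: pattern occurs with shift 18 (position 19 in the 1-based notation of Fig. 1.69)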

1.19.3 Edit Distance

In the preceding paragraphs, we have addressed the problem of exact matching between strings. In the more general problem of pattern classification, we used the nearest-neighbor algorithm to determine the class to which the patterns belong. Even for a string pattern, it is useful to determine a category by evaluating a level of


similarity with strings grouped in classes according to some categorization criteria. In the context of string recognition, we define a concept of distance known as the edit distance, also known as the Levenshtein distance [51], to evaluate how many character operations are needed to transform a string x into another string y. This measure can be used to compare strings and express concepts of diversity, in particular in Computational Biology applications. The possible operations to evaluate the edit distance are:

Substitution: a character of x is replaced by the corresponding character in y. For example, turning "mario" into "maria" requires replacing "o" with "a".
Insertion: a character from y is inserted in x, with the consequent increment of the length of x by 1 character. For example, turning the string "mari" into "maria" requires the insertion of "a".
Deletion: a character is deleted in x, decreasing its length by one character.

These are the elementary operations used to calculate the edit distance. Other elementary operations can be considered, such as the transposition, which interchanges adjacent characters of a string x. For example, transforming x = "marai" into y = "maria" requires a single transposition operation, equivalent to 2 substitution operations. The edit distance between the strings "dived" and "davide" is equal to 3 elementary operations: 2 substitutions, "dived" → "daved" (substitution of "i" with "a") and "daved" → "david" (substitution of "e" with "i"), and 1 insertion, "david" → "davide" (the character "e"). The edit distance is the minimum number of operations required to make two strings identical. The edit distance can be calculated by giving different costs to each elementary operation; for simplicity, we will consider unit costs for each elementary edit operation. Given two strings

x = (x_1, x_2, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n)
y = (y_1, y_2, ..., y_{j−1}, y_j, y_{j+1}, ..., y_m),

it is possible to define the matrix D(i, j) as the edit distance of the prefix strings (x_1..x_i) and (y_1..y_j) and consequently obtain, as the final result, the edit distance D(n, m) between the two strings x and y, as the minimum number of edit operations to transform the entire string x into y. The calculation of D(i, j) can be set recursively, based on immediately shorter prefixes, considering that only three cases, associated with the related edit operations, can occur:

1. Substitution: the character x_i is replaced with y_j, and we will have the following edit distance:

D(i, j) = D(i − 1, j − 1) + c_s(i, j)     (1.330)


where D(i − 1, j − 1) is the edit distance between the prefixes (x_1..x_{i−1}) and (y_1..y_{j−1}), and c_s(i, j) indicates the cost of the substitution operation between the characters x_i and y_j, given by

c_s(i, j) = 1 if x_i ≠ y_j;   c_s(i, j) = 0 if x_i = y_j     (1.331)

2. Deletion: the character x_i is deleted, and we will have the following edit distance:

D(i, j) = D(i − 1, j) + c_d(i, j)     (1.332)

where D(i − 1, j) is the edit distance between the prefixes (x_1..x_{i−1}) and (y_1..y_j), and c_d(i, j) indicates the cost of the deletion operation, normally set equal to 1.

3. Insertion: the character y_j is inserted, and we will have the following edit distance:

D(i, j) = D(i, j − 1) + c_in(i, j)     (1.333)

where D(i, j − 1) is the edit distance between the prefixes (x_1..x_i) and (y_1..y_{j−1}), and c_in(i, j) indicates the cost of the insertion operation, normally set equal to 1.

Given that there are no other cases, and that we are interested in the minimum value, the edit distance, recursively defined, is given by

D(i, j) = min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c_s(i, j)}     (1.334)

with strictly positive i and j. The edit distance D(n, m) between two strings of n and m characters, respectively, can be calculated with a recursive procedure that implements the (1.334) starting from the basic conditions:

D(i, 0) = i,   i = 1, ..., n
D(0, j) = j,   j = 1, ..., m
D(0, 0) = 0     (1.335)

where D(i, 0) is the edit distance between the prefix string (x_1..x_i) and the null string ε, D(0, j) is the edit distance between the null string ε and the prefix string (y_1..y_j), and D(0, 0) represents the edit distance between null strings. A recursive procedure based on the (1.334) and (1.335) would be inefficient, requiring considerable computation time O[(n + 1)(m + 1)]. The strategy used, instead, is based on dynamic programming (see Algorithm 20). With this algorithm, a cost matrix D is used to calculate the edit distance starting from the basic conditions given by the (1.335), and then, using these basic values (lines 3–12), the edit distance (minimum cost, line 20) is calculated for each element D(i, j) (i.e., for pairs of prefixes that become longer and longer as i and j vary) with the (1.334), thus filling the matrix up to the element D(n, m), which represents the edit distance of the strings x and y of length n and m, respectively. In essence, with Algorithm 20, instead of directly calculating the distance D(n, m) of the two


strings of interest, the strategy is to determine the distance of all the prefixes of the two strings (reduction of the problem into subproblems), from which to derive, by induction, the distance for the entire length of the strings.

Algorithm 20 Pseudo-code of the edit distance algorithm
1: Input: String x, n length of x, String y and m length of y
2: Output: Matrix of edit distances D(n, m)
3: D(i, j) ← 0
4: for i ← 1 to n do
5:    D(i, 0) ← i
6: end for
7: for j ← 1 to m do
8:    D(0, j) ← j
9: end for
10: for i ← 1 to n do
11:    for j ← 1 to m do
12:       if x[i] = y[j] then
13:          c ← 0
14:       else
15:          c ← 1
16:       end if
17:       D(i, j) ← min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c}    (deletion of x_i, insertion of y_j, subst/no-subst of x_i with y_j)
18:    end for
19: end for
20: return D(n, m)

1.19 String Recognition Methods

Fig. 1.70 Calculation of the edit distance to transform the string x = "Frainzisk" into y = "Francesca". a Construction of the matrix D starting from the base values given by the (1.335) and then iterating with the dynamic programming method to calculate the other elements of D using Algorithm 20. b The complete matrix D, calculated by scanning the matrix from left to right and from the first row to the last. The edit distance between the two strings is equal to D(9, 9) = 5, and the required edit operations are 1 deletion, 3 substitutions, and 1 insertion

to the null string ε. The element D(i, j) is the edit distance between the prefixes x(1..i) and y(1..j). The value D(i, j) is calculated by induction based on the last characters of the two prefixes. If these characters are equal, D(i, j) is equal to the edit distance between the two prefixes shorter by 1 character (x(1..i − 1) and y(1..j − 1)), that is, D(i, j) = 0 + D(i − 1, j − 1). If the last two characters are not equal, D(i, j) is one unit greater than the minimum of the edit distances relative to the 3 shorter prefixes (the adjacent elements: upper, left, and upper left), that is, D(i, j) = 1 + min{D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)}. It follows, as shown in the figure, that for each element of D the calculation of the edit distance depends only on the values previously calculated for the prefixes shorter by 1 character. The complete matrix is obtained by iterating the calculation, for each element of D, operating from left to right and from the first row to the last, thus obtaining the edit distance in the last element D(n, m). Figure 1.70b shows the complete matrix D for the calculation of the edit distance to transform the string x = "Frainzisk" into y = "Francesca". The path is also reported (indicating the type of edit operation performed) which leads to the final result D(9, 9) = 5, i.e., to the minimum number of required edit operations. Returning to the calculation time, the algorithm reported requires a computational load of O(nm), while in space complexity it requires O(n) (the space complexity is O(nm) if the whole matrix is kept for a trace-back to find an optimal alignment). Ad hoc algorithms have been developed in the literature that reduce the computational complexity down to O(n + m).
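A Python sketch of Algorithm 20 with unit costs follows (0-based lists are used for the strings, while the matrix D keeps the (n + 1) × (m + 1) layout of Fig. 1.70; the trace-back of the edit operations is not reconstructed here).

def edit_distance(x, y):
    """Levenshtein distance between strings x and y via dynamic programming
    (the matrix D of Algorithm 20, with unit costs)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # base conditions (1.335)
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if x[i - 1] == y[j - 1] else 1     # substitution cost (1.331)
            D[i][j] = min(D[i - 1][j] + 1,           # deletion of x_i
                          D[i][j - 1] + 1,           # insertion of y_j
                          D[i - 1][j - 1] + c)       # (no-)substitution
    return D[n][m]

print(edit_distance("Frainzisk", "Francesca"))   # 5, as in Fig. 1.70
print(edit_distance("dived", "davide"))          # 3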


1.19.4 String Matching with Error

In several applications, where the information may be affected by error or the nature of the information itself evolves, exact pattern matching is not useful. In these cases, it is very important to solve the problem of approximate pattern matching, which consists in finding in the text string an approximate version of the pattern string according to a predefined similarity level. In formal terms, approximate pattern matching is defined as follows. Given a text string T of length n and a pattern x of length m, with m ≤ n, the problem is to find the k-approximate occurrences of the pattern string x in the text T, with at most k (0 ≤ k ≤ m) different characters (or errors). A simple version of an approximate matching algorithm, Algorithm 21 shown below, is obtained with a modification of the exact matching algorithm presented in Algorithm 15.

Algorithm 21 Pseudo-code of a simple approximate string-matching algorithm
1: Input: Text T, pattern x, n length of T, m length of x and number of different characters k
2: Output: Find the positions s of the approximate occurrences of x in T
3: s ← 0
4: for s ← 1 to n − m + 1 do
5:    count ← 0
6:    for j ← 1 to m do
7:       if x[j] ≠ T[s + j − 1] then
8:          count ← count + 1
9:       end if
10:    end for
11:    if count ≤ k then
12:       print "approximate pattern in position", s
13:    end if
14: end for

The first for-loop (statement at line 4) iterates through the text one character at a time, while in the second for-loop (statement at line 6) the number of different characters found between the pattern x and the text substring T[s : s + m − 1] is accumulated in the variable count, reporting the positions s in T of the approximate patterns found according to the required k differences. Setting k = 0 in Algorithm 21, we would return to the exact matching algorithm. We recall the computational inefficiency of this algorithm, equal to O(nm). An efficient algorithm of approximate string matching, based on the edit distance, is reported in Algorithm 22.
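A direct Python transcription of Algorithm 21 follows (0-based indices internally, 1-based positions reported, to mirror the pseudo-code).

def approx_match_count(text, pattern, k):
    """Positions (1-based shifts) at which pattern matches text with at most
    k mismatching characters, as in Algorithm 21."""
    n, m = len(text), len(pattern)
    hits = []
    for s in range(n - m + 1):                    # s is the 0-based shift
        mismatches = sum(1 for j in range(m) if pattern[j] != text[s + j])
        if mismatches <= k:
            hits.append(s + 1)                    # report 1-based position
    return hits

print(approx_match_count("CCTATAGCTGATC", "CTAG", 1))   # [2, 4]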


Algorithm 22 Pseudo-code of an approximate string-matching algorithm based on edit distance
1: Input: pattern x, m length of x, Text T, n length of T and number of different characters k
2: Output: Matrix D(m, n) where, if D(m, j) ≤ k, we have that x occurs at the position j in T
3: D(i, j) ← 0
4: for i ← 1 to m do
5:    D(i, 0) ← i
6: end for
7: for j ← 1 to n do
8:    D(0, j) ← 0
9: end for
10: for i ← 1 to m do
11:    for j ← 1 to n do
12:       if x[i] = T[j] then
13:          c ← 0
14:       else
15:          c ← 1
16:       end if
17:       D(i, j) ← min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c}
18:    end for
19: end for
20: for j ← 1 to n do
21:    if D(m, j) ≤ k then
22:       print "approximate pattern in position", j
23:    end if
24: end for
25: return D(m, n)

This algorithm differs from the basic edit distance computation essentially in that the row D(0, j), j = 1, ..., n is set to zero instead of being assigned the value j (line 8 of Algorithm 22). In this way, the matrix D indicates that a null prefix of the pattern x corresponds to a null occurrence in the text T, which does not involve any cost.


The modified edit matrix D of Fig. 1.71, with the pattern x = "CTAG" (prefixes ε, C, T, A, G) on the rows and the text T = "CCTATAGCTGATC" on the columns, is the following:

      ε  C  C  T  A  T  A  G  C  T  G  A  T  C
  ε   0  0  0  0  0  0  0  0  0  0  0  0  0  0
  C   1  0  0  1  1  1  1  1  0  1  1  1  1  0
  T   2  1  1  0  1  1  2  2  1  0  1  2  1  1
  A   3  2  2  1  0  1  1  2  2  1  1  1  2  2
  G   4  3  3  2  1  1  2  1  2  2  1  2  2  3

The 1-approximate occurrences are read in the last row, where D(4, j) ≤ 1, i.e., at positions j = 4, 5, 7, and 10.

Fig. 1.71 Detection, using the algorithm Algorithm 22, of the 1-approximate occurrences of the pattern x = “CTAG” in the text T = “CCTATAGCTGATC”. The positions of the approximate occurrences in T are in the line D(m, ∗) of the modified edit matrix, where the value of the edit distance is minimum, i.e., D(4, j) ≤ 1. In the example, the occurrences in T are 4 in the positions for j = 4, 5, 7 and 10

Each element D(i, j) of the matrix contains the minimum value k for which there exists an approximate occurrence (with at most k different characters) of the prefix x[1 : i] in T. It follows that the k-approximate occurrences of the entire pattern x are found in T at the positions indicated in the last row of the matrix, D(m, ∗) (line 22 of Algorithm 22). In fact, each element D(m, j), j = 1, ..., n reports the number of different characters (that is, the number of edit operations required to transform x into its occurrence in T) between the pattern and a corresponding substring of the text ending at position j. The positions j of the occurrences of x in T are therefore found where D(m, j) ≤ k. Figure 1.71 shows the modified edit matrix D calculated with Algorithm 22 to find the occurrences with k = 1 of the pattern x = "CTAG" in the text T = "CCTATAGCTGATC". The 1-approximate pattern x, as shown in the figure, occurs in the text T at positions j = 4, 5, 7 and 10, i.e., where D(4, j) ≤ k. In the literature [51], there are several other approaches, based on different methods of calculating the distance between strings (Hamming, Episode, ...) and on dynamic programming, also with the aim of reducing the computational complexity in time and space.
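The same computation can be written compactly in Python (a sketch with our own naming, mirroring Algorithm 22): the row D[0][j] is left at zero, so each entry of the last row gives the minimum number of edit operations needed to match the whole pattern against a substring of the text ending at that position.

def approximate_occurrences(x, T, k):
    m, n = len(x), len(T)
    # D[i][j]: minimum edits between x[0:i] and some substring of T ending at j
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # against the empty text, i edits are needed
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if x[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + c)    # match/substitution
    return [j for j in range(1, n + 1) if D[m][j] <= k]

print(approximate_occurrences("CTAG", "CCTATAGCTGATC", 1))   # prints [4, 5, 7, 10]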

1.19.5 String Matching with Special Symbol In several string-matching applications, it is useful to define a special "don't care" character (denoted here by ∗), which can appear both in the pattern string x and in the text T and has the meaning of matching any other character of the pattern or text. For example, in the search for the occurrences of the pattern x = 'CDD∗AT∗G' in the text T = 'C∗D∗ATTGATTG∗G...', the special character ∗ is not considered in the comparison and the first occurrence of the pattern is aligned at the beginning of the text. The exact string-matching algorithms, described in the previous paragraphs, can be modified to include the management of the special character, at the cost of a high degradation of computational complexity.


Ad hoc solutions have been developed in the literature [52] for the optimal comparison between strings in the presence of the special character.
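A naive sketch of this comparison rule in Python (the wildcard symbol and the names are ours; the deterministic algorithms of [52] are considerably more efficient):

WILDCARD = "*"    # stands for the special "don't care" character

def wildcard_occurrences(x, T):
    # Positions s (1-based) where x matches T, with WILDCARD in either string
    # matching any character; naive O(nm) scan.
    m, n = len(x), len(T)
    matches = []
    for s in range(n - m + 1):
        if all(x[j] == T[s + j] or WILDCARD in (x[j], T[s + j]) for j in range(m)):
            matches.append(s + 1)
    return matches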

References 1. R.B. Cattell, The description of personality: basic traits resolved into clusters. J. Abnorm. Soc. Psychol. 38, 476–506 (1943) 2. R.C. Tryon, Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality (Edward Brothers Inc., Ann Arbor, Michigan, 1939) 3. K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(11), 559–572 (1901) 4. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441 and 498–520 (1933) 5. W. Rudin, Real and Complex Analysis (Mladinska Knjiga McGraw-Hill, 1970). ISBN 0-07054234-1 6. R. Larsen, R.T. Warne, Estimating confidence intervals for eigenvalues in exploratory factor analysis. Behav. Res. Methods 42, 871–876 (2010) 7. M. Friedman, A. Kandel, Introduction to Pattern Recognition: Statistical, Structural, Neural and Fuzzy Logic Approaches (World Scientific Publishing Co Pte Ltd, 1999) 8. R.A. Fisher, The statistical utilization of multiple measurements. Ann Eugen 8, 376–386 (1938) 9. K. Fukunaga, J.M. Mantock, Nonparametric discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 5(6), 671–678 (1983) 10. T. Okada, S. Tomita, An optimal orthonormal system for discriminant analysis. Pattern Recognit. 18, 139–144 (1985) 11. J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-fuzzy and Soft Computing (Prentice Hall, 1997) 12. J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium on Mathematical statistics and Probability, vol. 1, ed. by L.M. LeCam, J. Neyman (University of California Press, 1977), pp. 282–297 13. G.H. Ball, D.J. Hall, Isodata: a method of data analysis and pattern classification. Technical report, Stanford Research Institute, Menlo Park, United States. Office of Naval Research. Information Sciences Branch (1965) 14. J.R. Jensen, Introductory Digital Image Processing: A Remote Sensing Perspective, 2nd edn. (Prentice Hall, Upper Saddle River, NJ, 1996) 15. L.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press, New York, 1981) 16. C.K. Chow, On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theory 16, 41–46 (1970) 17. A.R. Webb, K.D. Copsey, Statistical Pattern Recognition, 3rd edn. (Prentice Hall, Upper Saddle River, NJ, 2011). ISBN 978-0-470-68227-2 18. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd edn. (Wiley, 2001). ISBN 0471056693 19. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edn. (Academic Press Professional, Inc., 1990). ISBN 978-0-470-68227-2 20. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977) 21. W. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943)


22. H. Robbins, S. Monro, A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951) 23. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288 (1996) 24. J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci 79, 2554–2558 (1982) 25. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986) 26. J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA, 1993) 27. L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees (Wadsworth Books, 1984) 28. X. Lim, W.Y. Loh, X. Shih, A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learn. 40, 203–228 (2000) 29. P.E. Utgoff, Incremental induction of decision trees. Mach. Learn. 4, 161–186 (1989) 30. J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle. Inf. Comput. 80, 227–248 (1989) 31. J.R. Quinlan, Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221–234 (1987) 32. L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, 2009) 33. T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large databases, in Proceedings of SIGMOD’96 (1996) 34. S. Guha, R. Rastogi, K. Shim, Rock: a robust clustering algorithm for categorical attributes, in Proceedings in ICDE’99 Sydney, Australia (1999), pp. 512–521 35. G. Karypis, E.-H. Han, V. Kumar, Chameleon: a hierarchical clustering algorithm using dynamic modeling. Computer 32, 68–75 (1999) 36. K.S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ, 1982) 37. N. Chomsky, Three models for the description of language. IRE Trans. Inf. Theory 2, 113–124 (1956) 38. H. J. Zimmermann, B.R. Gaines, L.A. Zadeh, Fuzzy Sets and Decision Analysis (North Holland, Amsterdam, New York, 1984). ISBN 0444865934 39. Donald Ervin Knuth, On the translation of languages from left to right. Inf. Control 8(6), 607–639 (1965) 40. D. Marcus, Graph Theory: A Problem Oriented Approach, 1st edn. (The Mathematical Association of America, 2008). ISBN 0883857537 41. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160 42. A. Barrero, Three models for the description of language. Pattern Recognit. 24(1), 1–8 (1991) 43. R.E. Woods, R.C. Gonzalez, Digital Image Processing, 2nd edn. (Prentice Hall, 2002). ISBN 0201180758 44. P.H. Winston, Artificial Intelligence (Addison-Wesley, 1984). ISBN 0201082594 45. D.E. Knuth, J.H. Morris, V.B. Pratt, Fast pattern matching in strings. SIAM J. Comput. 6(1), 323–350 (1977) 46. R.S. Boyer, J.S. Moore, A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977) 47. Hume and Sunday, Fast string searching. Softw. Pract. Exp. 21(11), 1221–1248 (1991) 48. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press and McGraw-Hill, 2001). ISBN 0-262-03293-7 49. R. Nigel Horspool, Practical fast searching in strings. Softw. Pract. Exp., 10(6), 501–506 (1980) 50. D.M. Sunday, A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990) 51. N. Gonzalo, A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001) 52. P. Clifford, R. Clifford, Simple deterministic wildcard matching. Inf. Process. Lett. 101(2), 53–54 (2007)

2 RBF, SOM, Hopfield, and Deep Neural Networks

2.1 Introduction We begin by describing three different neural network architectures: Radial Basis Function (RBF) networks, Self-Organizing Maps (SOM), and the Hopfield network. The Hopfield network has the ability to memorize information and to recover it from partial contents of the original information. As we shall see, its originality lies in physical foundations that have revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution through a succession of states, until it reaches a final state corresponding to the minimum of the energy function. This characteristic allows it to be used to set up and solve an optimization problem, by associating the objective function with an energy function. The SOM network instead follows an unsupervised learning model and has the originality of autonomously grouping input data on the basis of their similarity, without evaluating the convergence error against external information on the data. It is useful when we have no exact knowledge of the data needed to classify them. It is inspired by the topology of the cortex of the brain, considering the connectivity of neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, which reinforce their bonds while more distant ones become weaker. Extensions of the SOM lead to supervised versions, as in the Learning Vector Quantization variants SOM-LVQ1, LVQ2, etc., which essentially serve to label the classes and refine the decision boundaries. The RBF network uses the same neuron model as the MLP but differs in the architectural simplification of the network and in the activation function (based on radial basis functions), which implements the Cover theorem. In fact, the RBF provides only one hidden layer and the output layer consists of only one neuron. The MLP network is more vulnerable in the presence of noise on the data, while the RBF is more robust to noise thanks to the radial basis functions and to the linear nature of the combination of the outputs of the previous layer (the MLP instead uses a nonlinear activation function).


The design of a supervised neural network can be done in a variety of ways. The backpropagation algorithm for a multilayer (supervised) network, introduced in the previous chapter, can be seen as the application of a recursive technique known in statistics as stochastic approximation. The RBF takes a different approach, treating the design of a neural network as a "curve fitting" problem, that is, the resolution of an approximation problem in a very high-dimensional space: learning reduces to finding the surface, in a multidimensional space, that provides the best fit for the training data, where the best fit is measured statistically. Similarly, the generalization phase is equivalent to using this multidimensional surface, found with the training data, to interpolate test data never seen before by the network. The network is structured on three levels: input, hidden, and output. The input layer is directly connected with the environment, that is, with the sensory units (raw data) or with the output of a feature extraction subsystem. The hidden layer (unique in the network) is composed of neurons in which radial basis functions are defined, hence the name, and it performs a nonlinear transformation of the input data supplied to the network. These neurons form a basis for the input data (vectors). The output layer is linear and provides the network response for the presented input pattern. The reason for using a nonlinear transformation in the hidden layer followed by a linear one in the output layer is described in an article by Cover (1965), according to which a pattern classification problem cast in a much larger space (i.e., via the nonlinear transformation from the input layer to the hidden one) is more likely to be linearly separable than in a lower-dimensional space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is much greater than the cardinality of the input signal).

2.2 Cover Theorem on Pattern Separability A complex problem of automatic pattern recognition is solved, with a radial basis function neural network, by transforming the problem space nonlinearly into one of larger dimension. Cover's theorem on the separability of patterns is stated as follows: Theorem 1 (Cover theorem) A complex pattern recognition problem transformed nonlinearly into a larger space is more likely to be linearly separable than in a lower-dimensionality space, provided that the space is not densely populated. Let C be a set of N patterns (vectors) x1, x2, ..., xN, to each of which one of the two classes C1 or C2 is assigned. This binary partition is separable if there is a surface that separates the points of class C1 from those of class C2. Suppose that a generic vector x ∈ C is m0-dimensional, and that we define a set of real functions


{ϕi(x) | i = 1, ..., m1} through which the m0-dimensional input space is transformed into a new m1-dimensional space, as follows:

ϕ(x) = [ϕ1(x), ϕ2(x), ..., ϕm1(x)]^T    (2.1)

The function ϕ, therefore, performs the nonlinear transformation from a space to a larger one (m1 > m0). This function refers to the neurons of the hidden layer of the RBF network. A binary partition (dichotomy) [C1, C2] of C is said to be ϕ-separable if there exists an m1-dimensional vector w such that we can write:

w^T ϕ(x) > 0  if x ∈ C1    (2.2)
w^T ϕ(x) < 0  if x ∈ C2    (2.3)

where the equation

w^T ϕ(x) = 0    (2.4)

indicates the separating hyperplane, that is, it describes the surface of separation between the two classes in the space ϕ (or hidden space). The surfaces of separation between populations of objects can be hyperplanes (first order), quadrics (second order), or hyperspheres (quadrics with some linear constraints on the coefficients). In Fig. 2.1 the three types of separability are shown. In general, linear separability implies spherical separability which, in turn, implies quadratic separability. The reverse is not necessarily true. Two key points of Cover's theorem can be summarized as follows:

1. Nonlinear formulation of the functions of the hidden layer, defined by ϕi(x), with x the input vector and i = 1, ..., m1 the cardinality of the layer.
2. A dimensionality of the hidden space greater than that of the input space, determined, as we have seen, by the value of m1 (i.e., the number of neurons in the hidden layer).

Fig. 2.1 Examples of binary partition in space for different sets of five points in a 2D space: (a) linearly separable, (b) spherically separable, (c) quadrically separable



Table 2.1 Nonlinear transformation of the two-dimensional input patterns x

Input pattern x    First hidden function ϕ1    Second hidden function ϕ2
(1, 1)             1                           0.1353
(0, 1)             0.3678                      0.3678
(0, 0)             0.1353                      1
(1, 0)             0.3678                      0.3678

It should be noted that in some cases it may be sufficient to satisfy only point 1, i.e., the nonlinear transformation without increasing the dimensionality by adding neurons to the hidden layer (point 2), in order to obtain linear separability. The XOR example shows this last observation. Consider 4 points in a 2D space: (0, 0), (1, 0), (0, 1), and (1, 1), on which we construct an RBF neural network that solves the XOR function. It has been observed above that the single perceptron is not able to represent this type of function, the problem being not linearly separable. Let us see how, by using the Cover theorem, it is possible to obtain linear separability following a nonlinear transformation of the four points. We define two Gaussian transformation functions as follows:

ϕ1(x) = exp(−‖x − t1‖²),  t1 = [1, 1]^T
ϕ2(x) = exp(−‖x − t2‖²),  t2 = [0, 0]^T

Table 2.1 shows the values of the nonlinear transformation of the four points considered, and Fig. 2.2 their representation in the space ϕ. We can observe how they become linearly separable after the nonlinear transformation with the help of the Gaussian functions defined above.
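The values of Table 2.1 can be reproduced with a few lines of Python/NumPy (a sketch; the names are ours):

import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # centers of the two hidden functions

def phi(x, t):
    # Gaussian hidden function exp(-||x - t||^2)
    return np.exp(-np.sum((np.asarray(x, dtype=float) - t) ** 2))

for x in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    print(x, round(phi(x, t1), 4), round(phi(x, t2), 4))
# e^0 = 1, e^-1 ≈ 0.368, e^-2 ≈ 0.135: the same values as Table 2.1, up to rounding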

2.3 The Problem of Interpolation Cover's theorem shows that there is a certain benefit in operating a nonlinear transformation from the input space into a new, larger-dimensional one in order to obtain separable patterns for a pattern recognition problem. Mainly, a nonlinear mapping is used to transform a nonlinearly separable classification problem into a linearly separable one. Similarly, nonlinear mapping can be used to transform a nonlinear filtering problem into one that involves linear filtering. For simplicity, consider a feedforward network with an input layer, a hidden layer, and an output layer, with the latter consisting of only one neuron. The network, in this case, operates a nonlinear transformation from the input layer into the hidden one, followed by a linear one from the hidden layer into the output one.


Fig. 2.2 Representation of the nonlinear transformation of the four points for the XOR problem that become linearly separable in the space ϕ

If m0 indicates, as usual, the dimensionality of the input layer (therefore, m0 neurons in the input layer), the network transforms an m0-dimensional space into a one-dimensional space, since in output we have only one neuron, so the mapping function is expressed as

s : ℝ^{m0} → ℝ^1    (2.5)

We can think of the mapping function s as a hypersurface Γ ⊂ ℝ^{m0+1}, analogously to the elementary mapping function s : ℝ^1 → ℝ^1 with s(x) = x², a parabola in the space ℝ². The surface Γ is, therefore, a multidimensional plot of the network output as a function of the input. Generally, the surface Γ is unknown, and the training data are contaminated by noise. The two important phases of a classifier, training and test (or generalization), can be seen as follows: (a) the training phase is the optimization of a fitting procedure for the surface Γ, starting from known examples (i.e., training data) that are presented to the network as input–output pairs (patterns); (b) the generalization phase is synonymous with interpolation between the data, with the interpolation performed along the surface obtained by the training procedure, using optimization techniques that allow the obtained surface Γ to be close to the real one. With these premises, we are in the presence of a multivariate interpolation problem in large spaces.

Formalization of the interpolation problem


Given a set of N different points {xi ∈ ℝ^{m0} | i = 1, ..., N} and a corresponding set of N real numbers {di ∈ ℝ^1 | i = 1, ..., N}, find a function F : ℝ^{m0} → ℝ^1 which satisfies the following interpolation condition:

F(xi) = di,  i = 1, 2, ..., N    (2.6)

For exact interpolation, the interpolating surface (therefore, the function F) passes through all the points of the training data. The RBF technique consists of choosing a function F with the following form:

F(x) = Σ_{i=1}^{N} wi ϕ(‖x − xi‖)    (2.7)

where {ϕ(‖x − xi‖) | i = 1, 2, ..., N} is a set of N generally nonlinear functions, called radial basis functions, and ‖·‖ denotes the Euclidean norm. The known points of the dataset xi ∈ ℝ^{m0}, i = 1, 2, ..., N are the centers of the radial functions. By inserting the interpolation condition (2.6) in (2.7), we obtain the following set of linear equations for the expansion coefficients (or weights) {wi}:

[ ϕ11  ϕ12  ···  ϕ1N ] [ w1 ]   [ d1 ]
[ ϕ21  ϕ22  ···  ϕ2N ] [ w2 ] = [ d2 ]    (2.8)
[  ⋮    ⋮    ⋱    ⋮  ] [  ⋮ ]   [  ⋮ ]
[ ϕN1  ϕN2  ···  ϕNN ] [ wN ]   [ dN ]

with

ϕji = ϕ(‖xj − xi‖),  (j, i) = 1, 2, ..., N    (2.9)

Let d = [d1, d2, ..., dN]^T and w = [w1, w2, ..., wN]^T be the vectors of size N × 1 representing, respectively, the desired responses and the weights, with N the size of the training sample. Let Φ be the N × N matrix of elements ϕji,

Φ = {ϕji | (j, i) = 1, 2, ..., N}    (2.10)

which will be called the interpolation matrix. Rewriting Eq. (2.8) in compact form we get

Φ w = d    (2.11)


Assuming that Φ is a non-singular matrix, its inverse Φ^{-1} exists, and therefore, the solution of Eq. (2.11) for the weights w is given by

w = Φ^{-1} d    (2.12)
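A minimal sketch of this exact interpolation in Python/NumPy (our own code, assuming a Gaussian radial function of fixed width): the interpolation matrix Φ is built on the training points themselves and the weights are obtained by solving Φw = d.

import numpy as np

def rbf_interpolate(X, d, sigma=1.0):
    # X: (N, m0) training inputs, d: (N,) desired responses.
    # One Gaussian is centered on every training point (exact interpolation).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-sq_dists / (2 * sigma ** 2))     # phi_ji = phi(||x_j - x_i||)
    w = np.linalg.solve(Phi, d)                    # w = Phi^{-1} d, Eq. (2.12)
    def F(x):                                      # interpolating function, Eq. (2.7)
        r2 = np.sum((X - np.asarray(x)) ** 2, axis=-1)
        return np.exp(-r2 / (2 * sigma ** 2)) @ w
    return F

By construction, F(xi) = di on every training point, which is precisely the interpolation condition (2.6).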

Now, how do we make sure that the interpolation matrix Φ is non-singular? The following theorem gives us an answer in this regard.

2.4 Micchelli's Theorem Theorem 2 (Micchelli's theorem) Let {xi}, i = 1, ..., N, be a set of distinct points of ℝ^{m0}. Then the N × N interpolation matrix Φ, whose elements are ϕji = ϕ(‖xj − xi‖), is non-singular.

There is a vast class of radial basis functions that satisfy Micchelli's theorem; in particular:

Gaussian

ϕ(r) = exp(−r² / (2σ²)),  with σ > 0 and r ∈ ℝ    (2.13)

Multiquadrics

ϕ(r) = √((r² + σ²) / σ²)    (2.14)

Inverse multiquadrics

ϕ(r) = √(σ² / (r² + σ²))    (2.15)

Cauchy

ϕ(r) = σ² / (r² + σ²)    (2.16)

The four functions described above are depicted in Fig. 2.3. For the radial functions defined in Eqs. (2.13)–(2.16) to yield a non-singular interpolation matrix, all the points of the dataset {xi}, i = 1, ..., N, must necessarily be distinct from one another, regardless of the sample size N and of the cardinality m0 of the input vectors xi. The inverse multiquadrics (2.15), the Cauchy functions (2.16), and the Gaussian functions (2.13) share the same property, that is, they are localized functions in the sense that ϕ(r) → 0 for r → ∞; in these cases the matrix Φ is positive definite.

Fig. 2.3 Radial-based functions that satisfy Micchelli's theorem: Gaussian, multiquadric, inverse multiquadric, and Cauchy

In contrast, the family of multiquadric functions defined in (2.14) is nonlocal, because ϕ(r) grows without bound for r → ∞, and the corresponding interpolation matrix Φ has N − 1 negative eigenvalues and only one positive eigenvalue, with the consequence of not being positive definite. It can nevertheless be established that an interpolation matrix Φ based on multiquadric functions (introduced by Hardy [1]) is non-singular, and therefore, suitable for designing an RBF network. Furthermore, it can be remarked that radial basis functions that grow to infinity, such as multiquadrics, can be used to approximate a smooth input–output mapping with great accuracy, compared with those that make the interpolation matrix Φ positive definite (this result can be found in Powell [2]).

2.5 Learning and Ill-Posed Problems The interpolation procedure described so far may not give good results when the network has to generalize, i.e., respond to examples never seen before. This problem arises when the number of training samples is far greater than the degrees of freedom of the physical process we want to model, in which case we are bound to have as many radial functions as training data, resulting in an oversized problem. In this case, the network attempts to approximate the mapping function as closely as possible, responding precisely when a data item was seen during the training phase but failing when presented with one never seen before. The result is that the network generalizes little, giving rise to the problem of overfitting. In general, learning means finding the hypersurface (for multidimensional problems) that allows the network to respond (generate an


output) to the input provided. This mapping is defined by the hypersurface equation found in the learning phase. So learning can be seen as a hypersurface reconstruction problem, given a set of examples that can be scattered. There are two types of problems that are generally encountered: ill-posed problems and well-posed problems. Let us see what they consist of. Suppose we have a domain X and a set Y of some metric space, which are related to each other by an unknown functional f which is the objective of learning. The problem of reconstructing the mapping function f is said to be well-posed if it satisfies the following three conditions:

1. Existence. ∀ x ∈ X ∃ y = f(x) such that y ∈ Y.
2. Uniqueness. ∀ x, t ∈ X: f(x) = f(t) if and only if x = t.
3. Continuity. The mapping function f is continuous, that is, ∀ ε > 0 ∃ δ = δ(ε) such that ρX(x, t) < δ ⇒ ρY(f(x), f(t)) < ε, where ρ(·, ·) denotes the distance between its two arguments in the respective spaces. The continuity property is also referred to as the property of stability.

If any of these conditions is not met, the problem is said to be ill-posed. In ill-posed problems, very large datasets of examples may contain little information on the problem to be solved. The physical phenomena responsible for generating the training data (for example, speech, radar signals, sonar signals, images, etc.) are well-posed problems. However, learning from these forms of physical signals, seen as a reconstruction of hypersurfaces, is an ill-posed problem for the following reasons. The criterion of existence can be violated when a distinct output does not exist for each input. There may not be enough information in the training dataset to univocally reconstruct the input–output mapping function; therefore, the uniqueness criterion could be violated. The noise or inaccuracies present in the training data add uncertainty to the input–output mapping surface. This last problem violates the criterion of continuity, since if there is a lot of noise in the data, it is likely that the desired output y falls outside the range Y for a specified input vector x ∈ X. Paraphrasing Lanczos [3], we can say that there is no mathematical artifice to remedy the missing information in the training data. An important result on how to turn an ill-posed problem into a well-posed one is derived from the theory of regularization.

2.6 Regularization Theory Regularization theory was introduced by Tikhonov in 1963 for the solution of ill-posed problems. The basic idea is to stabilize the hypersurface reconstruction solution by introducing a nonnegative functional that integrates a priori information about the solution. The most


common form of a priori information involves the assumption that the input–output mapping function (i.e., the solution of the reconstruction problem) is smooth, in the sense that similar inputs correspond to similar outputs. Let the input and output datasets (which represent the training set) be described as follows:

Input signal: xi ∈ ℝ^{m0}, i = 1, 2, ..., N
Desired response: di ∈ ℝ^1, i = 1, 2, ..., N    (2.17)

The fact that the output is one-dimensional does not affect generality in the extension to multidimensional output cases. Let F(x) be the mapping function to look for (the weight variable w has been dropped from the arguments of F); the Tikhonov regularization theory includes two terms:

1. Standard Error. Denoted by ξs(F), it measures the error (distance) between the desired response (target) di and the current network response yi over the training samples i = 1, 2, ..., N:

ξs(F) = (1/2) Σ_{i=1}^{N} (di − yi)²    (2.18)
      = (1/2) Σ_{i=1}^{N} (di − F(xi))²    (2.19)

where 1/2 represents a scale factor.

2. Regularization. The second term, denoted by ξc(F), which depends on the geometric properties of the approximating function F(x), is defined by

ξc(F) = (1/2) ‖DF‖²    (2.20)

where D is a linear differential operator. The a priori information on the input–output mapping function is embedded in D, which makes the selection of D problem-dependent. The operator D also acts as a stabilizer, since it makes the solution smooth, satisfying the condition of continuity. So the quantity that must be minimized in the regularization theory is the following:

ξ(F) = ξs(F) + λ ξc(F) = (1/2) Σ_{i=1}^{N} [di − F(xi)]² + (λ/2) ‖DF‖²

where λ is a positive real number called the regularization parameter, and ξ(F) is called a Tikhonov functional. Indicating with Fλ(x) the surface that minimizes the functional ξ(F), we can see the regularization parameter λ as a sufficiency indicator of


the training set that specifies the solution Fλ(x). In particular, in the limiting case λ → 0 the problem is unconstrained and the solution Fλ(x) is completely determined by the training examples. The other limiting case, λ → ∞, implies that the continuity constraint introduced by the smoothing operator D is by itself sufficient to specify the solution Fλ(x), which is another way of saying that the examples are unreliable. In practical applications, the parameter λ is assigned a value between the two boundary conditions, so that both the training examples and the a priori information contribute together to the solution Fλ(x). After a series of steps we arrive at the following formulation of the solution of the regularization problem:

Fλ(x) = (1/λ) Σ_{i=1}^{N} [di − F(xi)] G(x, xi)    (2.21)

where G(x, xi) is called the Green function, which we will see later is one of the radial basis functions. Equation (2.21) establishes that the minimum solution Fλ(x) to the regularization problem is the superposition of N Green functions. The sample vectors xi represent the expansion centers, and the weights [di − F(xi)]/λ represent the expansion coefficients. In other words, the solution to the regularization problem lies in an N-dimensional subspace of the space of smoothing functions, and the set of Green functions {G(x, xi)} centered in xi, i = 1, 2, ..., N forms a basis for this subspace. Note that the expansion coefficients in (2.21) are linear in the estimation error, defined as the difference between the desired response di and the corresponding output of the network F(xi), and inversely proportional to the regularization parameter λ. Let us now calculate the expansion coefficients, which are not known, defined by

wi = (1/λ) [di − F(xi)],  i = 1, 2, ..., N    (2.22)

We rewrite (2.21) as follows:

Fλ(x) = Σ_{i=1}^{N} wi G(x, xi)    (2.23)

and evaluating (2.23) at xj for j = 1, 2, ..., N we get

Fλ(xj) = Σ_{i=1}^{N} wi G(xj, xi),  j = 1, 2, ..., N    (2.24)

We now introduce the following definitions:

Fλ = [Fλ(x1), Fλ(x2), ..., Fλ(xN)]^T    (2.25)

d = [d1, d2, ..., dN]^T    (2.26)

    [ G(x1, x1)  G(x1, x2)  ···  G(x1, xN) ]
G = [ G(x2, x1)  G(x2, x2)  ···  G(x2, xN) ]    (2.27)
    [     ⋮          ⋮        ⋱      ⋮     ]
    [ G(xN, x1)  G(xN, x2)  ···  G(xN, xN) ]

w = [w1, w2, ..., wN]^T    (2.28)

We can then rewrite (2.22) and (2.25) in matrix form as follows:

w = (1/λ)(d − Fλ)    (2.29)

and

Fλ = G w    (2.30)

Eliminating Fλ from (2.29) and (2.30) we get

(G + λI) w = d    (2.31)

where I is the N × N identity matrix. The matrix G is called the Green matrix. Green's functions are symmetric (for the classes of functions seen above), namely

G(xi, xj) = G(xj, xi)  ∀ i, j    (2.32)

and therefore, the Green matrix is also symmetric, i.e.,

G^T = G    (2.33)

and positive definite provided that all the points of the sample are distinct from one another. We can choose a regularization parameter λ large enough to ensure that (G + λI) is positive definite and, therefore, invertible. This implies that the system of linear equations defined in (2.31) has one and only one solution, given by

w = (G + λI)^{-1} d    (2.34)

This equation allows us to obtain the vector of weights w once the Green function G(xj, xi), i, j = 1, 2, ..., N, the desired response vector d, and an appropriate value of the regularization parameter λ have been identified. In conclusion, it can be established that a solution to the regularization problem is provided by the following expansion:

Fλ(x) = Σ_{i=1}^{N} wi G(x, xi)    (2.35)

This equation establishes the following considerations: (a) the approach based on the regularization theory is equivalent to the expansion of the solution in terms of Green functions, characterized only by the form of the stabilizer D and by the associated boundary conditions; (b) the number of Green functions used in the expansion is equal to the number of examples used in the training process. The form of the Green function G(x, xi) for a specific center xi depends only on the stabilizer D, i.e., on the a priori information about the input–output mapping. If this stabilizer is translation invariant, then the Green function centered in xi depends only on the difference between its two arguments:

G(x, xi) = G(x − xi)    (2.36)

If, instead, the stabilizer must be invariant both to translation and to rotation, then the Green function depends on the Euclidean distance between its two arguments, namely

G(x, xi) = G(‖x − xi‖)    (2.37)

Under these conditions, Green's functions must be radial basis functions. The solution (2.24) can then be rewritten as follows:

Fλ(x) = Σ_{i=1}^{N} wi G(‖x − xi‖)    (2.38)

Therefore, the solution is entirely determined by the N training vectors that help to construct the interpolating surface F(x).
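In code, the only change with respect to exact interpolation is the regularized linear system (2.31): a brief Python/NumPy sketch (our own, with a Gaussian Green function assumed) is the following.

import numpy as np

def regularized_rbf(X, d, sigma=1.0, lam=0.1):
    # w = (G + lambda*I)^{-1} d, Eq. (2.34); G is the Green (Gram) matrix of Eq. (2.27)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-sq / (2 * sigma ** 2))
    w = np.linalg.solve(G + lam * np.eye(len(X)), d)
    def F(x):                                    # regularized solution, Eqs. (2.35)/(2.38)
        r2 = np.sum((X - np.asarray(x)) ** 2, axis=-1)
        return np.exp(-r2 / (2 * sigma ** 2)) @ w
    return F

For lam → 0 the solution tends to exact interpolation of the examples; larger values of lam trade fidelity to the (possibly noisy) examples for smoothness.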

2.7 RBF Network Architecture As previously mentioned, the network is composed of three layers: input, hidden, and output, as shown in Fig. 2.4. The first layer (input) is made up of m0 nodes, representing the size of the input vector x. The second layer, the hidden one, consists of m1 nonlinear radial basis functions ϕ, connected to the input layer. In some cases the size m1 = N, in others (especially when the training set is very large) it differs (m1 ≪ N), as we will see later.


Fig. 2.4 RBF network architecture

The output layer consists of a single linear neuron (but it can also be composed of several output neurons), fully connected to the hidden layer. By linear, we mean that the output neuron computes its output value as the weighted sum of the outputs of the neurons of the hidden layer. The weights wi of the output layer represent the unknown variables, which also depend on the Green functions G(‖x − xi‖) and on the regularization parameter λ. The Green functions G(‖x − xi‖) are positive definite for each i, and one of the forms satisfying this property is the Gaussian one:

G(x, xi) = exp(−‖x − xi‖² / (2σi²))    (2.39)

remembering that xi represents the center of the function and σi its width. With the condition that the Green functions are positive definite, the solution produced by the network will be an optimal interpolation in the sense that it minimizes the cost function ξ(F) seen previously. We recall that this cost function indicates how much the solution produced by the network deviates from the true data represented by the training data. Optimality is, therefore, closely related to the search for the minimum of this cost function ξ(F). Fig. 2.4 also shows the bias (a variable independent of the data) applied to the output layer. This is represented by setting one of the linear weights equal to the bias, w0 = b, and treating the associated radial function as a constant equal to +1. Concluding, to solve an RBF network, knowing in advance the input data and the shape of the radial basis functions, the variables to be searched for are the linear weights wi and the centers xi of the radial basis functions.


2.8 RBF Network Solution Let {ϕi(x) | i = 1, 2, ..., m1} be the family of radial functions of the hidden layer, which we assume to be linearly independent. We, therefore, define

ϕi(x) = G(‖x − ti‖),  i = 1, 2, ..., m1    (2.40)

where ti are the centers of the radial functions to be determined. In the case in which the training data are few or computationally tractable in number, these centers coincide with the training data, that is, ti = xi for i = 1, 2, ..., N. Therefore, the new interpolating solution F* is given by the following equation:

F*(x) = Σ_{i=1}^{m1} wi G(x, ti) = Σ_{i=1}^{m1} wi G(‖x − ti‖)

which defines the new interpolating function, with the new weights {wi | i = 1, 2, ..., m1} to be determined in order to minimize the new cost function

ξ(F*) = Σ_{i=1}^{N} ( di − Σ_{j=1}^{m1} wj G(‖xi − tj‖) )² + λ‖DF*‖²    (2.41)

The first term on the right-hand side of this equation can be expressed as the squared Euclidean norm ‖d − Gw‖², where

d = [d1, d2, ..., dN]^T    (2.42)

    [ G(x1, t1)  G(x1, t2)  ···  G(x1, tm1) ]
G = [ G(x2, t1)  G(x2, t2)  ···  G(x2, tm1) ]    (2.43)
    [     ⋮          ⋮        ⋱       ⋮     ]
    [ G(xN, t1)  G(xN, t2)  ···  G(xN, tm1) ]

w = [w1, w2, ..., wm1]^T    (2.44)

Now the matrix G of Green's functions is no longer symmetric but of size N × m1, the vector of the desired responses d is, as before, of size N × 1, and the weight vector w is of size m1 × 1. From Eq. (2.24) we note that the approximating function is a linear combination of Green's functions for a certain stabilizer D. Expanding the second term of (2.41), and omitting the intermediate steps, we arrive at the following result:

‖DF*‖² = w^T G0 w    (2.45)

where G0 is a symmetric matrix of size m1 × m1 defined as follows:

     [ G(t1, t1)   G(t1, t2)   ···  G(t1, tm1)  ]
G0 = [ G(t2, t1)   G(t2, t2)   ···  G(t2, tm1)  ]    (2.46)
     [     ⋮           ⋮         ⋱       ⋮      ]
     [ G(tm1, t1)  G(tm1, t2)  ···  G(tm1, tm1) ]

Minimizing (2.41) with respect to the weight vector w, we arrive at the following equation:

(G^T G + λ G0) w = G^T d    (2.47)

For λ → 0 the weight vector w converges to the pseudo-inverse (minimum norm) solution of the least squares fitting problem for m1 < N, so we have

w = G^+ d,  λ = 0    (2.48)

where G^+ represents the pseudo-inverse of the matrix G, that is

G^+ = (G^T G)^{-1} G^T    (2.49)

Equation (2.48) represents the solution to the problem of learning the weights of an RBF network. Let us now see the RBF learning strategies that, starting from a training set, describe different ways to obtain (in addition to the weight vector w) also the centers of the radial basis functions of the hidden layer and their standard deviation.

2.9 Learning Strategies So far the solution of the RBF has been found in terms of the weights between the hidden and output layers, which are closely related to how the activation functions of the hidden layer are configured and eventually evolve over time. There are different approaches for the initialization of the radial basis functions of the hidden layer. In the following, we will show some of them.

2.9.1 Centers Fixed and Randomly Selected The simplest approach is to fix the Gaussian centers (radial basis functions of the hidden layer), chosen randomly from the available training dataset. We can use an isotropic Gaussian function whose standard deviation σ is fixed according to the dispersion of the centers, that is, a normalized version of the radial basis function centered in ti:

G(‖x − ti‖²) = exp(−(m1 / d²max) ‖x − ti‖²),  i = 1, 2, ..., m1    (2.50)

where m1 is the number of centers (i.e., of neurons of the hidden layer) and dmax is the maximum distance between the chosen centers. The standard deviation (width) of the Gaussian radial functions is fixed to

σ = dmax / √(2 m1)    (2.51)

The latter ensures that the chosen radial functions do not reach the two possible extremes, that is, too peaked or too flattened. As an alternative to Eq. (2.51), we can think of using different versions of the radial functions, that is, very large standard deviations for areas where the data are very dispersed and vice versa. This, however, presupposes a preliminary study of the distribution of the data of the available training set. The remaining network parameters to be found are the weights of the connections going from the hidden layer to the output layer, computed with the pseudo-inverse method of G described above in (2.48) and (2.49). The matrix G is defined as follows:

G = {gji}    (2.52)

with

gji = exp(−(m1 / d²max) ‖xj − ti‖²),  j = 1, 2, ..., N; i = 1, 2, ..., m1    (2.53)

with xj the jth vector of the training set. Note that if the samples are reasonably1 few, so as not to affect considerably the computational complexity, one can also fix the centers of the radial functions at the observations of the training data, i.e., ti = xi. The computation of the pseudo-inverse matrix is done by the Singular Value Decomposition (SVD) as follows. Let G be an N × M matrix of real values; then there exist two orthogonal matrices

U = [u1, u2, ..., uN]    (2.54)

and

V = [v1, v2, ..., vM]    (2.55)

such that

U^T G V = diag(σ1, σ2, ..., σK),  K = min(M, N)    (2.56)

1 Always satisfying the theorem of Cover described above, with reasonably we want to indicate a size correlated with the computational complexity of the entire architecture.


where σ1 ≥ σ2 ≥ · · · ≥ σK > 0

(2.57)

The column vectors of the matrix U are called the left singular vectors of G, while those of V are the right singular vectors of G. The values σ1, σ2, ..., σK are simply called the singular values of the matrix G. Thus, according to the SVD theorem, the pseudo-inverse matrix of size M × N of a matrix G is defined as follows:

G^+ = V Σ^+ U^T    (2.58)

where Σ^+ is defined in terms of the singular values of G as follows:

Σ^+ = diag(1/σ1, 1/σ2, ..., 1/σK, 0, ..., 0)    (2.59)

The random selection of the centers shows that this method is insensitive to the use of regularization.
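The following Python/NumPy sketch (our own code) puts this strategy together: m1 centers drawn at random from the training set, Gaussian width fixed by (2.51), and weights computed through the SVD-based pseudo-inverse.

import numpy as np

def train_rbf_random_centers(X, d, m1, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m1, replace=False)]   # random centers, Sect. 2.9.1
    d_max = np.max(np.linalg.norm(centers[:, None] - centers[None, :], axis=-1))
    sigma = d_max / np.sqrt(2 * m1)                            # Eq. (2.51)
    G = np.exp(-np.sum((X[:, None] - centers[None, :]) ** 2, axis=-1) / (2 * sigma ** 2))
    w = np.linalg.pinv(G) @ d          # SVD-based pseudo-inverse, Eqs. (2.58)-(2.59)
    return centers, sigma, w

def rbf_predict(x, centers, sigma, w):
    g = np.exp(-np.sum((centers - np.asarray(x)) ** 2, axis=-1) / (2 * sigma ** 2))
    return g @ w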

2.9.2 Selection of Centers Using Clustering Techniques One of the problems encountered with very large datasets is the impossibility of setting the centers based on the size of the training dataset, whether they are randomly selected or coincide with the training data themselves. Training datasets with millions of examples would involve millions of neurons in the hidden layer, with the consequence of raising the computational complexity of the classifier. To overcome this, one can find a number of centers lower than the cardinality of the training dataset but still descriptive of the probability distribution of the available examples. To do this, clustering techniques such as fuzzy K-means, K-means, or self-organizing maps can be used. Therefore, in a first phase, the number of prototypes to be learned with the clustering technique is set; subsequently, the weights of the RBF network are found with the radial basis functions centered in the prototypes learned in the previous phase.
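A sketch of this two-phase procedure (assuming scikit-learn is available for the clustering step; the function and its parameters are ours):

import numpy as np
from sklearn.cluster import KMeans

def train_rbf_kmeans(X, d, m1, sigma):
    # Phase 1: learn m1 prototypes (centers) with K-means clustering
    centers = KMeans(n_clusters=m1, n_init=10, random_state=0).fit(X).cluster_centers_
    # Phase 2: solve the linear output weights with the same pseudo-inverse as before
    G = np.exp(-np.sum((X[:, None] - centers[None, :]) ** 2, axis=-1) / (2 * sigma ** 2))
    w = np.linalg.pinv(G) @ d
    return centers, w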

2.10 Kohonen Neural Network Known as Self-Organizing Map (SOM) (also called Kohonen map), it is a computational model to visualize and analyze high-dimensional data by projecting (mapping) them into a low-dimensional space (down to 1D), preserving as much as possible the distances between input patterns. The maps of Kohonen [4] establish a topological relationship between multidimensional patterns and a normally 2D grid of neurons (Kohonen layer), preserving the topological information, i.e., similar patterns are mapped to neighboring neurons.


In essence, this layer of neurons adapts during the learning phase, in the sense that the positions of the individual neurons become indicators of the significant statistical characteristics of the input stimuli. This process of spatial adaptation of the input pattern characteristics is also known as feature mapping. The SOMs learn without a priori knowledge, in unsupervised mode, whence the name self-organizing networks: they are able to interact with the data, training themselves without a supervisor. Like all neural networks, SOMs have a neuro-biological motivation, based on the spatial organization of brain functions, as has been observed especially in the cerebral cortex. Kohonen developed the SOM based on the studies of C. von der Malsburg [5] and on the neural field models of Amari [6]. A first emulative feature of the SOM concerns the behavior of the human brain when subjected to an input signal. When a layer of the neural network of the human brain receives an input signal, very close neurons are strongly excited with stronger bonds, while those at an intermediate distance are inhibited, and distant ones are weakly excited. Similarly in the SOM, during learning, the map is partitioned into regions, each of which represents a class of input patterns (principle of topological map formation). Another characteristic of biological neurons, when stimulated by input signals, is that of manifesting a coordinated activity that differentiates them from the other, less excited, neurons. This feature was modeled by Kohonen by restricting the adaptation of weights only to the neurons in the vicinity of the one that will be considered the winner (competitive learning). This last aspect is the essential characteristic of unsupervised systems, in which the output neurons compete with each other before being activated, with the result that only one is activated at any time. The winning neuron is called the winner-takes-all neuron.

2.10.1 Architecture of the SOM Network Similarly to multilayer neural networks, the SOM presents a feed-forward architecture with a layer of input neurons and a single layer of neurons, arranged on a regular 2D grid, which combine the computation and output functions (see Fig. 2.5). The computation–output layer (Kohonen layer) can also be 1D (with a single row or column of neurons); maps of dimensionality higher than 2D are rarely used. Each input neuron is connected to all the neurons of the Kohonen layer. Let x = (x1, x2, ..., xd) be the generic d-dimensional pattern of the N input patterns to present to the network, and let PEj (Processing Element) be the generic neuron of the 2D grid composed of M = Mr × Mc neurons arranged on Mr rows and Mc columns. The input neurons xi, i = 1, ..., d only perform a memory function and are connected to the neurons PEj, j = 1, ..., M through the weight vectors wj, j = 1, ..., M of the same dimensionality d as the input pattern vectors. For a configuration with d input neurons and M PE neurons of the Kohonen layer, there are in total d·M connections.


Fig. 2.5 Kohonen network architecture. The input layer has d neurons that only have a memory function for the input patterns x d -dimensional, while the Kohonen layer has 63 PE neurons (process elements). The window size, centered on the winning neuron, gradually decreases with the iterations and includes the neurons of the lateral interaction

The activation potential yj of the single neuron PEj is given by the inner product between the generic pattern vector x and the weight vector wj:

yj = wj^T x = Σ_{i=1}^{d} wji xi    (2.60)

The initial values of all the connection weights wji, j = 1, ..., M, i = 1, ..., d are assigned randomly with small values. The self-organizing activity of the Kohonen network involves the following phases:

Competition. For each input pattern x, all PE neurons calculate their respective activation potential, which provides the basis of their competition. Once the discriminant evaluation function is defined, only one PE neuron wins the competition. As a discriminant function, (2.60) can be used and the neuron with maximum activation potential yv is chosen as the winning neuron, as follows:

yv = arg max_{j=1,...,M} { yj = Σ_{i=1}^{d} wji xi }    (2.61)

An alternative method uses as discriminant function the minimum Euclidean distance Dv(x) between the vector x and the weight vectors wj to determine the winning neuron, given by

yv = Dv(x) = arg min_{j=1,...,M} { Dj = ‖x − wj‖ = √( Σ_{i=1}^{d} (xi − wji)² ) }    (2.62)


With (2.62), the neuron whose weight vector is closest to the pattern presented as input to the network is selected as the winning neuron. With the inner product it is necessary to normalize the vectors to unit norm (‖x‖ = ‖w‖ = 1); with the Euclidean distance the vectors need not be normalized. The two discriminant functions are equivalent, i.e., the weight vector with minimum Euclidean distance from the input vector is the weight vector that has the maximum inner product with the same input vector. Through the process of competition between the PE neurons, the continuous input space is transformed (mapped) into the discrete output space (Kohonen layer).

Cooperation. This process is inspired by the neuro-biological studies that demonstrate the existence of a lateral interaction, that is, a state of excitation of the neurons close to the winning one. When a neuron is activated, neurons in its vicinity tend to be excited with less and less intensity as their distance from it increases. It can be shown that this lateral interaction between neurons can be modeled with a function with circular symmetry properties, such as the Gaussian and the Laplacian of Gaussian (see Sect. 1.13 Vol. II). The latter can realize a lateral interaction that reinforces the neurons closer to the winning one and inhibits the more distant neurons. For the SOM, a similar proximity topology can be used to delimit the excitatory lateral interactions to a limited neighborhood of the winning neuron. If Djv is the lateral distance between the jth neuron and the winning one v, the Gaussian lateral attenuation function φ is given by

φ(j, v) = exp(−D²jv / (2σ²))    (2.63)

where σ indicates the circular amplitude of the lateral interaction centered on the winning neuron. The function (2.63) has the properties of taking its maximum value at the position of the winning neuron, of having circular symmetry, of decreasing monotonically to zero as the distance tends to infinity, and of being invariant with respect to the position of the winning neuron. Neighborhood topologies can differ (for example, 4-neighborhood, 8-neighborhood, hexagonal, etc.); what matters is the variation over time of the extension of the neighborhood σ(t), which it is useful to reduce over time until it includes only the winning neuron. A method to reduce it progressively over time, that is, as the iterations of the learning process grow, is given by the following exponential form:

σ(t) = σ0 exp(−t / Dmax)    (2.64)

where σ0 indicates the size of the neighborhood at the iteration t0 and Dmax is the maximum extent of the lateral interaction, which decreases during the training phase. These parameters of the initial state of the network must be selected appropriately.

Adaptation. This process carries out the actual learning phase, in which the Kohonen layer self-organizes by adequately updating the weight vector of the winning neuron and the weight vectors of the neurons of the lateral interaction according to the Gaussian attenuation function (2.63). In particular, for the latter, the adaptation of the weights is smaller than for the winning neuron. This happens as


input pattern vectors are presented to the network. The adaptation equation (also known as Hebbian learning) of all the weights is applied immediately after determining the winning neuron v, and is given by

wji(t + 1) = wji(t) + η(t) φ(j, v) [xi(t) − wji(t)],  i = 1, ..., d    (2.65)

where j indicates the jth neuron included in the lateral interaction defined by the Gaussian attenuation function, t + 1 indicates the current iteration (epoch), and η(t) controls the learning speed. The expression η(t)φ(j, v) in (2.65) represents the weight factor with which the weight vectors of the winning neuron and of the neurons included in the neighborhood of the lateral interaction φ(j, v) are modified. The latter, given by (2.63), also depends on σ(t), which, as seen in (2.64), it is useful to reduce over time (iterations). It is also useful to vary η(t) over time, starting from a maximum initial value and then reducing it exponentially as follows:

η(t) = η0 exp(−t / tmax)    (2.66)

where η0 indicates the maximum initial value of the learning function η and tmax indicates the maximum number of expected iterations (learning epochs). From the geometric point of view, the effect of learning at each epoch, obtained with (2.65), is to adjust the weight vector wv of the winning neuron and those of the neighborhood of the lateral interaction, moving them in the direction of the input vector x. Repeating this process for all the patterns in the training set realizes the self-organization of the Kohonen map, i.e., its topological ordering. In particular, we obtain a bijection between the feature space (input vectors x) and the discrete map of Kohonen (winning neuron described by the weight vector wv). The weight vectors w can be used as pointers to identify the vector of origin x in the feature space (see Fig. 2.6). The learning algorithm of the Kohonen network is reported below as Algorithm 23. Once the network has been initialized with appropriate parameters, for example,

η0 = 0.9, η ≥ 0.1, tmax ≈ 1000, Dmax ≈ 1000/ log σ0

starting with the completely random weight vectors, the initial state of the Kohonen map is totally disordered. Presenting to the network the patterns of the training set gradually triggers the process of self-organization of the network (see Fig. 2.8) during which the topological ordering of the output neurons is performed with the weight vectors that map as much as possible the input vectors (network convergence phase). It may occur that the network converges toward a metastable state, i.e., the network converges toward a disordered state (in the Kohonen map we have topological defects). This occurs when the lateral interaction function φ(t) decreases very quickly.


Fig. 2.6 Self-organization of the SOM network. It occurs through a bijection between the input space X and the Kohonen map (in this case 2D). Presenting to the network a pattern vector x, the winning neuron PEv is determined, whose associated weight vector wv is the most similar to x. Repeated for all the training set patterns, this competitive process produces a tessellation of the input space into regions represented by the winning neurons, whose weight vectors wvj realize the discretization once reprojected in the input space, thus producing the Voronoi tessellation. Each Voronoi region is represented by the weight vectors of the winning neurons (prototypes of the input vectors included in the relative regions)

1

0.8

0.6

0.4

X: 0.85 Y: 0.26

0.2

0

−0.2 0.2

0.3

0.4

0.5

0.6

0.7

Fig. 2.7 Application of a 1D SOM network for classification

0.8

0.9

1


Algorithm 23 SOM algorithm
  Initialize: t ← 0, η0, σ0, tmax, Dmax;
  Create a grid of M neurons by associating d-dimensional weight vectors wjT = (wj1, . . . , wjd), j = 1, . . . , M;
  N ← number of patterns in the training set;
  Initialize: assign small random initial values to the weight vectors wj, j = 1, . . . , M;
  repeat
    for i = 1 to N do
      t ← t + 1;
      Present to the network a pattern vector x chosen randomly from the training set;
      Calculate the winning neuron wv with (2.62);
      Update the weight vectors, including those of the neighboring neurons, with (2.65);
    end for
    Reduce η(t) and σ(t) according to (2.66) and (2.64)
  until (t ≥ tmax)
  end
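To make the procedure concrete, the following minimal NumPy sketch implements one possible version of Algorithm 23 for a rectangular map. The grid size, the exponential decay of the neighborhood width σ(t), and the Gaussian form of the lateral interaction φ are illustrative assumptions, not the only choices consistent with Eqs. (2.62)-(2.66).

```python
import numpy as np

def train_som(X, rows=10, cols=10, t_max=1000, eta0=0.9, sigma0=5.0):
    """Minimal SOM training sketch (Eqs. 2.62-2.66): X has shape (N, d)."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(rows * cols, d))       # small random initial weights
    # grid coordinates of each neuron, used by the lateral interaction phi(j, v)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    t = 0
    while t < t_max:
        x = X[rng.integers(N)]                               # random training pattern
        t += 1
        eta = eta0 * np.exp(-t / t_max)                      # Eq. (2.66)
        sigma = sigma0 * np.exp(-t / t_max)                  # shrinking neighborhood width
        v = np.argmin(np.linalg.norm(W - x, axis=1))         # winning neuron, Eq. (2.62)
        dist2 = np.sum((grid - grid[v]) ** 2, axis=1)        # squared grid distances to winner
        phi = np.exp(-dist2 / (2.0 * sigma ** 2))            # Gaussian lateral interaction
        W += eta * phi[:, None] * (x - W)                    # Eq. (2.65)
    return W, grid
```

In this sketch σ(t) is decayed with the same exponential schedule used for η(t); other schedules (e.g., one governed by Dmax) are equally admissible.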

Normally the convergence phase requires a number of iterations related to the number of neurons (at least 500 times the number of neurons M). Like the MLP network, once the synaptic weights have been calculated in the training phase, the SOM network is used in the test context to classify a generic pattern vector x not presented during training. Fig. 2.7 shows a simple example of classification with a 1D SOM network. The number of classes is 6, each of which has ten 2D input vectors (indicated with the "+" symbol). The network is configured with 6 neurons with associated initial weight vectors wi = (0.5, 0.5), i = 1, . . . , 6, and the initial learning parameter η = 0.1. After training, the weight vectors, adequately modified by the SOM, each represent the prototype of a class. They are indicated with the symbol "◦" and are located at the center of each cluster. Some input vectors (indicated with black squares) are then presented to the SOM network to test it, and each is correctly classified in the class to which it belongs. The Kohonen network, by projecting a d-dimensional vector onto a discrete 2D grid, actually performs a transformation that reduces the dimensionality of the data, as happens with the principal components transform (PCA). In essence, it realizes a nonlinear generalization of the PCA. Let us now examine some peculiar properties of the Kohonen network.

Approximation of the input space. Once the SOM algorithm converges, the resulting Kohonen map displays important statistical features of the input feature space. The SOM algorithm can be thought of as a nonlinear projection ψ which projects the continuous input space (feature space) X into the discrete output space L. This


Fig. 2.8 Example of a 1D Kohonen network that groups 6 2D input vectors into 3 classes (panels: Kohonen network, before training, after training); using Matlab, the network is configured with 3 neurons. Initially, the 3 weight vectors take on small values and are randomly oriented. As the input vectors are presented (indicated with "+"), the weight vectors tend to move toward the most similar input vectors until they reach the final position to represent the prototype vectors (indicated with "◦") of each grouping

transformation, ψ : X → L, can be seen as an abstraction that associates a large set of input vectors {X} with a small set of prototypes {W} (the winning neurons) which are a good approximation of the original input data (see Fig. 2.6). This process is the basic idea of the theory of vector quantization (Vector Quantization-VQ) which aims to reduce the dimensionality or to realize data compression. Kohonen has developed two methods for pattern classification, a supervised one known as Learning Vector Quantization-LVQ [7] and an unsupervised one which is the SOM. The SOM algorithm behaves like a data encoder where each neuron is connected to the input space through the relative synaptic weights that represent a point of the input space. Through the ψ transformation, to this neuron corresponds a region of the input space constituted by the set of input vectors that have made the same neuron win by becoming the prototype vector of this input region. If we consider this in a neurobiological context, the input space can represent the coordinates of the set of somatosensory receptors densely distributed over the entire surface of the human body, while the output space represents the set of neurons located in the somatosensory cortex layer where the receptors (specialized for the perception of texture, touch, shape, etc.) are confined. The results of the approximation of the input space depend on the adequacy of the choice of parameters and on the initialization of the synaptic weights. A measure


of this approximation is obtained by evaluating the average quantization error Eq, which must be as small as possible, by comparing the winning neurons wv and the input vectors x. The average of ‖xi − wv‖, obtained by presenting the training set patterns again after learning, is used to estimate the error as follows:

$$E_q = \frac{1}{N} \sum_{i=1}^{N} \| \mathbf{x}_i - \mathbf{w}_v \| \tag{2.67}$$
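As a rough illustration, assuming a trained weight matrix W (one row per neuron) such as the one returned by the SOM sketch above, Eq. (2.67) can be evaluated as follows.

```python
import numpy as np

def quantization_error(X, W):
    """Average quantization error Eq (Eq. 2.67): mean distance between each
    pattern and the weight vector of its winning neuron."""
    # distance of every pattern to every neuron, shape (N, M)
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```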

where N is the number of input vectors x and wv is the corresponding winning weight vector.

Topological ordering. The ψ transformation performed by the SOM produces a local and global topological order, in the sense that neurons close to each other in the Kohonen map represent similar patterns in the input space. In other words, when pattern vectors are transformed (mapped) onto the grid of neurons represented by neighboring winning neurons, the latter will have similar patterns associated with them. This is the direct consequence of the process of adaptation of the synaptic weights, Eq. (2.65), together with the lateral interaction function. This forces the weight vector of the winning neuron wv to move in the direction of the input vector x, as well as the weight vectors wj of the neighboring neurons j to move in the direction of the winning neuron v, to ensure global ordering. The topological ordering produced by the ψ transformation with the SOM can be visualized by thinking of this transformation as an elastic network of output neurons placed in the input space. To illustrate this situation, consider a two-dimensional input space. In this way, each neuron is visualized in the input space at the coordinates defined by its weights (see Fig. 2.9). Initially we would see total disorder, while after training we have the topological ordering. If a number of neurons equal to the number of 2D input patterns is used, the neurons projected by the respective weights onto the input plane will be in the vicinity of their corresponding input patterns, so that an ordered image of the neuron grid is observed at the end of the training process.

Density matching. The ψ transformation of the SOM reflects the intrinsic variations of the input pattern statistics. The density of the winning neurons of an ordered map will reflect the distribution density of the patterns in the training set. Regions of the input space with high density, from which the training patterns come, and therefore with a high probability of being presented to the network, will produce winning neurons very close to each other. On the contrary, in less dense regions of input patterns there will be scattered winning neurons. However, there will be a better resolution for patterns with high probability than for patterns presented with low probability. The SOM, therefore, tends to over-represent input regions with low probability and under-represent regions with high probability. A heuristic can be used to relate the probability distribution of the input vectors p(x) to a magnification factor m(x) ∝ p^{2/3}(x) of the transformed patterns, valid for one-dimensional patterns.

Feature selection. Given a set of input patterns with a nonlinear distribution, the self-organizing map is able to select a set of significant features to approximate the underlying distribution. Recall that the transform to the principal components

Fig. 2.9 Simulation of a Kohonen SOM network with 10 × 10 neurons: a uniformly distributed 2D input vectors in the range [0, 1] × [0, 1], with the overlapping weight vectors whose initial assigned values are around zero; b position of the weight vectors, linked to each other, after 25 iterations; c weights after 300 iterations

achieves the same objective by diagonalizing the correlation matrix to obtain the associated eigenvectors and eigenvalues. If the data do not have a linear distribution, the PCA does not work correctly while the SOM overcomes this problem by virtue of its topological ordering property. In other words, the SOM is able to sufficiently approximate a nonlinear distribution of data by finding the principal surface, and can be considered as a nonlinear generalization of the PCA.

2.10.2 SOM Applications

Many applications have been developed with the Kohonen network. An important fallout is represented by the fact that this simple network model offers plausible explanations of some neurobiological phenomena. The Kohonen network is used in combinatorial optimization to solve the traveling salesman problem, in the fields of Economic Analysis, Data Mining, Data Compression, real-time Phoneme Recognition, and in the field of robotics to solve the problem of inverse kinematics. Several applications have been developed for signal and image processing (segmentation, classification, texture, ...). Finally, various academic and commercial software packages are available. To improve the classification process, in some applications, Kohonen maps can be given as input to a supervised linear classification process. In this case, we speak of a hybrid neural network that combines the SOM algorithm, which produces the unsupervised feature maps, with the supervised learning of a backpropagation MLP network, to achieve a more accurate and more efficient adaptive classification requiring a smaller number of iterations.


2.11 Network Learning Vector Quantization-LVQ

The concept of Vector Quantization (VQ) can be understood by considering the SOM algorithm, which actually encodes a set of input vectors x by generating a reduced set of prototypes wv (associated with the winning neurons) that provide a good approximation of the entire original input space. In this context, these prototypes {wv} can be regarded as code-book vectors representative of the original vectors. In essence, the basic idea of Vector Quantization theory is to reduce the dimensionality of the data, that is, their compression. We have also seen with (2.67) how to estimate the error of this approximation with the quantization of the vectors, evaluating the Euclidean distance between the input vectors {x} and the prototype vectors {wv}. In a more formal way, the best way of thinking about vector quantization is in terms of an encoder, which encodes a given signal, and a decoder, which reconstructs the original signal as closely as possible by minimizing the information lost with encoding and decoding. In considering this parallelism between the encoder/decoder data-process model and the SOM algorithm, we can imagine the encoding function c(x) of the encoder as the winning neuron associated with {wv} of the SOM, the decoding function x′(c) of the decoder as the connection weight vector {wv}, and the probability density function of the input x (including the additive noise) as the lateral interaction function φ(t). A vector quantizer with minimal distortion error is called a Voronoi quantizer or nearest-neighbor quantizer. This is because the input space is partitioned into a set of Voronoi regions, or nearest-neighbor regions, each of which contains the associated reconstruction vector. The SOM algorithm provides an unsupervised method useful for calculating the Voronoi vectors, with the approximation obtained through the weight vectors of the winning neurons in the Kohonen map. The LVQ approach, devised by Kohonen [8], is the supervised learning version of a vector quantizer that can be used when the input data have been labeled (classified). LVQ uses class information to slightly shift the Voronoi vectors in order to improve the quality of the classifier's decision regions. The procedure is divided into two stages: the SOM-based competitive learning process followed by LVQ supervised learning (see Fig. 2.10a). In fact, it carries out a pattern classification procedure. With the first process, the unsupervised SOM associates the vectors x of the training set with the Voronoi weight vectors {wv}, obtaining a partition of the input space. With the second process, LVQ, knowing the class membership of each vector of the training set, finds the best labeling for each neuron through the associated weight vector wv, i.e., for each Voronoi region. In general, the Voronoi regions do not precisely delimit the boundaries of class separation. The goal is to change the boundaries of these regions to obtain an optimal classification. LVQ actually achieves a proper displacement of the Voronoi region boundaries, starting from the {wv} prototypes found by the SOM for the training set {x}, and uses the knowledge of the labels assigned to the x patterns to find the best label to assign to each prototype wv. LVQ verifies the class of the input vector against the class to which


Fig. 2.10 Learning Vector Quantization network: a functional scheme with the SOM component for competitive learning and the LVQ component for supervised learning; b architecture of the LVQ network, composed of the input layer, the layer of Kohonen neurons, and the computation component to reinforce the winning neuron if the input has been correctly classified by the SOM component

the prototype wv belongs, and reinforces it appropriately if they belong to the same class. Figure 2.10b shows the architecture of the LVQ network. At the schematic level, we can consider it as three linear layers of neurons: input, Kohonen, and output. In reality, the M processing neurons are only those of the Kohonen layer. The d input neurons only have the function of storing the input vectors {x}, randomly presented one at a time. Each input neuron is connected with all the neurons of the Kohonen layer. The number of neurons in the output layer is equal to the number C of classes. The network is strongly conditioned by the number of neurons used for the Kohonen layer. Each neuron of this layer represents a prototype of a class, whose values are defined by the weight vector wv, i.e., the synaptic connections of each neuron connected to all the input neurons. The number of neurons M in the middle layer is a multiple of the number C of classes. In Fig. 2.10b, the output layer shows the possible clusters of neurons that LVQ has detected to represent the same class ωj, j = 1, . . . , C. We now describe the sequential procedure of the basic LVQ algorithm.

1. Given the training set P = {x1, . . . , xN}, xi ∈ Rd, and the weight vectors (Voronoi vectors) wj, j = 1, . . . , M, obtained through unsupervised learning with the SOM network. These weights are the initial values for the LVQ algorithm.
2. Use the classification labels of the input patterns to improve the classification process by appropriately modifying each prototype wj. LVQ verifies, for each input randomly selected from the training set D = {(xi, ωxi)}, i = 1, . . . , n (containing C classes), the input class ωxi against the class associated with the Voronoi regions, adequately modifying the weight vectors wj.
3. Randomly select an input vector xi from the training set D. If the selected input vector xi and the weight vector wv of the winning neuron (i.e., wv is the one closest to xi in the Euclidean distance sense) have the same class label (ωxi = ωwv), then modify the prototype wv representing this class by reinforcing it,


as follows:

$$w_v(t + 1) = w_v(t) + \eta(t)\,[x_i - w_v(t)] \tag{2.68}$$

where t indicates the previous iteration and η(t) indicates the current value of the learning parameter (variable in the range 0 < η(t) ≤ 1), analogous to that of the SOM.

4. If the selected input vector xi and the weight vector wv of the winning neuron have a different class label (ωxi ≠ ωwv), then modify the prototype wv, moving it away from the input vector, as follows:

$$w_v(t + 1) = w_v(t) - \eta(t)\,[x_i - w_v(t)] \tag{2.69}$$

5. All the other weight/prototype vectors wj ≠ wv, j = 1, . . . , M, associated with the input regions are not changed.
6. It is convenient that the learning parameter η(t) (which determines the magnitude of the prototype shift with respect to the input vector) decreases monotonically with the number of iterations t, starting from an initial value ηmax (ηmax ≪ 1) and reaching very small values greater than zero:

$$\eta(t + 1) = \eta(t)\left( 1 - \frac{t}{t_{max}} \right) \tag{2.70}$$

where tmax indicates the maximum number of iterations.
7. LVQ stop condition. This condition can be reached when t = tmax or by imposing a lower limit on the learning parameter, η(t) = ηmin. If the stop condition is not reached, the iterations continue from step 3.

The described classifier (also known as LVQ1) is more efficient than using the SOM algorithm alone. It is also observed that, with respect to the SOM, it is no longer necessary to model the neurons of the Kohonen layer with the lateral interaction function φ, since the objective of LVQ is vector quantization and not the creation of topological maps. The goal of the LVQ algorithm is to adapt the weights of the neurons to optimally represent the prototypes of the training set patterns, in order to obtain a correct partition of the training set. This architecture allows us to classify the N input vectors of the training set into C classes, each of which is subdivided into subclasses represented by the initial M prototypes/code-books. The sizing of the LVQ network is linked to the number of prototypes, which defines the number of neurons in the Kohonen layer. Undersizing leads to partitions with few regions, with the consequent problem of having regions with patterns belonging to different classes. Oversizing instead leads to the problem of overfitting.
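As an illustration, a minimal NumPy sketch of the LVQ1 update loop (steps 3-7) might look as follows; the initialization of the prototypes W and of their labels from a previously trained SOM is assumed to be given, and the specific parameter values are arbitrary.

```python
import numpy as np

def lvq1(X, y, W, w_labels, eta_max=0.1, t_max=10000, eta_min=1e-4, seed=0):
    """LVQ1 sketch: X (N, d) patterns, y (N,) class labels,
    W (M, d) prototypes from the SOM, w_labels (M,) their assigned classes."""
    rng = np.random.default_rng(seed)
    eta = eta_max
    for t in range(t_max):
        i = rng.integers(len(X))
        xi = X[i]
        v = np.argmin(np.linalg.norm(W - xi, axis=1))   # winning prototype
        if w_labels[v] == y[i]:
            W[v] += eta * (xi - W[v])                    # Eq. (2.68): attract
        else:
            W[v] -= eta * (xi - W[v])                    # Eq. (2.69): repel
        eta = eta * (1 - t / t_max)                      # Eq. (2.70)
        if eta <= eta_min:                               # stop condition (step 7)
            break
    return W
```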


2.11.1 LVQ2 and LVQ3 Networks

Depending on the initial values of the neuron weight vectors, these must be able to move through a class that they do not represent, in order to associate with another region of belonging. Since the weight vectors of these neurons will be rejected by the vectors in the regions they must cross, it is possible that this will never happen and that they will never be placed in the correct region to which they belong. The LVQ2 algorithm [9], which introduces a learning variant with respect to LVQ1, can solve this problem. During the learning, for each input vector x, a simultaneous update is carried out considering the two prototype vectors wv1 and wv2 closest to x (always determined with the minimum distance from x). One of them must belong to the correct class and the other to a wrong class. Also, x must lie in a window between the vectors wv1 and wv2 that delimit the decision boundary (the perpendicular bisecting plane). Under these conditions, the two weight vectors wv1 and wv2 are updated appropriately using Eqs. (2.68) and (2.69), respectively, for the correct and incorrect class of membership of x. All the other weight vectors are left unchanged. The LVQ3 algorithm [8] is the analogue of LVQ2 but has an additional update of the weights in the case in which x, wv1, and wv2 represent the same class:

$$w_k(t + 1) = w_k(t) + \epsilon\, \eta(t)\,[x - w_k(t)], \qquad k = v_1, v_2, \quad 0.1 < \epsilon < 0.5 \tag{2.71}$$

where ε is a stabilization constant that reflects the width of the window associated with the borders of the regions represented by the prototypes wv1 and wv2. For very narrow windows, the constant ε must take very small values. With the changes introduced by LVQ2 and LVQ3 during the learning process, it is ensured that the weight vectors (code-book vectors) continue to approximate the class distributions, preventing them from moving away from their optimal position if learning continues.

2.12 Recurrent Neural Networks

An alternative architecture to the feedforward network, described in the preceding paragraphs, is the recurrent architecture (also called cyclical). The topology of a cyclic network requires that at least one neuron (normally a group of neurons) is connected in such a way as to create a cyclic and circular data flow (loop). In the feedforward topology, only the error was propagated backward during learning. In recurrent networks, at least one neuron propagates its output backward (feedback), which simultaneously becomes its input. A feedforward network behaves like a static system, completely described by one or more functionals, where the current output y depends only on the current inputs x in relation to the mapping function y = F(x). The recurrent networks, on the other hand, are dynamic systems where the output signal, in general, depends on its internal state (due to the feedback), which exhibits


Fig. 2.11 Computational dynamics of a neural network: a static system; b continuous-time dynamic system; and c discrete-time dynamic system

a dynamic temporal behavior, and on the last input signal. Two classes of dynamical systems are distinguished: continuous-time and discrete-time dynamic systems. The dynamics of continuous-time systems depend on functions whose continuous variable is time (spatial variables are also used). This dynamic is described by differential equations. A most useful model of the dynamics is that described only by first-order differential equations, y′(t) = f[x(t), y(t)], where y′(t) = dy(t)/dt, which models the output signal as a function of the derivative with respect to time, requiring an integration operator and the feedback signal inherent to dynamic systems (see Fig. 2.11). In many cases, a discrete-time computational system is assumed. In these cases, a discrete-time system is modeled by functions of a discrete-time variable (spatial variables are also considered). The dynamics of the network, in these cases, starts from the initial state at time 0, and in the subsequent discrete steps t = 1, 2, 3, . . . the state of the network changes in relation to the computational dynamics foreseen by the activation function of one or more neurons. Thus, each neuron acquires its inputs, i.e., the outputs of the neurons connected to it, and updates its state with respect to them. The dynamics of a discrete-time network is described by difference equations, whose first discrete difference is given by Δy(n) = y(n + 1) − y(n), where y(n + 1) and y(n) are, respectively, the future (predicted) value and the current value of y, and n indicates the discrete variable that replaces the continuous independent variable t. To model the dynamics of a discrete-time system, that is, to obtain the output signal y(n + 1) = f[x(n), y(n)], the integration operator is replaced with the summation operator D, which has the function of a delay unit (see Fig. 2.11c).² The state of the neurons can change independently of each other or can be controlled centrally, and in these cases we have asynchronous or synchronous neural network models, respectively. In the first case, the neurons are updated one at a time, while in the second case all the neurons are updated at the same time. Learning with a recurrent network can be accomplished with a procedure similar to the gradient descent used with the backpropagation algorithm.
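As a toy illustration of the discrete-time formulation y(n + 1) = f[x(n), y(n)] discussed above, the following sketch iterates a single recurrent unit whose previous output is fed back through a unit delay; the weights and the sigmoid nonlinearity are arbitrary choices made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_recurrent_unit(x_seq, w_in=0.8, w_fb=0.5, y0=0.0):
    """Discrete-time dynamics y(n+1) = f[x(n), y(n)]: the delayed output y(n)
    is combined with the current input x(n) at every step."""
    y = y0
    outputs = []
    for x in x_seq:
        y = sigmoid(w_in * x + w_fb * y)   # new state depends on input and fed-back state
        outputs.append(y)
    return outputs

print(run_recurrent_unit([1.0, 0.5, -0.3, 0.0]))
```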

² The D operator derives from the Z transform applied to the discrete signals y(n), n = 0, 1, 2, 3, . . ., to obtain analytical solutions to the difference equations. The delay unit is introduced simply to delay the activation signal until the next iteration.


2.12.1 Hopfield Network

A particular recurrent network was proposed in 1982 by J. Hopfield [10]. The originality of Hopfield's network model was such as to revitalize the entire scientific environment in the field of artificial neural networks. Hopfield showed how a collection of simple processing units (for example, McCulloch-Pitts perceptrons), appropriately configured, can exhibit remarkable computing power. Inspired by physical phenomenologies,³ he demonstrated that a physical system can be used as a potential memory device, once such a system has a dynamic of locally stable states to which it is attracted. Such a system, with its stability and well-localized attractors,⁴ constitutes a model of CAM memory (Content-Addressable Memory).⁵ A CAM memory is a distributed memory that can be realized by a neural network if each content (in this context, a pattern) of the memory corresponds to a stable configuration of the neural network, reached after its evolution starting from an initial configuration. In other words, starting from an initial configuration, the neural network reaches a stable state, that is, an attractor associated with the pattern most similar to that of the initial configuration. Therefore, the network recognizes a pattern when the initial stimulus corresponds to something that, although not equal to the stored pattern, is very similar to it. Let us now look at the structural details of the Hopfield network, which differs greatly from the two-layer input/output network models. The Hopfield network is realized with M neurons configured in a single layer (processing units or process elements PE), where each is connected with all the others of the network except with itself. In fact, it is a recurrent symmetric network, that is, with a matrix of synaptic weights

$$w_{ij} = w_{ji}, \quad \forall i, j \tag{2.72}$$

3 Ising’s

model (from the name of the physicist Ernst Ising who proposed it) is a physicalmathematical model initially devised to describe a magnetized body starting from its elementary constituents. The model was then used to model variegated phenomena, united by the presence of single components that, interacting in pairs, produce collective effects. 4 In the context of neural networks, an attractor is the final configuration achieved by a neural network that, starting from an initial state, reaches a stable state after a certain time. Once an attractor is known, the set of initial states that determine evolution of the network that ends with that attractor is called the attraction basin. 5 Normally different memory devices store and retrieve the information by referring to the memory location addresses. Consequently, this mode of access to information often becomes a limiting factor for systems that require quick access to information. The time required to find an item stored in memory can be considerably reduced if the object can be identified for access through its contents rather than by memory addresses. A memory accessed in this way is called addressable memory for content or CAM-Content-Addressable Memory. CAM offers an advantage in terms of performance on other search algorithms, such as binary tree or look-up table based searches, comparing the desired information against the entire list of pre-stored memory location addresses.


Fig. 2.12 Model of the discrete-time Hopfield network with M neurons. The output of each neuron feeds back to all the others except itself. The pattern vector (x1, x2, . . . , xM) forms the input to the network, i.e., the initial state (y1(0), y2(0), . . . , yM(0)) of the network

no neuron is connected with itself,

$$w_{ii} = 0, \quad \forall i \tag{2.73}$$

and is completely connected (see Fig. 2.12). According to Hopfield's notation, wij indicates the synaptic connection from the jth neuron to the ith neuron, and the activation level of the ith neuron is indicated with yi, which assumes the value 1 if the neuron is activated and 0 otherwise. In the discrete Hopfield network context, the total instantaneous state of the network is given by the M values yi, which represent a binary vector of M bits. Neurons are assumed to operate as the perceptron, with a threshold activation function that produces a binary state. The state of each neuron is given by the weighted sum of its inputs with the synaptic weights, as follows:

$$y_i = \begin{cases} 1 & \text{if } \sum_{j \neq i} w_{ij} y_j > \theta_i \\ 0 & \text{if } \sum_{j \neq i} w_{ij} y_j < \theta_i \end{cases} \tag{2.74}$$

where yi indicates the output of the ith neuron and θi the relative threshold. The (2.74), rewritten with the activation function σ(·) of each neuron, becomes

$$y_i = \sigma\!\left( \sum_{j=1;\, j \neq i}^{M} w_{ij} y_j - \theta_i \right) \tag{2.75}$$

where

$$\sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases} \tag{2.76}$$

To study the stability of the asynchronous dynamics (2.75), Hopfield associated with the state of the network the energy function

$$E = -\frac{1}{2} \sum_{i,j=1;\, j \neq i}^{M} w_{ij} y_i y_j + \sum_{i=1}^{M} y_i \theta_i \tag{2.77}$$

so that the energy variation produced by the update of the ith neuron is

$$\Delta E = \big[ y_i(t) - y_i(t+1) \big] \left( \sum_{j=1;\, j \neq i}^{M} w_{ij} y_j - \theta_i \right) \tag{2.78}$$

If $\sum_{j \neq i} w_{ij} y_j > \theta_i$, the second factor is greater than or equal to zero; with yi(t + 1) = 1 and yi(t) = 0, it follows that the first factor [•] is less than zero, thus resulting in ΔE ≤ 0. Therefore, Hopfield has shown that, for any change of the ith neuron, if the activation equations (2.75) and (2.76) are maintained, the energy variation ΔE is negative or zero. The energy function E decreases in a monotonic way (see Fig. 2.13) if the activation rules and the symmetry of the weights are maintained. This allows the network, after repeated updates, to converge toward a stable state that is a local minimum of the energy function (also considered a Lyapunov function⁶).

⁶ Lyapunov functions, named after the Russian mathematician Aleksandr Mikhailovich Lyapunov, are scalar functions that are used to study the stability of an equilibrium point of an ordinary autonomous differential equation, which normally describes a dynamic system. For dynamic physical systems, conservation laws often provide candidate Lyapunov functions.


Fig. 2.13 1D and 2D energy map with some attractors in a Hopfield network. The locations of the attractors indicate the stable states where the patterns are associated and stored. After having initialized the network with the pattern to be memorized, the network converges following the direction indicated by the arrows to reach the location of an attractor to which the pattern is associated

The generalized form of the network update equation also includes the additional bias term Ii (which can also be the direct input of a sensor) for each neuron:

$$y_i = \begin{cases} 1 & \text{if } \sum_{j \neq i} w_{ij} y_j + I_i > \theta_i \\ 0 & \text{if } \sum_{j \neq i} w_{ij} y_j + I_i < \theta_i \end{cases} \tag{2.79}$$

It is highlighted that, contrary to the training of the perceptron, the thresholds of the neurons are never updated. The energy function becomes

$$E = -\frac{1}{2} \sum_{i,j=1;\, j \neq i}^{M} w_{ij} y_i y_j - \sum_{i=1}^{M} y_i I_i + \sum_{i=1}^{M} y_i \theta_i \tag{2.80}$$

Also in this case it can be shown that, when the state of the network changes due to a single neuron update, the energy change ΔE is always zero or negative.
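As a small illustration, the energy (2.80) of a given binary state can be computed directly; W is assumed symmetric with a zero diagonal, as required by (2.72) and (2.73), and the bias vector I is optional.

```python
import numpy as np

def hopfield_energy(y, W, theta, I=None):
    """Energy of a binary state vector y according to Eq. (2.80);
    W must be symmetric with a zero diagonal."""
    if I is None:
        I = np.zeros_like(theta)
    return -0.5 * (y @ W @ y) - y @ I + y @ theta
```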

2.12.2 Application of Hopfield Network to Discrete States

The initialization of the Hopfield network depends on the particular application of the network. It is normally made starting from the initial values of the neurons associated with the desired pattern. After repeated updates, the network converges to a pattern attractor. Hopfield has shown that convergence is generally assured and that the attractors of this nonlinear dynamic system are stable, not periodic or chaotic as often occurs in other systems. The most common applications are:

(a) Associative memories: the network is capable of memorizing some states (local minima of the energy function) associated with patterns which it will then remember.
(b) Combinatorial optimization: assuming a well-modeled problem, the network is able to find some local minima and acceptable solutions but does


not always guarantee the optimal solution. The classic application example is the Traveling Salesman's Problem.
(c) Calculation of logical functions (OR, XOR, ...).
(d) Miscellaneous applications: pattern classification, signal processing, control, voice analysis, image processing, artificial vision. Generally used as a black box to calculate some output resulting from a certain self-organization produced by the network itself.

Many of these applications are based on Hebbian learning.

2.12.2.1 Associative Memory-Training

The Hopfield network can be used as an associative memory. This allows the network to serve as a content-addressable memory (CAM), i.e., the network will converge to remembered states even if it is stimulated with patterns slightly different from those that generated those states (the classic example is the recognition of a noisy character compared to the noise-free one learned by the network). In this case, we want the network to be stimulated by N binary pattern vectors P = {x1, x2, . . . , xN}, with xi = (xi1, xi2, . . . , xiM), and to generate N different stable states associated with these patterns. In essence, the network, once stimulated by the set of patterns P, stores their footprint by determining the synaptic weights, adequate for all connections. The determination of the weights is accomplished through Hebbian unsupervised learning. This learning strategy is summarized as follows:

(a) If two neurons have the same state of activation, their synaptic connection is reinforced, i.e., the associated weight wij is increased.
(b) If two neurons exhibit opposite states of activation, their synaptic connection is weakened.

Starting with the weight matrix W at zero, applying these learning rules and presenting the patterns to be memorized, the weights are modified as follows:

$$w_{ij} = w_{ij} + \epsilon\, x_{ki} x_{kj}, \qquad i, j = 1, \dots, M; \; i \neq j \tag{2.81}$$

where k indicates the pattern to be memorized, i and j indicate the indices of the components of the binary pattern vector to be learned, and ε is a constant that keeps the weights from becoming too large or too small (normally ε = 1/N). The (2.81) is iterated for all N patterns to be stored and, once all have been presented, the final weights result in

$$w_{ij} = \sum_{k=1}^{N} x_{ki} x_{kj}, \qquad w_{ii} = 0; \quad i, j = 1, \dots, M; \; i \neq j \tag{2.82}$$

In vector form, the M × M weight matrix W for the set P of the stored patterns is given by

$$\mathbf{W} = \sum_{k=1}^{N} \mathbf{x}_k^T \mathbf{x}_k - N\mathbf{I} = (\mathbf{x}_1^T \mathbf{x}_1 - \mathbf{I}) + (\mathbf{x}_2^T \mathbf{x}_2 - \mathbf{I}) + \cdots + (\mathbf{x}_N^T \mathbf{x}_N - \mathbf{I}) \tag{2.83}$$


where I is the identity matrix M × M which, subtracted from the weight matrix W, guarantees that the latter has all the elements of the main diagonal set to zero. It is also assumed that the binary responses of the neurons are bipolar (+1 or −1). In the case of binary unipolar pattern vectors (that is, with values 0 or 1), the weight equation (2.82) is modified by introducing a scale change and translation:

$$w_{ij} = \sum_{k=1}^{N} (2x_{ki} - 1)(2x_{kj} - 1), \qquad w_{ii} = 0; \quad i, j = 1, \dots, M; \; i \neq j \tag{2.84}$$

2.12.2.2 Associative Memory-Recovery

Once the P patterns are stored, it is possible to activate the network to retrieve them, even if the pattern given as input to the network is slightly different from the one stored. Let s = (s1, s2, . . . , sM) be a generic pattern (with bipolar elements −1 and +1) to be retrieved, presented to the network, which is initialized with yi(0) = si, i = 1, . . . , M, where the activation state of the ith neuron at time t is indicated with yi(t). The state of the network at time (t + 1), during its convergence iterations, based on (2.75) and neglecting the threshold, can be expressed by the following recurrent equation:

$$y_i(t + 1) = \sigma\!\left( \sum_{j=1;\, j \neq i}^{M} w_{ij}\, y_j(t) \right) \tag{2.85}$$

where in this case the activation function σ(·) used is the sign function sgn(·), considering the bipolarity of the patterns (−1 or 1). Recall that the neurons are updated asynchronously and randomly. The network activity continues until the output of the neurons remains unchanged, thus representing the final result, namely the pattern x recovered from the set P, stored earlier in the learning phase, that best approximates the input pattern s. This approximation is evaluated with the Hamming distance H, which determines the number of different bits between two binary patterns and in this case is calculated as H = M − Σi si xi. The Hopfield network can be used as a pattern classifier. The set of patterns P, in this case, are considered the prototypes of the classes, and the patterns s to be classified are compared with these prototypes by applying a similarity function to decide the class to which they belong. Consider a first example with two stable states (two auto-associations), x1 = (1, −1, 1)T and x2 = (−1, 1, −1)T. The symmetric, zero-diagonal matrix of the synaptic weights W given by (2.83) results in

$$\mathbf{W} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}\begin{bmatrix} 1 & -1 & 1 \end{bmatrix} + \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} - 2\mathbf{I} = \begin{bmatrix} 0 & -2 & 2 \\ -2 & 0 & -2 \\ 2 & -2 & 0 \end{bmatrix}$$


Fig. 2.14 Structure of the discrete-time Hopfield network in the example with M = 3 neurons and 2³ = 8 states, of which two are stable, representing the patterns (1, −1, 1) and (−1, 1, −1)

The network is composed of 3 neurons, and 2³ = 8 patterns can be presented. Presenting to the network any of the 8 possible patterns, the network always converges to one of the two stable states, as shown in Fig. 2.14. For example, by initializing the network with the pattern y(0) = (1, 1, −1)T and applying (2.85) to the 3 neurons, the network converges to the final state y(3) = (−1, 1, −1)T, i.e., the pattern x2 = (−1, 1, −1)T. Initializing instead with the pattern y(0) = (1, −1, −1)T, the network converges to the final state y(3) = (1, −1, 1)T, i.e., the pattern x1 = (1, −1, 1)T. Figure 2.14 shows the structure of the network and all the possible convergences for the 8 possible patterns, which differ by at most 1 bit from the stored prototype pattern vectors. A second example of the Hopfield network used as a CAM [11] is shown, where simple binary images of size 20 × 12 representing single numeric characters are stored and then recovered. In this case, a pattern is represented by a vector of 240 binary elements with bipolar values −1 and 1. Once the desired patterns are stored in the network (training phase), the ability of the network to recover a stored pattern is tested. Figure 2.15 shows the stored patterns and the same patterns with 20% added random noise, presented as input to the network for their retrieval. The results of the correct (or incorrect) retrieval of the patterns to be recovered are highlighted. The noise added to the patterns has greatly altered the initial graphics of the numeric characters (deleting and/or adding elements in the binary image). Hopfield's auto-associative memory performs the function of a decoder, in the sense that it retrieves the memory content most similar to the input pattern. In this example, only the character "5" is incorrectly retrieved, as the character "3".
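To tie the training rule (2.82)-(2.83) and the recovery rule (2.85) together, here is a minimal NumPy sketch of a bipolar Hopfield CAM; the asynchronous random update order, the fixed number of sweeps, and the convention of mapping a zero activation to +1 are illustrative choices.

```python
import numpy as np

def store(patterns):
    """Hebbian storage (Eq. 2.83): patterns is an (N, M) array of +/-1 values."""
    N, M = patterns.shape
    W = patterns.T @ patterns - N * np.eye(M)   # sum of outer products, zero diagonal
    return W

def recall(W, s, sweeps=10, seed=0):
    """Asynchronous recovery (Eq. 2.85) starting from a (possibly noisy) pattern s."""
    rng = np.random.default_rng(seed)
    y = s.astype(float).copy()
    M = len(y)
    for _ in range(sweeps):
        for i in rng.permutation(M):            # random asynchronous updates
            y[i] = 1.0 if W[i] @ y >= 0 else -1.0
    return y

# Example with the two stable states of the text: (1, -1, 1) and (-1, 1, -1).
X = np.array([[1, -1, 1], [-1, 1, -1]])
W = store(X)
print(recall(W, np.array([1, 1, -1])))          # converges to one of the stored patterns
```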

2.12.2.3 Associative Memory-Performance

Assume that the pattern x′ has been stored as one of the N patterns of the set P. If this pattern is presented to the network, the activation potential of the ith neuron to recover this pattern x′ is given by

$$y_i = net_i = \sum_{j=1;\, j \neq i}^{M} w_{ij}\, x'_j \;\cong\; \sum_{j=1;\, j \neq i}^{M} \Bigg( \underbrace{\sum_{k=1}^{N} x_{ki} x_{kj}}_{=\, w_{ij} \text{ for } (2.82)} \Bigg) x'_j \;\cong\; \sum_{k=1}^{N} x_{ki} \underbrace{\sum_{j=1}^{M} x_{kj}\, x'_j}_{\tilde{M}} \;\cong\; x'_i\, \tilde{M} \tag{2.86}$$


Fig. 2.15 Results of numerical pattern retrieval using the Hopfield network as a CAM memory. Giving non-noisy patterns as input, the network always retrieves the memorized characters. Giving instead noisy input patterns at 25%, the network fails to correctly recover only the character "5", recognized as "3"


where M̃ ≤ M, being the result of the inner product between vectors with M bipolar binary elements (+1 or −1). If the patterns x and x′ are statistically independent, i.e., orthogonal, their inner product becomes zero. Alternatively, the limit case is when the two vectors are identical, obtaining M̃ = M, that is, the pattern x′ does not generate any updates and the network is stable. For this stable state, for the recovery of the pattern x′ we have, considering Eq. (2.83), the activation potential net given by

$$\mathbf{y} = \mathbf{net} = \mathbf{W}\mathbf{x}' = \left( \sum_{k=1}^{N} \mathbf{x}_k^T \mathbf{x}_k - N \mathbf{I} \right) \mathbf{x}' \tag{2.87}$$


Let us now analyze the two possible cases:

1. The stored patterns P are orthogonal (or statistically independent) and M > N, that is,

$$\mathbf{x}_i^T \mathbf{x}_j = \begin{cases} 0 & \text{if } i \neq j \\ M & \text{if } i = j \end{cases} \tag{2.88}$$

Then, developing (2.87), we will have the following:

$$\mathbf{y} = \mathbf{net} = (\mathbf{x}_1^T \mathbf{x}_1 + \cdots + \mathbf{x}_N^T \mathbf{x}_N - N\mathbf{I})\,\mathbf{x}' = (M - N)\,\mathbf{x}' \tag{2.89}$$

With the assumption that M > N, it follows that x′ is a stable state of the Hopfield network.

2. The stored patterns P are not orthogonal (nor statistically independent):

$$\mathbf{y} = \mathbf{net} = (\mathbf{x}_1^T \mathbf{x}_1 + \cdots + \mathbf{x}_N^T \mathbf{x}_N - N\mathbf{I})\,\mathbf{x}' = \underbrace{(M - N)\,\mathbf{x}'}_{\text{stable state}} + \underbrace{\sum_{k=1;\, \mathbf{x}_k \neq \mathbf{x}'}^{N} (\mathbf{x}_k^T \mathbf{x}_k)\,\mathbf{x}'}_{\text{noise}} \tag{2.90}$$

where the activation potential of the neuron in this case is given by the stable-state term previously determined plus the noise term. The vector x′ is a stable state when M > N, the noise term is very small, and a concordant sign is maintained between the activation potential y and the vector x′ (sgn(y) = sgn(x′)). Conversely, x′ will not be a stable state if the noise is dominant with respect to the equilibrium term, as happens when the number N of patterns to be stored increases (i.e., M − N decreases). As is the case for all associative memories, the best results occur when the patterns to be memorized are represented by orthogonal vectors, or vectors very close to orthogonality. The Hopfield network used as a CAM is proved to have a memory capacity equal to 0.138 · M, where M is the number of neurons (the theoretical limit is 2M patterns). The network is then able to recover the patterns even if they are noisy, within a certain tolerance dependent on the application context.

2.12.3 Continuous State Hopfield Networks

In 1984 Hopfield published another important scientific contribution that proposed a new neural model with continuous states [12]. In essence, the Hopfield model of discrete states (unipolar [0,1] or bipolar [−1,1]) was generalized considering the continuous state of neurons whose responses can assume continuous (real) values in


the interval between 0 and 1, as with MLP networks. In this case, the selected activation function is the sigmoid (or hyperbolic tangent) described in Sect. 1.10.2. The dynamics of the network remain asynchronous and the new state of the ith neuron is given by a generic monotonically increasing function. Choosing the sigmoid function, the state of the neuron results in

$$y_i = \sigma(net_i) = \frac{1}{1 + e^{-net_i}} \tag{2.91}$$

where neti indicates the activation potential of the ith neuron. It is assumed that the state of a neuron changes slowly over time. In this way, the change of state of the other neurons does not happen instantaneously but with a certain delay. The activation potential of the ith neuron changes over time according to the following:

$$\frac{d}{dt}\, net_i = \eta \left( -net_i + \sum_{j=1}^{M} w_{ij}\, y_j \right) = \eta \left( -net_i + \sum_{j=1}^{M} w_{ij}\, \sigma(net_j) \right) \tag{2.92}$$

where η indicates the (positive) learning parameter and wij the synaptic weight between neurons i and j. With (2.92), a discrete approximation of the differential d(neti) is calculated, which is added to the current value of the activation potential neti and leads the neuron to the new state yi expressed by (2.91). The objective is now to demonstrate how this new, more realistic network model, with continuous-state dynamics, can converge toward fixed or cyclic attractors. To demonstrate convergence, Hopfield proposed [12] an energy functional slightly different from that of the discrete model, given by

$$E = -\frac{1}{2} \sum_{i,j=1;\, j \neq i}^{M} w_{ij} y_i y_j + \sum_{i=1}^{M} \int_{0}^{y_i} \sigma^{-1}(y)\, dy \tag{2.93}$$

At this point, it is sufficient to demonstrate that the energy of the network decreases after each update of the state of the neurons. This energy decrease is calculated with the following:

$$\frac{dE}{dt} = - \sum_{i,j=1;\, j \neq i}^{M} w_{ij}\, y_j \frac{dy_i}{dt} + \sum_{i=1}^{M} \sigma^{-1}(y_i)\, \frac{dy_i}{dt} \tag{2.94}$$

Considering the symmetry property of the network (that is, wij = wji) and since the inverse of the sigmoid function exists, neti = σ⁻¹(yi), the (2.94) can be simplified as

$$\frac{dE}{dt} = - \sum_{i=1}^{M} \frac{dy_i}{dt} \left( \sum_{j=1}^{M} w_{ij}\, y_j - net_i \right) \tag{2.95}$$


The expression in round brackets (•) is replaced considering (2.92), and we get the following:

$$\frac{dE}{dt} = -\frac{1}{\eta} \sum_{i=1}^{M} \frac{dy_i}{dt}\, \frac{d}{dt}\, net_i \tag{2.96}$$

and replacing d(neti)/dt, considering the derivative of the activation function σ (using the rule of derivation of composite functions), we obtain

$$\frac{dE}{dt} = -\frac{1}{\eta} \sum_{i=1}^{M} \sigma'(net_i) \left( \frac{d}{dt}\, net_i \right)^{2} \tag{2.97}$$

It is shown that the derivative of the inverse function σ⁻¹ is always positive, considering that the sigmoid is a strictly monotone function.⁷ The parameter η is also positive, as well as (d(neti)/dt)². It can then be affirmed, considering the negative sign, that the right-hand side of (2.97) will never be positive, and consequently the derivative of the energy function with respect to time will never be positive, i.e., the energy can never grow while the network dynamics evolves over time:

$$\frac{dE}{dt} \leq 0 \tag{2.98}$$

The (2.98) implies that the dynamics of the network is governed by the energy E, which is reduced or remains stable at each update. Furthermore, a stable state is reached when the (2.98) vanishes, and it corresponds to an attractor of the state space. This happens when d(neti)/dt ≈ 0 ∀i, i.e., the state of all the neurons does not change significantly. Convergence can take a long time, since d(neti)/dt gets smaller and smaller. The network is, however, guaranteed to converge around a local minimum in a finite time.
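As an illustration of this continuous-time dynamics, a simple forward-Euler integration of Eq. (2.92) might look as follows; the step size dt, the parameter η, and the number of steps are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_continuous_hopfield(W, net0, eta=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of Eq. (2.92); W symmetric with zero diagonal."""
    net = net0.astype(float).copy()
    for _ in range(steps):
        y = sigmoid(net)                       # neuron states, Eq. (2.91)
        dnet = eta * (-net + W @ y)            # right-hand side of Eq. (2.92)
        net += dt * dnet                       # Euler step
    return sigmoid(net)                        # final (approximately stable) states
```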

2.12.3.1 Summary The Hopfield network has been an important step in advancing knowledge on neural networks and has revitalized the entire research environment in this area. Hopfield established the connection between neural networks and physical systems of the type considered in statistical mechanics. Other researchers had already considered the associative memory model in the 1970s more generally. With the architecture of the symmetrical connection network with a diagonal zero matrix, it was possible to design recurrent networks of stable states. Furthermore, with the introduction of the 7 Recall the nonlinearity property of the sigmoid function σ (t) ensuring the limited definition range. In this context the following property is exploited

dσ(t)/dt = σ(t)(1 − σ(t)), a polynomial relation between the derivative and the function itself, very simple to calculate.


concept of energy function, it was possible to analyze the convergence properties of the networks. Compared to other models the Hopfield network has a simpler implementation also hardware. The strategy adopted for the updating of the state of the network corresponds to the physical methods of relaxation which from a perturbed state a system is placed in equilibrium (stable state). The properties of the Hopfield network have been studied by different researchers both from the theoretical and implementation point of view including the realization of optical and electronic devices.

2.12.4 Boltzmann Machine

In the previous section, we described the Hopfield network based on the minimization of the energy function, without the guarantee of achieving a global optimum, even if, once the synaptic weights have been determined, the network spontaneously converges to stable states. Taking advantage of this property of the Hopfield network, it was possible to introduce variants of this model to avoid the problem of local minima of the energy function. At the conceptual level, we can consider that the network reaches a state of minimum energy but could accidentally jump into a higher energy state. In other words, a stochastic variant of the network dynamics could help the network avoid a local minimum of the energy function. This is possible through the best known stochastic dynamics model, known as the Boltzmann Machine (BM). A neural network based on the BM [13] can be seen as the stochastic dynamic version of the Hopfield network. In essence, the deterministic approach used for the state updates of the Hopfield network (Eq. 2.74) is replaced with a stochastic approach that updates the state yi of the ith neuron, always in asynchronous mode, but according to the following rule:

$$y_i = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{with probability } (1 - p_i) \end{cases} \tag{2.99}$$

where

$$p_i = \frac{1}{1 + e^{-\Delta E_i / T}} = \frac{1}{1 + \exp\!\left( -\Big( \sum_{j=1}^{M} w_{ij} y_j - \theta_i \Big) \big/ T \right)} \tag{2.100}$$

2.12 Recurrent Neural Networks

237

values.8 With the simulated annealing approach considering that the connections between neurons are symmetrical for all (wij = wji ), then each neuron calculates the energy gap Ei as the difference between energy of the inactive state E(yi = 0) and the active one E(yi = 1). Returning to the (2.100) we point out that for low values of T we have   E −→ 0 =⇒ pi −→ 1 T −→ 0 =⇒ exp − T having assumed that E > 0. This situation brings the updating rule back to the deterministic dynamic method of the Hopfield network of the discrete case. If instead T is very large, as happens at the beginning of activation of the BM network, we have   E −→ 1 =⇒ pi −→ 0.5 T −→ ∞ =⇒ exp − T which indicates the probability of accepting a change of state, regardless of whether this change leads to benefits or disadvantages. In essence, with T large the BM network explores all states. As for the Hopfield network, the energy function of the Boltzmann machine results E=−

1 2

M  i,j=1;j=i

wij yi yj +

M 

yi θi

(2.101)

i=1

The network update activity puts the same in the local minimum configuration of the energy function, associating (memorizing) the patterns in the various local minima. This occurs with BM starting with high-temperature values and then through the simulated annealing process it will tend to stay longer in attraction basins with deeper minimum values, and there is a good chance of ending in a global minimum. The BM network compared to Hopfield’s also differs because it divides neurons into two groups: visible neurons and hidden neurons. The visible ones interface with the external environment (they perform the function of input/output) while the hidden ones are used only for internal representation and do not receive signals from the

⁸ This stochastic optimization method, which attempts to find a global minimum in the presence of local minima, is known as simulated annealing. In essence, the physical process adopted in the heat treatment (heating, then slow cooling, then fast cooling) of ferrous materials is simulated to make them more resistant and less fragile. At high temperatures, the atoms of these materials are excited, but during the slow cooling phase they have the time to assume an optimal crystalline configuration, such that the material is free of irregularities and reaches a global minimum. This heat quenching treatment of the material can avoid local minima of the lattice energy, because the dynamics of the particles include a temperature-dependent component. In fact, during cooling, the particles lose energy but sometimes acquire energy, thus entering states of higher energy. This phenomenon prevents the system from settling into less deep minima.


The BM network also differs from Hopfield's because it divides the neurons into two groups: visible neurons and hidden neurons. The visible ones interface with the external environment (they perform the function of input/output), while the hidden ones are used only for internal representation and do not receive signals from the outside. The state of the network is given by the states of the two types of neurons. The learning algorithm has two phases. In the first phase, the network is activated keeping the visible neurons clamped to the values of the input pattern, and the network tries to reach thermal equilibrium toward low temperatures. Then the weights between any pair of neurons that are both in the active state are increased (in analogy with the Hebbian rule). In the second phase, the network is activated freely, without clamping the visible neurons, and the states of all the neurons are determined through the process of simulated annealing. Once the state of thermal equilibrium toward low temperatures has (perhaps) been reached, there are sufficient samples to obtain reliable averages of the products yi yj. It is shown that the learning rule [13-15] of the BM is given by the following relation:

$$\Delta w_{ij} = -\frac{\eta}{T}\left( \langle y_i y_j \rangle_{fixed} - \langle y_i y_j \rangle_{free} \right) \tag{2.102}$$

where η is the learning parameter; ⟨yi yj⟩fixed denotes the expected average value of the product of the states of neurons i and j during the training phase with the visible neurons clamped; ⟨yi yj⟩free denotes the expected average value of the product of the states of neurons i and j when the network is freely activated, without clamping the visible neurons. In general, the Boltzmann machine has found widespread use, with good performance, in applications requiring decisions on stochastic bases. In particular, it is used in all those situations where the Hopfield network does not converge to the global minima of the energy function. The Boltzmann machine has had considerable importance from a theoretical and engineering point of view. One of its architecturally efficient versions is known as the Restricted Boltzmann Machine (RBM) [16]. An RBM network keeps the set of visible neurons distinct from the hidden ones, with the particularity that no connection is allowed between neurons of the same set. Only connections between visible and hidden neurons are allowed. With these restrictions, the hidden neurons are independent of each other given the visible neurons, allowing the computation, in a single parallel step, of the expected average value of the products ⟨yi yj⟩fixed of the states of neurons i and j during the training phase with the visible neurons clamped. The calculation of the ⟨yi yj⟩free products instead requires several iterations involving parallel updates of all the visible and hidden neurons. The RBM network has been applied with good performance to speech recognition, dimensionality reduction [17], classification, and other applications.
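To illustrate the architectural point about the RBM, namely that the hidden units are conditionally independent given the visible ones, here is a minimal sketch of one parallel sampling step; the logistic units and the weight shapes follow the usual convention, while the specific sizes, values, and the helper name sample_hidden are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W, c, T=1.0):
    """One parallel step: all hidden units are sampled independently given the
    visible vector v (there are no hidden-hidden connections in an RBM)."""
    p_h = 1.0 / (1.0 + np.exp(-(W @ v + c) / T))   # activation probabilities
    return (rng.random(len(c)) < p_h).astype(float), p_h

# Illustrative sizes: 6 visible units, 4 hidden units.
W = rng.normal(scale=0.1, size=(4, 6))             # hidden-visible weights
c = np.zeros(4)                                    # hidden biases
v = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])       # a clamped visible pattern
h, p_h = sample_hidden(v, W, c)
```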

2.13 Deep Learning

In the previous paragraphs we have described several neural network architectures applied to machine learning and, above all, to supervised and unsupervised automatic classification, based on the backpropagation method, commonly used in


conjunction with an algorithm that optimizes the descent of the gradient to update the weights of the neurons by calculating the gradient of the cost (or target) function. In recent years, starting in 2000, with the advancement of Machine Learning research, the neural network sector has vigorously recovered thanks to the wider availability of low-cost multiprocessor computing systems (computer clusters, GPU graphics processors, ...), the need to process large amounts of data (big data) in various applications (industry, public administration, the social/telematic sector, ...), and above all the development of new algorithms for automatic learning applied to artificial vision [18,19], speech recognition [20], and text and language processing [21]. This advancement of research on neural networks has led to the development of new machine learning algorithms, based also on the architectures of the traditional neural networks developed since the 1980s. The strategic progress came with the results achieved by the newly developed neural networks, referred to as Deep Learning (DL) [22]. In fact, Deep Learning is intended as a set of algorithms that use neural networks as a computational architecture to automatically learn the significant features from the input data. In essence, with the Deep Learning approach, the limitations of traditional neural networks in approximating nonlinear functions have been overcome, and the extraction of features in complex application contexts with large amounts of data is resolved automatically, exhibiting excellent adaptability (recurrent neural networks, convolutional neural networks, etc.). The meaning of deep learning is also understood in the sense of the multiple layers involved in the architecture. In more recent years, networks with deeper learning are based on the use of rectified linear units [22] as the activation function in place of the traditional sigmoid function (see Sect. 1.10.2) and on the dropout regularization technique [20] to reduce the problem of overfitting in neural networks (already described in Sect. 1.11.5).

2.13.1 Deep Traditional Neural Network

In Sect. 1.11.1 we have already described the MLP network, which can in fact be considered a deep neural network since it can have more than two intermediate layers of neurons between input and output, even though it belongs to the traditional neural networks, its neurons being completely connected between adjacent layers. Therefore, even an MLP using many hidden layers can be regarded as deep learning. However, with the learning algorithm based on backpropagation, as the number of hidden layers increases, the process of adapting the weights in a significant way becomes increasingly difficult, even if from a theoretical point of view it would be possible. This is found experimentally when the weights are randomly initialized and subsequently, during the training phase, the backpropagation algorithm propagates the error backward from the right (the output layer) to the left (toward the input layer), computing the partial derivatives with respect to each weight and moving in the direction opposite to the gradient of the objective function. This weight-adjustment process becomes more and more complex as the number of



hidden layers of neurons increases, since the weight updates and the objective function may not converge toward optimal values for the given data set. Consequently, the MLP with the backpropagation process does not, in fact, parametrize the traditional neural network optimally, however deep it is in terms of number of hidden layers. The limits of learning with backpropagation derive from the following problems:

1. Prior knowledge about the data to be classified is not always available, especially when dealing with large volumes of data such as color images, time-varying image sequences, multispectral images, etc.
2. The learning process based on backpropagation tends to get stuck in local minima when using the gradient descent method, especially when there are many layers of totally connected neurons. This is known as the vanishing gradient problem: the adaptation of the weights through activation functions (log-sigmoid or hyperbolic tangent) tends to vanish. Even if the theory of weight adaptation by error backpropagation with gradient descent is mathematically rigorous, it fails in practice, since the gradient values (computed with respect to the weights), which determine how much each neuron should change, become smaller and smaller as they are propagated backward from the deeper layers of the neural network.9 This means that the neurons of the earlier layers learn very slowly compared to the neurons in the deeper layers; the neurons of the first layers thus learn less and more slowly (a small numerical illustration of this effect follows after this list).
3. The computational load becomes significant as the depth of the network and the amount of training data increase.
4. The network optimization process becomes very complex as the number of hidden layers increases significantly.

A solution to these problems is given by the Convolutional Neural Networks (CNN) [23,24] that we will introduce in the following paragraph.
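The following toy NumPy sketch (an illustration under assumed values, not part of the original text) shows how repeated multiplication by sigmoid derivatives, which are at most 0.25, shrinks the backpropagated gradient as the number of layers grows, as described in point 2 and in footnote 9.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backpropagate a unit error through a chain of L sigmoid neurons:
# each step multiplies the gradient by sigma'(net) <= 0.25.
rng = np.random.default_rng(1)
for L in (2, 5, 10, 20):
    grad = 1.0
    for _ in range(L):
        net = rng.normal()            # toy pre-activation value of one neuron
        s = sigmoid(net)
        grad *= s * (1.0 - s)         # chain rule factor of this layer
    print(f"{L:2d} layers -> gradient magnitude ~ {grad:.2e}")
```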

2.13.2 Convolutional Neural Networks-CNN

The strategy of a Deep Learning (DL) network is to adopt a deep learning process, using algorithms that learn features through deep neural networks. The idea is to extract significant features from the input dataset, to be used later in the learning phase of the DL network. The quality of the learning model is strongly linked to the nature of the input data. If the latter intrinsically carry little discriminating information, the performance of

9 We recall that the adaptation of the weights in the MLP occurs by differentiating the objective function, which uses the sigmoid function, whose derivative is less than 1. Applying the chain rule leads to multiplying many terms smaller than 1, with the consequent problem of considerably reducing the gradient values as one proceeds toward the layers furthest from the output.



Fig. 2.16 Example of convolution of an input volume of size H × W × D with multiple filters of size h × w × D. Placing a filter on one location of the input volume generates a single scalar. Translating the filter over the whole image, thus implementing the convolution operator, generates a single map (activation map) of reduced size (if padding and stride are not used, as discussed below). The figure shows that by applying more filters, and therefore more convolutions, to the input volume, more activation maps are obtained

the automatic learning algorithms is negatively affected, since they operate on features of little significance. Therefore, the goal behind CNNs is to automatically learn the significant features from input data that are normally affected by noise. In this way, significant features are provided to the deep network so that it can learn more effectively. We can think of deep learning as a set of algorithms for the automatic detection of features, conceived to overcome the problem of gradient descent in non-optimal situations and to facilitate learning in a network with many hidden layers. In the context of image classification, a CNN uses the concept of ConvNet, in fact a neural convolution that uses windows (convolution masks), for example of size 5 × 5, thought of as receptive fields associated with the neurons of the adjacent layer, which slide over the image to generate a feature map, also called activation map, which is propagated to the next layers. In essence, the convolution mask of fixed size is the object of learning, based on the data supplied as input (the sequence of image/label pairs contained in the dataset). Let us see how a generic convolution layer is implemented. Starting from a volume of size H × W × D, where H is the height, W is the width, and D indicates the number of channels (D = 3 in the case of a color image), we define one or more convolution filters of size h × w × D. Note that the number of filter channels and the number of input volume channels must be the same. Figure 2.16 shows an example of convolution. The filter generates a scalar when positioned on a single location of the volume, while sliding it over the entire input volume (implementing the convolution operator) generates a new image called the activation map.
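A minimal NumPy sketch of this sliding-filter operation (stride 1, no padding, single filter) is given below; the function and variable names are illustrative assumptions, not the book's implementation.

```python
import numpy as np

def conv2d_single_filter(x, filt, b, stride=1):
    """Slide one filter over the input volume and return one activation map.

    x    : input volume of shape (H, W, D)
    filt : filter of shape (h, w, D), same depth D as the input
    b    : scalar bias
    """
    H, W, D = x.shape
    h, w, Df = filt.shape
    assert D == Df, "filter depth must match input depth"
    H_out = (H - h) // stride + 1
    W_out = (W - w) // stride + 1
    amap = np.empty((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i*stride:i*stride + h, j*stride:j*stride + w, :]
            amap[i, j] = np.sum(patch * filt) + b   # inner product w x + bias
    return amap

# Illustrative usage: 32x32 RGB input and a 5x5x3 filter -> 28x28 activation map
x = np.random.rand(32, 32, 3)
filt = np.random.randn(5, 5, 3)
print(conv2d_single_filter(x, filt, b=0.1).shape)   # (28, 28)
```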



The activation map will have high values where the structures of the image or input volume are similar to the filter coefficients (which are the parameters to be learned), and low values where there is no similarity. If x is the image or input volume, let w denote the filter parameters and b the bias; the result obtained is given by w x + b. For example, for a color image with D = 3 and a filter of height and width both equal to 5, we have a number of parameters equal to 5 × 5 × 3 = 75, plus the bias b, for a total of 76 parameters to learn. The same image can be analyzed with K filters, generating several different activation maps, as shown in Fig. 2.16. There is a certain analogy between the local functionality (defined by the receptive field) of biological neurons (simple and complex cells), described in Sect. 4.6.4 for capturing movement and elementary patterns in the visual field, and the functionality of the neurons in the convolutional layer of a CNN. Both neurons propagate very little information to successive layers in relation to the area of the receptive field, greatly reducing the number of total connections. In addition, in both cases the spatial information of the visual field (retinal map and feature map) propagated to subsequent layers is preserved. Let us now look at the multilayered architecture of a CNN network and how the network evolves from the convolutional layer; for clarity we will refer to a CNN implemented for classification.

2.13.2.1 Input Layer
Normally this is the input image, with dimensions W0 × H0 × D0. It can be considered as the input volume with D0 channels, corresponding, for example, to an RGB color image where D0 = 3.

2.13.2.2 Convolutional Layer
A CNN network is composed of one or more convolutional layers generating a hierarchy of layers. The first output volume (first convolutional layer) is obtained with the convolution process, schematized in Fig. 2.16, directly linked to the pixels of the input volume. Each neuron of the output volume is connected to w × w × D0 input neurons, where w is the size of the square convolution mask. The entire output volume is characterized by the following parameters:

1. Number K of convolution filters. Normally chosen as a power of 2, for example 4, 8, 16, 32, and so on.
2. Size w × w of the convolution mask. It normally takes values w = 3, 5, . . .
3. Stride. The translation step S of the 3D mask over the input volume (in the case of color images D0 = 3). In normal convolutions it is 1, while S = 2 is used for values of w > 3. This parameter affects the 2D reduction factor of the activation maps and consequently the number of connections.
4. Padding. Size of the border to be added (with zeros) to the input image to manage the convolution operator at the edges of the input volume. The size of the border is



defined with the padding parameter P, which affects the size of the 2D base of the output volume, i.e., the size of the activation maps (or feature maps), as follows:

W1 = (W0 − w + 2P)/S + 1     H1 = (H0 − w + 2P)/S + 1    (2.103)

For P = 0 and S = 1 the size of the feature maps is W1 = H1 = W0 − w + 1. The parameters defined above are known as hyperparameters of the CNN network.10 Having defined the hyperparameters of the convolutional layer, the feature volume has dimensions W1 × H1 × D1, where W1 and H1 are given by Eq. (2.103), while D1 = K coincides with the number of convolution masks. It should be noted that the number of weights shared by each feature map is w × w × D0, for a total of K × w × w × D0 weights for the entire feature volume. For a 32 × 32 × 3 RGB image and 5 × 5 × 3 masks, each neuron of the feature map would compute a value resulting from the convolution given by an inner product of size 5 × 5 × 3 = 75 (convolution between the 3D input volume and the 3D convolution mask, neglecting the bias weight). The corresponding activation volume would be W1 × H1 × D1 = 28 × 28 × K, assuming P = 0 and S = 1. For K = 4 masks we would have, between the input layer and the convolutional layer, a total number of connections equal to (28 × 28 × 4) × (5 × 5 × 3) = 235200 and a total number of weights equal to (5 × 5 × 3) · 4 = 300, plus the 4 biases b, giving 304 parameters to be learned. In this example, for each portion of size (5 × 5 × 3) of the input volume, as extended as the mask, 4 different neurons observe the same portion, extracting 4 different features.
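The small helper below (names are illustrative assumptions) applies Eq. (2.103) and reproduces the parameter and connection counts of the example just given.

```python
def conv_layer_stats(W0, H0, D0, K, w, S=1, P=0):
    """Output size (Eq. 2.103), learnable parameters and connections of a conv layer."""
    W1 = (W0 - w + 2 * P) // S + 1
    H1 = (H0 - w + 2 * P) // S + 1
    weights = K * (w * w * D0)            # shared weights of the K masks
    params = weights + K                  # plus one bias per mask
    connections = (W1 * H1 * K) * (w * w * D0)
    return (W1, H1, K), params, connections

# Example from the text: 32x32x3 RGB input, K = 4 masks of size 5x5x3, P = 0, S = 1
shape, params, conns = conv_layer_stats(32, 32, 3, K=4, w=5)
print(shape, params, conns)   # (28, 28, 4) 304 235200
```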

2.13.2.3 ReLU Layer
As seen above, the activity level of each convolutional filter is measured by net = w x + b, which is supplied as input to an activation function. We have previously discussed sigmoid neurons, in which the sigmoid, defined as a mapping [−∞, ∞] → [0, 1], presents some problems when the neurons go into saturation.

10 In the field of machine learning there are two types of parameters: those that are learned during the learning process (for example, the weights of a logistic regression or of a synaptic connection between neurons), and the intrinsic parameters of a learning model, whose optimization takes place separately. The latter are known as hyperparameters, or optimization parameters associated with a model (such as a regularization parameter, the depth of a decision tree, or, in the context of deep neural networks, the number of neuronal layers and other parameters that define the architecture of the network).



Fig. 2.17 a Rectified Linear Unit (ReLU); b Leaky ReLU; c Exponential Linear Units (ELU)

When the neurons saturate, the sigmoid zeroes the gradient and does not allow the weights of the network to be modified. Furthermore, its output is not centered around 0, the exponential is slightly more expensive computationally, and it slows down the gradient descent process. More efficient activation functions have subsequently been proposed, such as the ReLU [19], of which we also describe some variants below.

ReLU.

The ReLU (Rectified Linear Unit) activation function is defined by

f (net) = net if net ≥ 0, and f (net) = 0 if net < 0    (2.104)

It plays the same role as the sigmoid used in MLPs, but has the characteristic of increasing the nonlinearity property and eliminates the problem of the vanishing gradient highlighted previously (see Fig. 2.17a). There is no saturation for positive values: the derivative is zero for negative or null values of net and 1 for positive values. This layer is not characterized by parameters and the convolution volume remains unchanged in size. Besides being computationally less expensive, it allows the training phase to converge much faster than with the sigmoid and is more biologically plausible.


Leaky ReLU.


Proposed in [25], it is defined by

f (net) = net if net ≥ 0, and f (net) = 0.01 · net if net < 0

As shown in Fig. 2.17b, it does not vanish when net becomes negative. It has the characteristic of not saturating the neurons and does not zero the gradient.

Parametric ReLU. Proposed in [26], it is defined by

f (net, α) = net if net ≥ 0, and f (net, α) = α · net if net < 0

It is observed that α ≤ 1 implies f (net) = max(net, α · net). As we can see in Fig. 2.17a, it does not vanish when net becomes negative. It has the same characteristics as the previous function; in addition, it improves the fit of the model at a low computational cost and reduces the risk of overfitting. The hyperparameter α is arbitrary and can be learned, since we can backpropagate through it. In this way it gives the neurons the possibility to choose the best slope in the negative region.

Exponential Linear Units (ELU). Defined as follows:

f (net, α) = net if net > 0, and f (net, α) = α(e^net − 1) if net ≤ 0    (2.105)

This has been proposed to speed up the convergence phase and to alleviate the problem of the vanishing gradient [27]. In the negative part it does not vanish, as for the Leaky ReLU (see Fig. 2.17c).
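The following minimal NumPy versions of Eqs. (2.104)–(2.105) and of the two ReLU variants are a sketch for illustration only; the α values passed here are assumptions (in practice the PReLU slope is learned).

```python
import numpy as np

def relu(net):                      # Eq. (2.104)
    return np.maximum(net, 0.0)

def leaky_relu(net, slope=0.01):    # Leaky ReLU [25]
    return np.where(net >= 0, net, slope * net)

def prelu(net, alpha):              # Parametric ReLU [26]; alpha is a learned slope
    return np.where(net >= 0, net, alpha * net)

def elu(net, alpha=1.0):            # Eq. (2.105), ELU [27]
    return np.where(net > 0, net, alpha * (np.exp(net) - 1.0))

net = np.linspace(-3, 3, 7)
print(relu(net))
print(leaky_relu(net))
print(prelu(net, alpha=0.2))
print(elu(net))
```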

2.13.2.4 Pooling Layer
It is applied downstream of the convolution, i.e., on the activation maps computed with the functions described above, and has the objective of subsampling each map (see Fig. 2.18a). In essence, it operates on each dth activation map independently. The most used information aggregation operator is the maximum (Max) (see Fig. 2.18b) or the average (Avg), applied locally on each map. These aggregation operators have the characteristic of being invariant with respect to small translations of the features. The pooling has no parameters, but only hyperparameters, identified in the size of the windows on which to apply the Max or Avg operators, and in the stride. The volume of size W1 × H1 × D1 is reduced in the pooling layer to the dimensions W2 × H2 × D2, where

W2 = (W1 − w)/S + 1     H2 = (H1 − w)/S + 1     D2 = D1    (2.106)



Fig. 2.18 a Subsampling of the activation maps produced in the convolution layer. b Max pooling applied to a rectified (ReLU) map of size 4 × 4, with pooling window 2 × 2 and stride = 2; a map of size 2 × 2 is obtained in output

These relations are analogous to (2.103), but in this case, for the pooling, there is no padding and therefore the parameter P = 0. Normally the values of the parameters used are (w, S) = (2, 2) and (w, S) = (3, 2). It is observed that also in this layer there are no parameters to learn, as in the ReLU layer. An example of max pooling is shown in Fig. 2.18b and sketched in code below.
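The sketch below (a minimal NumPy illustration, with assumed names) implements the 2 × 2, stride-2 max pooling of Fig. 2.18b applied independently to each activation map; the output size follows Eq. (2.106).

```python
import numpy as np

def max_pool(volume, w=2, S=2):
    """Max pooling applied independently to each of the D1 activation maps.

    volume : array of shape (H1, W1, D1); output shape follows Eq. (2.106).
    """
    H1, W1, D1 = volume.shape
    H2 = (H1 - w) // S + 1
    W2 = (W1 - w) // S + 1
    out = np.empty((H2, W2, D1))
    for i in range(H2):
        for j in range(W2):
            window = volume[i*S:i*S + w, j*S:j*S + w, :]
            out[i, j, :] = window.max(axis=(0, 1))   # local maximum per map
    return out

# 4x4 map, 2x2 window, stride 2 -> 2x2 output, as in Fig. 2.18b
maps = np.random.rand(4, 4, 1)
print(max_pool(maps).shape)   # (2, 2, 1)
```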

2.13.2.5 Fully Connected Layer
So far we have seen how to implement convolutional filters, activation maps using activation functions, and subsampling operations using pooling, which generates small activation (or feature) maps. In a CNN, the size of the volumes is reduced at each level of depth (variables H and W), but generally they become deeper in terms of the number of channels D. Usually, the output of the last convolution layer is provided as input to the Fully Connected (FC) layer. The FC layer is identical to any layer of a traditional deep network, where the neurons of the previous layer are totally connected to those of the following layer. The goal of the FC layer is to use the features determined by the convolutional layers to classify the input image into the various classes defined by the training set images. The volume of the FC layer is 1 × 1 × K, where in this case K, if it is the last output layer, indicates the number of classes of the dataset to be classified. Therefore, K is the number of neurons totally connected with all the neurons of the previous FC layer or volume. The activation of these neurons occurs with the same formalism as in traditional networks (see Sect. 1.10.3). In summary, a CNN network, starting from the input image, extracts the significant features through the convolution layer, organizing them in the convolutional volume, which includes as many feature maps as there are convolution masks. Then follows a ReLU layer, which applies an activation function generating the activation volume of the same size as the convolutional one, and a pooling layer to subsample the feature maps. These three layers Conv, ReLU, and Pool can be replicated in cascade, depending on the size of the input volume and the level of reduction to be achieved, ending with



Fig. 2.19 Architectural scheme of a CNN network, very similar to the original LeNet network [24] (designed for automatic character recognition), used here to classify 10 types of animals. The network consists of the ConvNet components for extracting the significant features of the input image, which are given as input to the traditional MLP network component, based on three fully connected FC layers, which performs the function of classifier

the FC layer, representing a volume of size 1 × 1 × C completely connected with all the neurons of the previous layer, where the number of neurons C represents the number of classes of the dataset under examination. In some publications the sequence Conv, ReLU, and Pool is considered to be part of the same level or layer; the reader should therefore be careful not to get confused. The Conv, ReLU, and Pool layers thus defined make the CNN network invariant to small transformations, distortions, and translations in the input image. A slight distortion of the input image will not significantly change the feature maps in the pooling layer, since the Avg and Max operators are applied on small local windows. We therefore obtain a representation that is almost invariant to the image scale (equivariant representation). It follows that we can detect objects in the image regardless of their position. Figure 2.19 shows the architectural scheme of a CNN network (used to classify 10 types of animals), very similar to the original LeNet network (designed for automatic character recognition), composed of 2 convolutional layers and two max pooling layers, arranged alternately, and of the fully connected (FC) layers that make up a traditional MLP network (layers FC1, FC2, and the logistic regression layer). As shown in the figure, given the image of the koala as input, the network recognizes it by assigning the highest probability to the koala class among all the 10 types of animals for which it was previously trained. As will be seen below, the output vector of the output layer is characterized in such a way that the sum of all the class probabilities is 1. The network tries to assign the highest probability to the koala class for the input image presented in the figure.
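As a sketch of such a layer stack, the snippet below defines a LeNet-style model (Conv → ReLU → Pool twice, followed by FC layers and a 10-way output), assuming the PyTorch library; the channel counts and layer sizes are illustrative assumptions, not the exact configuration of Fig. 2.19. At inference, applying a softmax to the 10 output scores yields the class probabilities mentioned above.

```python
import torch.nn as nn

# Illustrative LeNet-style CNN for 3x32x32 inputs and 10 classes (sizes are assumptions).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5),    # Conv1: 3x32x32 -> 8x28x28
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                # Pool1: 8x28x28 -> 8x14x14
    nn.Conv2d(8, 16, kernel_size=5),   # Conv2: 8x14x14 -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                # Pool2: 16x10x10 -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # FC1
    nn.ReLU(),
    nn.Linear(120, 84),                # FC2
    nn.ReLU(),
    nn.Linear(84, 10),                 # output layer: 10 class scores
)
```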



2.13.3 Operation of a CNN Network

Having analyzed the architecture of a CNN network, we now see its operation in the context of classification. In analogy with the operation of a traditional MLP network, the CNN network envisages the following phases in the forward/backward training process:

1. Dataset normalization. Images are normalized with respect to the average. There are two ways to calculate the average. Suppose we have color images, hence with three channels. In the first case, the average is calculated for each channel and is subtracted from the pixels of the corresponding channel; the average is a scalar referred to the R, G, and B bands of the single image. In the second case, all the training images of the dataset are analyzed, calculating an average color image, which is then subtracted pixel by pixel from each image of the dataset. In the latter case, the average image must be remembered in order to remove it from a test image. Furthermore, there are data augmentation techniques, which serve to increase the cardinality of the dataset and to improve the generalization of the network. These techniques modify each image by flipping it along both axes, generating rotated versions of it, and inserting slight geometric and radiometric transformations.
2. Initialization. The hyperparameter values are defined (size w of the convolution masks, number of filters K, stride S, padding P) for all the Conv, Pool, and ReLU layers, and the weights are initialized according to the Xavier method. The Xavier initialization method, introduced in [28], is an optimal way to initialize the weights with respect to the activation functions, avoiding convergence problems. The weights are initialized as follows:

   Wi,j = U(−1/√n, 1/√n)    (2.107)

   with U a uniform distribution and n the number of connections with the previous layer (the columns of the weight matrix W). The initialization of weights is a fairly active research area; other methods are discussed in [29–31].
3. Forward propagation. The images of the training set are presented as input (input volume W0 × H0 × D0) to the Conv layer, which generates the feature maps subsequently processed in cascade by the ReLU and Pool layers, and finally given to the FC layer, which calculates the final result for the classes.
4. Learning error calculation. FC evaluates the learning error if it is still above a predefined threshold. Normally the vector of the C classes contains the labels of the objects to be classified in terms of probability, and the difference between the target probabilities and the current FC output is calculated with a metric to estimate this error. Considering that in the initial process the weights are random, the error will surely be high and the process of backpropagation of the error is triggered as in MLPs.
5. Backward propagation. FC activates the backpropagation algorithm, as happens for a traditional network, calculating the gradient of the objective function (error) with respect to the weights of all FC connections and using gradient descent to update all the weights and minimize the objective function. From the FC layer the network error is backpropagated to the previous layers. It should be noted that all the parameters associated with learning (for example, the weights) are updated, while the parameters that characterize the CNN network, such as the number and size of the masks, are set at the beginning and cannot be modified.
6. Reiterate from phase 2.

With the learning process the CNN has calculated (learned) all the optimal parameters and weights needed to classify an image of the training set in the test phase. What happens if an image never seen in the learning phase is presented to the network? As with traditional networks, an image never seen before can be recognized as one of the learned images if it presents some similarity; otherwise, one hopes that the network will not recognize it. The learning phase, depending on the available computing system and the size of the image dataset, can require even days of computation. The time required to recognize an image in the test phase is a few milliseconds, but it always depends on the complexity of the model and the type of GPU available. The level of generalization of the network in recognizing images also depends on the ability to construct a CNN network with an adequate number of replicated convolutional layers (Conv, ReLU, Pool) and FC layers, the MLP component of a CNN network. In image recognition applications it is strategic to design convolutional layers capable of extracting as many features as possible from the training set of images, using optimal filters to extract in the first Conv layers the primal sketch information (for example, lines and corners) and in subsequent Conv layers other types of filters to extract higher level structures (shapes, texture). The monitoring of the parameters during the training phase is of fundamental importance: it allows training to be stopped in case of an error in the model or in the initialization of the parameters or hyperparameters, and specific actions to be taken. When the dataset is very large it is useful to divide it into three parts, leaving, for example, 60% for training, 20% of the data for the validation phase, and the remaining 20% for the test phase, to be used only at the end, once the best model has been found with the training and validation partitions. A minimal sketch of such a split follows.
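The following sketch (indices only; function name and seed are assumptions) produces the 60/20/20 train/validation/test partition mentioned above.

```python
import numpy as np

def split_indices(n_samples, train=0.6, val=0.2, seed=0):
    """Random 60/20/20 split of dataset indices into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(train * n_samples)
    n_val = int(val * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(10000)
print(len(train_idx), len(val_idx), len(test_idx))   # 6000 2000 2000
```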

2.13.3.1 Stochastic Gradient Descent in Deep Learning
So far, to update the weights and, in general, the parameters of the optimization model, we have used the gradient descent (GD) method. In machine learning problems where the training dataset is very large, the calculation of the cost function and of the gradient can be very slow, requiring considerable memory resources and computational time. Another problem with batch optimization methods is that they do not provide an easy way to incorporate new data. The optimization method based on Stochastic Gradient Descent (SGD) [32] addresses these problems by following the negative gradient of the objective function after processing only a few examples of



the training set. The use of SGD in neural network settings is motivated by the high computational cost of the backpropagation procedure over the complete training set. SGD can overcome this problem and still lead to rapid convergence. The SGD technique is also known as incremental gradient descent, which uses a stochastic technique to optimize a cost or objective function (the loss function). Consider having a dataset (xi, yi), i = 1, . . . , N, consisting of pairs of images and membership classes, respectively. The output of the network in the final layer is given by the score generated by the jth neuron for a given input xi, fj(xi, W) = W xi + b. Normalizing the outputs of the output layer, we obtain values that can be interpreted as probabilities; for the kth neuron

P(Y = k | X = xi) = e^{fk} / Σ_j e^{fj}    (2.108)

which is known as the softmax function, whose output for each output neuron is in the range [0, 1]. We define the cost function for the ith pair of the dataset as

Li = − log ( e^{f_yi} / Σ_j e^{fj} )    (2.109)

and the objective function on the whole training set is given by

L(W) = (1/N) Σ_{i=1}^{N} Li( f(xi, W), yi ) + λ R(W)    (2.110)

The first term of the total cost function is calculated over all the pairs supplied as input during the training process, and evaluates the consistency of the model contained in W up to that time. The second term is the regularization: it penalizes very high values of W and prevents the problem of overfitting, i.e., the CNN behaving too well with training data but then showing poor performance on data never seen before (generalization). The R(W) function can be one of the following:

R(W) = Σ_k Σ_l W²_{kl}

R(W) = Σ_k Σ_l |W_{kl}|

or a combination of the two with a hyperparameter β, obtaining

R(W) = Σ_k Σ_l ( β W²_{kl} + |W_{kl}| ).

For very large training datasets, the stochastic technique of the gradient descent is adopted. That is, we approximate the sum with a subset of the dataset, named



minibatch, consisting of training pairs chosen randomly, in a number that is generally a power of 2 (32, 64, 128 or 256), depending on the memory capacity available on the GPUs in use. That said, instead of calculating the full gradient ∇L(W), the stochastic gradient ∇W L(W) over the minibatch samples is calculated as follows:

∇W L(W) = (1/N) Σ_{i=1}^{N} ∇W Li( f(xi, W), yi ) + λ ∇W R(W)    (2.111)

where the minibatch stochastic gradient ∇W L(W) is an unbiased estimator of the gradient ∇L(W). The update of the weights in the output layer is given by

W := W − η ∇W L(W)    (2.112)

with the sign "−" to move in the direction that minimizes the cost function; η represents the learning parameter (learning rate or step size). The gradient of the cost function is propagated backward through the use of computational graph representations, which make the weights in each layer of the network simple to update. The discussion of computational graphs is not covered in the text, as well as the variants of SGD that integrate the momentum (discussed in the previous chapter), the adaptive gradient (ADAGRAD) [33], root mean square propagation (RMSProp) [34], the adaptive moment estimator (ADAM) [35], and the Kalman-based Stochastic Gradient Descent (kSGD) [36].
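The sketch below puts Eqs. (2.108)–(2.112) together for a simple linear score layer (bias omitted for brevity); it is a minimal NumPy illustration on random data, with hyperparameter values and names chosen only for the example.

```python
import numpy as np

def softmax(scores):                                      # Eq. (2.108)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def minibatch_sgd_step(W, X, y, rng, eta=1e-2, lam=1e-4, batch=64):
    """One SGD update (Eq. 2.112) of a linear score layer with softmax loss
    (Eq. 2.109) and L2 regularization."""
    idx = rng.choice(len(X), size=batch, replace=False)   # random minibatch
    Xb, yb = X[idx], y[idx]
    P = softmax(Xb @ W)                                   # class probabilities, shape (batch, C)
    loss = -np.log(P[np.arange(batch), yb]).mean() + lam * np.sum(W * W)
    dscores = P
    dscores[np.arange(batch), yb] -= 1.0                  # gradient of the data term w.r.t. scores
    gradW = Xb.T @ dscores / batch + 2 * lam * W          # minibatch estimate of Eq. (2.111)
    return W - eta * gradW, loss                          # Eq. (2.112)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                            # 500 toy samples, 20 features
y = rng.integers(0, 10, size=500)                         # 10 classes
W = 0.01 * rng.normal(size=(20, 10))
for _ in range(10):
    W, loss = minibatch_sgd_step(W, X, y, rng)
print(round(loss, 3))
```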

2.13.3.2 Dropout
We had already highlighted the problem of overfitting for traditional neural networks and, to reduce it, we used heuristic methods and verified their effectiveness (see Sect. 1.11.5). For example, one approach was to monitor the learning phase and stop it (early stopping) when the values of some parameters, checked with a certain metric and thresholds, indicated it (a simple way was to check the number of epochs). Validation datasets (not used during network training) were also used to verify the level of generalization of the network as the number of epochs changed and to assess whether the network's accuracy improved or deteriorated. For deep neural networks, with a much larger number of parameters, the problem of overfitting becomes even more serious. A regularization method to reduce the problem of overfitting in the context of deep neural networks is known as dropout, proposed in [37] in 2014. The dropout method, together with the use of rectified linear units (ReLUs), are two fundamental ideas for improving the performance of deep networks.



Fig. 2.20 Neural network with dropout [37]. a A traditional neural network with 2 hidden layers. b An example of a reduced network produced by applying the dropout heuristic to the network in (a). Neural units in gray have been excluded in the learning phase

The key idea of the dropout method11 is to randomly exclude neural units (together with their connections) from the neural network during each iteration of the training phase. This prevents neural units from co-adapting too much. In other words, the exclusion consists in ignoring, during the training phase, a certain set of neurons chosen in a stochastic way. These neurons are therefore not considered during a particular forward and backward propagation. More precisely, in each training phase the individual neurons are either dropped from the network with probability 1 − p or kept with probability p, so as to obtain a reduced neural network (see Fig. 2.20). Note that a neuron excluded in one iteration can be active in the subsequent iteration, precisely because the sampling of neurons occurs in a stochastic manner. The basic idea of excluding or keeping a neural unit is to prevent an excessive adaptation at the expense of generalization, i.e., to avoid a network highly trained on the training dataset that then shows little generalization on the test or validation dataset. To avoid this, dropout stochastically excludes neurons in the training phase, causing their contribution to the activation of the neurons in the downstream layer to be temporarily zeroed in the forward propagation, and

11 The heuristic of the dropout is better understood through the following analogy. Let us imagine that in a patent office the expert is a single employee. As often happens, if this expert were always present, all the other employees of the office would have no incentive to acquire skills in patent procedures. But if the expert decides every day, in a stochastic way (for example, by tossing a coin), whether to go to work or not, the other employees, unable to block the activities of the office, are forced, even occasionally, to adapt by acquiring skills. Therefore, the office cannot rely only on the single experienced employee: all the other employees are forced to acquire these skills. A sort of collaboration between the various employees is thus generated, when necessary, without their number being predefined. This makes the office as a whole much more flexible, increasing the quality and competence of the employees. In the jargon of neural networks, we would say that the network generalizes better.



consequently, the corresponding weights (of the synaptic connections) are not updated in the backward propagation. During the learning phase of the neural network, the weights of each neuron intrinsically model an internal state of the network on the basis of an adaptation process (weight update) that depends on specific features, providing a certain specialization. Neighboring neurons rely on this specialization, which can lead to a fragile model, too specialized for the training data. In essence, dropout tends to prevent the network from becoming too dependent on a small number of neurons and forces each neuron to be able to operate independently (it reduces co-adaptations). In this situation, if some neurons are randomly excluded during the learning phase, the other neurons have to intervene and attempt to handle the representation and prediction of the missing neurons. This leads to more internal, independent representations learned by the network. Experiments (in the context of supervised learning in vision, speech recognition, document classification, and computational biology) have shown that with such training the network becomes less sensitive to the specific weights of neurons. This, in turn, translates into a network that generalizes better and reduces the probability of over-adaptation to the training data. The functionality of the dropout is implemented by inserting in a CNN network a dropout layer between any two layers of the network. Normally it is inserted between the layers that have a high number of parameters, and in particular between the FC layers. For example, in the network of Fig. 2.19 a dropout layer can be inserted immediately before the FC layers, with the function of randomly selecting the neurons to be excluded with a certain probability (for example, p = 0.25, which implies that 1 neuron in 4 is randomly excluded) in each forward step (zero contribution to the activation of the neurons in the downstream layer) and in the backward propagation phase (weight update). The dropout functionality is not used during the network validation phase. The hyperparameter p must be tuned starting from a low probability value (for example, p = 0.2) and then increased up to p = 0.5, taking care to avoid very high values so as not to compromise the learning ability of the network. In summary, the dropout method forces a neural network to learn more robust features, useful in conjunction with many different random subsets of the other neurons. The weights of the network are learned in conditions in which a part of the neurons is temporarily excluded, whereas when the network is used in the test phase the number of neurons involved is greater. This tends to reduce overfitting, since the network trained with the dropout layer actually turns out to be a sort of average of different networks, each of which could potentially present an over-adaptation, although with different characteristics; in the end, with this heuristic, one hopes to reduce (by averaging) the phenomenon of overfitting. In essence, it reduces the co-adaptation of neurons, since a neuron, in the absence of other neurons, exhibits a different, forced adaptation in the hope of capturing (learning) more significant features, useful together with different random subgroups of other neurons. Most CNN networks insert a dropout layer; a minimal sketch of its forward pass is given below. A further widespread regularization method, known as Batch Normalization, is described in the following paragraph.
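The sketch below implements the dropout forward pass in the common "inverted" form (an implementation choice of this example: surviving activations are scaled by 1/p during training so that the test-time layer is the identity); here p is the keep probability, and the names are illustrative.

```python
import numpy as np

def dropout_forward(activations, p=0.5, train=True, rng=None):
    """Inverted dropout: each unit is kept with probability p during training.

    The surviving activations are scaled by 1/p so that the expected value
    seen by the next layer is unchanged; at test time the layer is the identity.
    The returned mask also zeroes the corresponding gradients in the backward pass.
    """
    if not train:
        return activations, None
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(activations.shape) < p) / p   # 0 for dropped units, 1/p otherwise
    return activations * mask, mask

a = np.ones((2, 8))
out, mask = dropout_forward(a, p=0.75)
print(out)
```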



2.13.3.3 Batch Normalization
A technique recently developed by Ioffe and Szegedy [38], called Batch Normalization, alleviates the difficulties of correctly initializing neural networks by explicitly forcing the activations to assume a unit Gaussian distribution in the training phase. This type of normalization is simple and differentiable. The goal is to prevent the early saturation of nonlinear activation functions such as the sigmoid function or ReLU, ensuring that all input data are in the same range of values. In the implementation, the BatchNorm (BN) layer is inserted immediately after the fully connected FC layers (or convolutional layers), and normally before the nonlinear activation functions. Normally each neuron of the lth hidden layer receives as input x(l) the output of the neurons of the previous layer (l − 1), obtaining the activation signal net(l)(x), which is then transformed by the nonlinear activation function f(l) (for example, the sigmoid or ReLU function):

y(l) = f(l)( net(l)(x) ) = f(l)( W(l) x(l−1) + b(l) )

Inserting the normalization process instead applies the BN function before the activation function, obtaining

ŷ(l) = f(l)( BN( net(l)(x) ) ) = f(l)( BN( W(l) x(l−1) ) )

where in this case the bias b is ignored for now since, as we shall see, its role will be played by the parameter β of BN. In reality the normalization process could be carried out over the whole training dataset but, used together with the stochastic optimization process described above, it is not practical to use the entire dataset. Therefore, the normalization is limited to each minibatch B = {x1, . . . , xm} of m samples during the stochastic training of the network. For a layer of the network with d-dimensional input y = (y(1), . . . , y(d)), according to the previous equations, the normalization of each kth feature ŷ(k) is given by

ŷ(k) = ( y(k) − E[y(k)] ) / √( Var[y(k)] )    (2.113)

where the expected value and the variance are calculated on the single minibatch By = {y1, . . . , ym} consisting of the m activations of the current layer. The average μB and the variance σB² are approximated using the data of the minibatch as follows:

μB ← (1/m) Σ_{i=1}^{m} yi

σB² ← (1/m) Σ_{i=1}^{m} (yi − μB)²

ŷi ← (yi − μB) / √(σB² + ε)

where ŷi are the normalized values of the inputs of the lth layer, while ε is a small number added to the denominator to guarantee numerical stability (it avoids division by zero). Note that simply normalizing each input of a layer can change what the layer can represent. Most activation functions present problems when the normalization is applied before the activation function: for example, for the sigmoid activation function, the normalized region is more linear than nonlinear. Therefore, it is necessary to perform a transformation that moves the distribution away from 0. We introduce, for each activation, a pair of parameters γ and β that, respectively, scale and translate each normalized value ŷi, as follows:

ỹi ← γ ŷi + β    (2.114)

(2.114)

These additional parameters are learned along with the weight parameters W of the network during the training process through backpropagation in order to improve accuracy and speedup the training phase. Furthermore, normalization does not alter the network’s representative power. If you decide not to use BN, it places γ = √ σ 2 +  and β = μ thus obtaining y˜ = yˆ that is operating as an identity function that returns the original values back. The use of BN has several advantages: 1. It improves the flow of the gradient through the net, in the sense that the descent of the gradient can reduce the oscillations when it approaches the minimum point and converge faster. 2. Reduces network convergence times by making the network layers independent of the dynamics of the input data of the first layers. 3. It is allowed to have a more high learning rate. On a traditional network, high values of the learning factor can scale the parameters that could amplify the gradients, thus leading to saturation. With BN, small changes in the parameter to a layer are not propagated to other layers. This allows higher learning factors to be used for optimizers, which otherwise would not have been possible. Moreover, it makes the propagation of the gradient in the network more stable as indicated above. 4. Reduces dependence on accurate initialization of weights. 5. It acts in some way as a form of regularization motivated by the fact that using reduced minibatch samples reduces the noise introduced in each layer of the network. 6. Thanks to the regularization effect introduced by BN, the need to use the dropout is reduced.



2.13.3.4 Monitoring of the Training Phase
Once the network architecture to be used has been chosen, great attention must be paid to the initialization phase, as discussed above for data and weights, and several network parameters must be monitored. A first check of the implemented model is the verification of the loss function. Set λ = 0 to turn off the regularization and check the value of the objective function computed by the network. If we are solving a 10-class problem, for example, the worst the network can do is to identify none of the classes in the output layer, that is, no output neuron realizes the maximum near the correct class; this corresponds to the value of the loss function expected before learning, which in the case of a 10-class problem is L = log(10) ≈ 2.3. Then the regularization term is activated with a very high λ, and one checks whether the loss function increases with respect to the value obtained previously (when λ was zero). The next test is to use a few dozen samples to check whether the model overfits, drastically reducing the loss function and reaching 100% accuracy. Subsequently, the optimal training parameter must be identified, choosing a certain number of learning rates randomly (according to a uniform distribution) in a defined interval, for example [10−6, 10−3], and starting several training sessions from scratch. The same can be done in parallel for the hyperparameter λ, which controls the regularization term. In this way, the accuracy obtained for each randomly chosen pair of values can be recorded, and λ and η chosen accordingly for the complete training phase. One can eventually wait before starting the training phase on the whole dataset, repeating the hyperparameter search and narrowing the search interval around the values identified in the previous phase (coarse-to-fine search); a sketch of such a random search is given below.
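The sketch below illustrates the coarse random search just described, sampling η in the suggested interval and λ log-uniformly; the search range for λ and the train_and_evaluate callback (standing in for a short training run returning validation accuracy) are assumptions of this example.

```python
import numpy as np

def random_search(train_and_evaluate, n_trials=20, seed=0):
    """Coarse random search over (eta, lambda): train briefly from scratch for
    each sampled pair and keep the one with the best validation accuracy."""
    rng = np.random.default_rng(seed)
    best = (None, None, -np.inf)
    for _ in range(n_trials):
        eta = 10 ** rng.uniform(-6, -3)     # learning rate in [1e-6, 1e-3]
        lam = 10 ** rng.uniform(-5, -1)     # regularization strength (assumed range)
        acc = train_and_evaluate(eta, lam)  # user-supplied short training + validation
        if acc > best[2]:
            best = (eta, lam, acc)
    return best

# Example with a dummy evaluation function standing in for a short training run
dummy = lambda eta, lam: 1.0 - abs(np.log10(eta) + 4.5) - abs(np.log10(lam) + 3)
print(random_search(dummy))
```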

2.13.4 Main Architectures of CNN Networks

Several convolutional network architectures have been realized; in addition to the LeNet described in the preceding paragraphs, the most common are the following.
AlexNet, described in [19], is a deeper and much wider version of LeNet. It was the winning entry of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) of 2012 and made convolutional networks popular in the Computer Vision field. It turned out to be an important breakthrough compared to previous approaches, and the spread of CNNs can be attributed to this work: it showed good performance and won by a wide margin the difficult visual recognition challenge on large datasets (ImageNet) in 2012.
ZFNet, described in [39], is an evolution of AlexNet, but better performing, and was the winner of the ILSVRC 2013 competition. Compared to AlexNet, the hyperparameters of the architecture were modified, in particular by expanding the size of the central convolutional layers and reducing the size of the stride and of the window in the first layer.
GoogLeNet, described in [40], differs from the previous ones for having introduced the inception module, which heavily reduces the number of network parameters, down to 4M compared to the 60M used in AlexNet. It was the winner of ILSVRC 2014.



VGGNet, described in [41], was ranked second in ILSVRC 2014. Its main contribution was the demonstration that the depth of the network (number of layers) is a critical component for achieving an optimal CNN.
ResNet, described in [42], was ranked first in ILSVRC 2015. The originality of this network is the absence of the final FC layers, the heavy use of batch normalization and, above all, the winning idea of the so-called identity shortcut connection, which skips the connections of one or more layers. ResNet is currently by far the most advanced of the CNN network models and is the default choice for using CNN networks. ResNet allows networks with up to hundreds or even thousands of layers to be trained and achieves high performance. Taking advantage of its powerful representation capacity, the performance of many artificial vision applications other than image classification, such as object detection and facial recognition [43,44], has been enhanced. The authors of ResNet, following the intuitions of the first version, have refined the residual block and proposed a pre-activation variant of the residual block [45], in which the gradients can flow without obstacles through the shortcut connections to any previous layer. They have shown experimentally that a 1001-layer deep ResNet can now be trained and outperforms its shallower counterparts. Thanks to its convincing results, ResNet has quickly become one of the strategic architectures in various artificial vision applications. A more recent evolution has modified the original architecture, renaming the network ResNeXt [46], by introducing a hyperparameter called cardinality, the number of independent paths, to provide a new way to adapt the capacity of the model. Experiments show that accuracy can be improved more efficiently by increasing the cardinality rather than by going deeper or wider. The authors state that, compared to the Inception module of GoogLeNet, this new architecture is easier to adapt to new datasets and applications, since it has a simple paradigm and only one hyperparameter to adjust, while the Inception module has many hyperparameters (such as the size of the convolution mask of each path) to fine-tune. A further evolution is described in [47,48] with a new architecture called DenseNet, which further exploits the effects of shortcut connections by connecting all the layers directly to each other. In this new architecture, the input of each layer consists of the feature maps of all the previous layers and its output is passed to each successive layer. The feature maps are aggregated with depth concatenation.
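As a minimal sketch of the pre-activation residual block described above (in the spirit of [42,45], assuming the PyTorch library; the channel count and the use of 3 × 3 convolutions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Minimal pre-activation residual block (BN -> ReLU -> Conv, twice),
    with the identity shortcut added to the block output."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x                       # identity shortcut connection

block = PreActResidualBlock(16)
y = block(torch.randn(1, 16, 32, 32))
print(y.shape)                               # torch.Size([1, 16, 32, 32])
```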

References

1. R.L. Hardy, Multiquadric equations of topography and other irregular surfaces. J. Geophys. Res. 76(8), 1905–1915 (1971)
2. J.D. Powell, Radial basis function approximations to polynomials, in Numerical Analysis, eds. by D.F. Griffiths, G.A. Watson (Longman Publishing Group, White Plains, NY, USA, 1987), pp. 223–241



3. C. Lanczos, A precision approximation of the gamma function. SIAM J. Numer. Anal. Ser. B 1, 86–96 (1964)
4. T. Kohonen, Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982)
5. C. Von der Malsburg, Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85–100 (1973)
6. S. Amari, Dynamics of pattern formation in lateral inhibition type neural fields. Biol. Cybern. 27, 77–87 (1973)
7. T. Kohonen, Self-Organizing Maps, 3rd edn. (Springer-Verlag, New York, Secaucus, NJ, USA, 2001). ISBN 3540679219
8. T. Kohonen, Learning vector quantization. Neural Netw. 1, 303 (1988)
9. T. Kohonen, Improved versions of learning vector quantization. Proc. Int. Joint Conf. Neural Netw. (IJCNN) 1, 545–550 (1990)
10. J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79, 2554–2558 (1982)
11. R.P. Lippmann, B. Gold, M.L. Malpass, A comparison of Hamming and Hopfield neural nets for pattern classification. Technical Report ESDTR-86-131,769 (MIT, Lexington, MA, 1987)
12. J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. 81, 3088–3092 (1984)
13. D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9(1), 147–169 (1985)
14. J.A. Hertz, A.S. Krogh, R. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, CA, USA, 1991). ISBN 0-201-50395-6
15. R. Rojas, Neural Networks: A Systematic Introduction (Springer Science and Business Media, Berlin, 1996)
16. P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, eds. by D.E. Rumelhart, J.L. McClelland, vol. 1, Chapter 6 (MIT Press, Cambridge, 1986), pp. 194–281
17. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
18. D.C. Ciresan, U. Meier, J. Masci, L.M. Gambardella, J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in International Joint Conference on Artificial Intelligence, pp. 1237–1242, 2011
19. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, eds. by F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger, vol. 25 (Curran Associates, Inc., Red Hook, NY, 2012), pp. 1097–1105
20. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
21. T. Mikolov, Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012
22. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
23. Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems, ed. by D.S. Touretzky, vol. 2 (Morgan-Kaufmann, Burlington, 1990), pp. 396–404



24. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
25. A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models. Int. Conf. Mach. Learn. 30, 3 (2013)
26. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in International Conference on Computer Vision (ICCV), 2015
27. D.A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in ICLR, 2016
28. X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks. AISTATS 9, 249–256 (2010)
29. D. Mishkin, J. Matas, All you need is a good init. CoRR (2015). http://arxiv.org/abs/1511.06422
30. P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolutional neural networks. CoRR (2015). http://arxiv.org/abs/1511.06856
31. D. Sussillo, Random walks: Training very deep nonlinear feed-forward networks with smart initialization. CoRR (2014). http://arxiv.org/abs/1412.6558
32. Q. Liao, T. Poggio, Theory of deep learning II: Landscape of the empirical risk in deep learning. Technical Report Memo No. 066 (Center for Brains, Minds and Machines (CBMM), 2017)
33. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). http://dl.acm.org/citation.cfm?id=1953048.2021068
34. S. Ruder, An overview of gradient descent optimization algorithms. CoRR (2016). http://arxiv.org/abs/1609.04747
35. D.P. Kingma, J. Ba, Adam: A method for stochastic optimization. CoRR (2014). http://arxiv.org/abs/1412.6980
36. V. Patel, Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016). https://doi.org/10.1137/15M1048239
37. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014). http://jmlr.org/papers/v15/srivastava14a.html
38. S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR (2015). http://arxiv.org/abs/1502.03167
39. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks. CoRR (2013). http://arxiv.org/abs/1311.2901
40. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in Computer Vision and Pattern Recognition (CVPR) (IEEE, Boston, MA, 2015). http://arxiv.org/abs/1409.4842
41. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. CoRR (2014). http://arxiv.org/abs/1409.1556
42. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. CoRR (2015). http://arxiv.org/abs/1512.03385
43. M. Del Coco, P. Carcagnì, M. Leo, P. Spagnolo, P.L. Mazzeo, C. Distante, Multi-branch CNN for multi-scale age estimation, in International Conference on Image Analysis and Processing, pp. 234–244, 2017
44. M. Del Coco, P. Carcagnì, M. Leo, P.L. Mazzeo, P. Spagnolo, C. Distante, Assessment of deep learning for gender classification on traditional datasets, in Advanced Video and Signal Based Surveillance (AVSS), pp. 271–277, 2016
45. K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks. CoRR (2016). http://arxiv.org/abs/1603.05027



46. S. Xie, R.B. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks. CoRR (2016). http://arxiv.org/abs/1611.05431
47. G. Huang, Z. Liu, K.Q. Weinberger, Densely connected convolutional networks. CoRR (2016). http://arxiv.org/abs/1608.06993
48. G. Huang, Y. Sun, Z. Liu, D. Sedra, K.Q. Weinberger, Deep networks with stochastic depth. CoRR (2016). http://arxiv.org/abs/1603.09382

3

Texture Analysis

3.1 Introduction

Texture is an important component for recognizing objects. In the field of image processing, the term texture is well established to denote any geometric and repetitive arrangement of the gray levels of an image [1]. In this field, texture becomes an additional strategic component for solving the problem of object recognition, the segmentation of images, and synthesis problems. Several studies have addressed the mechanisms of human visual perception of texture, to be emulated for the development of systems for the automatic analysis of the information content of an image (partitioning of the image into regions with different textures, reconstruction and orientation of surfaces, etc.).

3.2 The Visual Perception of the Texture

The human visual system easily determines and recognizes different types of textures, characterizing them in a subjective way. In fact, there is no general definition of texture, nor a method of measuring texture accepted by all. Without entering into the merits of our ability to perceive texture, our qualitative way of characterizing it with attributes such as coarse, granular, random, ordered, thread-like, dotted, fine-grained, etc., is well known. Several studies have shown [2,3] that the quantitative analysis of texture passes through statistical and structural relationships between the basic elements (known as texture primitives, also called texels) of what we call texture. Our visual system easily determines the relationships between the fundamental geometric structures that characterize a specific texture composed of macrostructures, such as the regular covering of a building or a floor. Similarly, our

© Springer Nature Switzerland AG 2020 A. Distante and C. Distante, Handbook of Image Processing and Computer Vision, https://doi.org/10.1007/978-3-030-42378-0_3

261

262

3 Texture Analysis

visual system is able to easily interpret a texture composed of microstructures, as is seen by observing from a satellite image various types of cultivations of a territory, or the microstructures associated with artifacts or those observed from a microscope image. Can we say that our visual system rigorously describes the microstructures of a texture? The answer does not seem obvious given that often the characterization of a texture is very subjective. We also add that often our subjectivity also depends on the conditions of visibility and the particular context. In fact, the observed texture characteristics of a building’s coating depends very much on the distance and lighting conditions that influence the perception and interpretation of what is observed. In other words, the interpretation of the texture depend very much on the scale factor with which the texture primitives are observed, on how they are illuminated, and on the size of the image considered. From the point of view of a vision machine, we can associate the definition of texture for an image as a particular repetitive arrangement of pixels with a given local variation of gray or color levels that identify one or more regions (textured regions). In Figs. 3.1 and 3.2 are shown, respectively, a remote sensing image of the territory and images acquired by the microscope that highlight some examples of textures related to microstructures and macrostructures, and images with natural and artificial textures. From these examples, we can observe the different complexity of textures, in particular in natural ones, where the distribution of texture components has an irregular stochastic arrangement. A criterion for defining the texture of a region is that of homogeneity which can be perceptually evaluated in terms of color or gray levels with spatial regularity. From the statistical point of view, homogeneity implies a stationary statistic, i.e., the statistical information of each texture is identical. This characteristic is correlated to the similarity of the structures as it happens when they have different scales, that despite not identical, present the same statistics. If we consider the criterion of spatial homogeneity, textures can be classified as homogeneous (when spatial structures are r egular ), poorly homogeneous (when structures repeat with some spatial variation) and inhomogeneous (that is, there are no similar structures that replicate spatially). The visual perception of texture has been studied at an interdisciplinary level by the cognitive science and psychophysiology community, exploring aspects of the neural processes involved in perception to understand the mechanisms of interpretation and segregation (seen as a perceptive characteristic sensitive to similarity) of the texture. The computational vision community has developed research to emulate the perceptual processes of texture to the computer, deriving appropriate physicalmathematical models to recognize classes of textures, useful for classification and segmentation processes.


Fig. 3.1 Remote sensing image

Fig. 3.2 Images acquired from the microscope (first row); natural images (the first three of the second row) and artificial images (the last two)

3.2.1 Julesz's Conjecture

The first computational studies of texture were conducted by Julesz [2,3]. Several experiments demonstrated the importance of perception through the statistical analysis of the image for various types of textures, in order to understand how low-level vision responds to variations in the order of the statistics. Various images with different statistics were used, with patterns (such as points, lines, and symbols) distributed and positioned randomly, each corresponding to a particular order statistic.


Fig. 3.3 Texture pairs with identical and different second-order statistics. In a the two textures have pixels with different second-order statistics and are easily distinguished by people; b the two textures are easily distinguishable but have the same second-order statistics; c textures with different second-order statistics that are not easy for a human observer to distinguish without closely scrutinizing the differences; d textures that share the same second-order statistics and that an observer does not immediately discriminate

Examples of the order statistics used are those of the first order (associated with contrast), of the second order to characterize homogeneity, and of the third order to characterize curvature. Textures with sufficient differences in brightness are easily separated into different regions with first-order statistics. Textures with differently oriented structures are also easily separable. Initially, Julesz found that textures with similar first-order statistics (gray-level histogram) but different second-order statistics (variance) are easily discriminable. However, no textures were found with identical statistics of the first and second order, but different statistics of the third order (moments), that could be discriminated. This led to Julesz's conjecture: textures with identical second-order statistics are indistinguishable. For example, the textures of Fig. 3.3a have different second-order statistics and are immediately distinguishable, while the textures of Fig. 3.3d share the same second-order statistics and are not easily distinguishable [4].

3.2.2 Texton Statistics

Later, Caelli, Julesz, and Gilbert [5] found textures with identical first- and second-order statistics, but different third-order statistics, that are distinguishable (see Fig. 3.3b) by pre-attentive visual perception, thus violating Julesz's conjecture. Further studies by Julesz [3] showed that the mechanisms of human visual perception do not necessarily use third-order statistics to distinguish textures with identical second-order statistics, but instead use second-order statistics of patterns (structures) called textons. Consequently, the previous conjecture was revised as follows: the human pre-attentive visual system does not compute statistical parameters of order higher than the second.


It is also stated that the pre-attentive human visual system¹ uses only the first-order statistics of these textons, which are, in fact, local structural features such as edges, oriented lines, and blobs.

¹ Stimulus processing does not always require the use of attentional resources. Many experiments have shown that the elementary characteristics of a stimulus derived from the texture (as happens for color, shape, and movement) are detected without the intervention of attention. The processing of the stimulus is therefore defined as pre-attentive. In other words, the pre-attentive process detects the most salient features of the texture very quickly, and only afterwards does focused attention complete the recognition of the particular texture (or of an object in general).

3.2.3 Spectral Models of the Texture

Psychophysiological research has provided evidence that the human brain performs a spectral analysis of the retinal image, which can be emulated on the computer by modeling it with a filter bank [6]. This research has motivated the development of mathematical models of texture perception based on appropriate filters. Bergen [7] suggests that the texture present in an image can be decomposed into a series of image sub-bands using a bank of linear filters with different scales and orientations. Each sub-band image is associated with a particular type of texture. In essence, the texture is characterized by the empirical distribution of the filter responses, and similarity metrics (distances between distributions) can be used to discriminate potential differences between textures.
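To make the filter-bank idea concrete, the following Python sketch (illustrative, not the specific model of [6,7]; all function names and parameter values are assumptions) convolves an image with a small bank of Gabor kernels at a few scales and orientations and summarizes each sub-band by its mean absolute response.

```python
# A minimal sketch of filter-bank texture features: the image is convolved with
# Gabor kernels at a few scales and orientations, and each sub-band is
# summarized by the mean absolute response.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(sigma, theta, wavelength, size=31):
    """Real (even) Gabor kernel with the given scale, orientation and wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    return envelope * carrier

def filter_bank_features(image, sigmas=(2.0, 4.0), n_orient=4):
    """Return one energy value per (scale, orientation) sub-band."""
    image = image.astype(float)
    feats = []
    for sigma in sigmas:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            kern = gabor_kernel(sigma, theta, wavelength=4.0 * sigma)
            response = fftconvolve(image, kern, mode='same')
            feats.append(np.mean(np.abs(response)))  # empirical response energy
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(128, 128))     # stand-in for a texture patch
    print(filter_bank_features(img))
```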

3.3 Texture Analysis and its Applications

The research described in the previous section did not lead to a rigorous definition of texture, but it did lead to a variety of approaches (physical-mathematical models) for analyzing and interpreting texture. In developing a vision machine, texture analysis (which involves a myriad of properties) can be guided by the application and may be required at various stages of the automatic vision process, for example, in segmentation to detect homogeneous regions, in the feature extraction phase, or in the classification phase, where texture characteristics can provide useful information for object recognition. Given the complexity of the information content of the local structures of a texture, which can be expressed in several perceptive terms (color, light intensity, density, orientation, spatial frequency, linearity, random arrangement, etc.), texture analysis methods can be grouped into four classes:

1. Statistical approach. The texture is expressed as a quantitative measure of the local distribution of pixel intensity. This local distribution of pixels generates statistical and structural relationships between the basic elements, known as texture primitives (texels). The characterization of texels can be dominated by local structural properties, in which case we obtain a symbolic description of the image in terms of microstructures and macrostructures modeled from primitives. For images with natural textures (for example, satellite images of the territory), primitives are characterized by local pixel statistics, derived from the gray-level co-occurrence matrix (GLCM) [8].
2. Structural approach. The texture can be described with 2D patterns through a set of primitives that capture local geometric structures (textons) according to certain placement rules [8]. If the primitives are well defined and completely characterize the texture (in terms of micro- and macrostructures), we obtain a symbolic description of the image from which it can be reconstructed. This approach is more effective for synthesis applications than for classification. Other structural approaches are based on mathematical morphology operators [9]. This approach is very useful for detecting defects characterized by microstructures in manufactured objects.
3. Stochastic approach. Compared with the two previous approaches, it is more general. The texture is assumed to derive from a realization of a stochastic process that captures the information content of the image. The non-deterministic properties of the spatial distribution of texture structures are determined by estimating the parameters of the stochastic model. There are several methods for defining the model that governs the stochastic process; parameter estimation can be done with Monte Carlo methods, maximum likelihood, or nonparametric methods.
4. Spectral approach. Based on the classical Fourier and wavelet transforms, in whose domains the image is represented in terms of frequencies that characterize the textures. The wavelet transform [10] and Gabor filters are used to maintain the spatial information of the texture [11]. Normally, filter banks [12] are used to better capture the various frequencies and orientations of the texture structures. Other approaches, based on fractal models [13], have been used to better characterize natural textures, although they have the drawback of losing local information and the orientation of texture structures. Another method of texture analysis is to compare the characteristics of the observed texture with a previously synthesized texture pattern.

These texture analysis approaches are used in several application contexts: extraction of image features described in terms of texture properties; segmentation, where each region is characterized by a homogeneous texture; classification, to determine homogeneous classes of texture; reconstruction of the surface of objects (shape from texture) starting from the information (such as density, dimension, and orientation) associated with the macrostructures of the texture; and synthesis, to create extensive textures starting from small texture samples, very useful in rendering applications (computational graphics). For this last aspect, compared with classification and segmentation, the synthesis of the texture requires a richer characterization of the texture in terms of description of the details (accurate discrimination of the structures).


The classification of texture concerns assigning particular texture regions to one of several predefined texture classes. This type of analysis can be seen as an alternative to supervised classification and cluster analysis techniques, or as a complement to further refine the classes found with these techniques. For example, a satellite image can be classified with texture analysis techniques, or classified with clustering methods (see Chap. 1) and then further investigated to search for detailed subclasses that characterize a given texture, for example, in the case of a forest, cropland, or other regions. The statistical approach is particularly suitable when the texture consists of very small and complex elementary primitives, typical of microstructures. When a texture is composed of primitives of large dimensions (macrostructures), it is essential to first identify the elementary primitives, that is, to evaluate their shape and their spatial relationships. These measures are often altered by the effect of perspective, which modifies the shape and size of the elementary structures, and by the lighting conditions.

3.4 Statistical Texture Methods

The statistical approach is particularly suitable for the analysis of microstructures. The elementary primitives of microstructures are determined by analyzing the texture characteristics associated with a few pixels in the image. Recall that, for the characterization of the texture, it is important to derive the spatial dependence of the pixels and their gray-level values. The distribution of gray levels in a histogram is, by itself, not sufficient to characterize texture because it does not contain any spatial information.

3.4.1 First-Order Statistics

First-order statistics measure the likelihood of observing a gray value at a random position in the image. They can be calculated from the histogram of the gray levels of the image pixels and depend only on the single gray level of a pixel, not on its co-occurring interaction with the surrounding pixels. The average intensity of the image is an example of a first-order statistic. Recall that from the histogram we can derive the approximate probability density² p(L) of the occurrence of an intensity level L (see Sect. 6.9 Vol. I), given by

p(L) = H(L)/N_tot,   L = 0, 1, 2, ..., L_max     (3.1)

where H(L) indicates the frequency of pixels with gray level L and N_tot is the total number of pixels in the image (or a portion thereof). We know that from the shape of the histogram we can get information on the characteristics of the image. A narrow peak in the level distribution indicates an image with low contrast, while several isolated peaks suggest the presence of different homogeneous regions that differ from the background. The parameter that characterizes the first-order statistics is the average μ, calculated as follows:

μ = Σ_{L=0}^{L_max} L · p(L)     (3.2)

² It is assumed that L is a random variable that expresses the gray level of an image deriving from a stochastic process.

3.4.2 Second-Order Statistics

To obtain useful parameters (image features) from the histogram, one can derive quantitative information from the first-order statistical properties of the image. In essence, we can derive the central moments (see Sect. 8.3.2 Vol. I) of the probability density function p(L) and characterize the texture with the following measures:

μ₂ = σ² = Σ_{L=0}^{L_max} (L − μ)² p(L)     (3.3)

where μ₂ is the central moment of order 2, that is, the variance, traditionally indicated with σ². The average describes the position and the variance describes the dispersion of the distribution of levels. The variance in this case provides a measure of the contrast of the image and can be used to express a measure of relative smoothness S, given by

S = 1 − 1/(1 + σ²)     (3.4)

which becomes 0 for regions with constant gray levels, while it tends to 1 in rough areas (where the variance is very large).

3.4.3 Higher Order Statistics

The central moments of higher order (normally not greater than the 6th), associated with the probability density function p(L), can characterize some properties of the texture. The n-th central moment in this case is given by

μ_n = Σ_{L=0}^{L_max} (L − μ)ⁿ p(L)     (3.5)

For natural textures, the measurements of asymmetry (skewness) S and kurtosis K are useful; they are based, respectively, on the moments μ₃ and μ₄ and are given by

S = n₃ = μ₃/σ³ = ( Σ_{L=0}^{L_max} (L − μ)³ p(L) ) / σ³     (3.6)

K = n₄ = μ₄/σ⁴ = ( Σ_{L=0}^{L_max} (L − μ)⁴ p(L) ) / σ⁴     (3.7)

The measure S is zero if the histogram H(L) is symmetrical with respect to the average μ; it assumes positive or negative values according to how the histogram is deformed with respect to the symmetric shape, that is, toward the right (shape shifted toward values higher than the average) or toward the left (shape shifted toward values lower than the average). If the histogram has a normal distribution, S is zero, and consequently any symmetric histogram will have a value of S that tends to zero. The measure K indicates whether the shape of the histogram is flat or peaked with respect to a normal distribution. In other words, if K has a high value, the histogram has a peak near the average value and decays quickly at both ends. A histogram with small K tends to have a flat top near the mean value rather than a sharp peak; a uniform histogram represents the extreme case. If the histogram has a normal distribution, K takes the value three. Often the kurtosis is indicated with K_e = K − 3, so that it is zero for a histogram with normal distribution. Other useful texture measures, known as energy E and entropy H, are given by the following:

E = Σ_{L=0}^{L_max} [p(L)]²     (3.8)

H = − Σ_{L=0}^{L_max} p(L) log₂ p(L)     (3.9)

The energy measures the homogeneity of the texture, computed from the probability distribution of the gray levels. Entropy is a quantity that in this case measures the level of disorder of a texture; its maximum value is reached when the gray levels are all equiprobable (that is, p(L) = 1/(L_max + 1)). Characteristic measures of local texture, called module and state, based on the local histogram [14], can be derived considering a window with N pixels centered at the pixel (x, y). The module I_MH is defined as follows:

I_MH = Σ_{L=0}^{L_max} ( H(L) − N/N_Liv ) / ( H(L)[1 − p(L)] + (N/N_Liv)(1 − 1/N_Liv) )     (3.10)

where N_Liv = L_max + 1 indicates the number of levels in the image. The state of the histogram is the gray level that corresponds to the highest frequency in the local histogram.
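The histogram-based measures of Sects. 3.4.1–3.4.3 can be computed directly from the normalized histogram, as in the following Python sketch (function and variable names are illustrative, not taken from the text).

```python
# A minimal sketch of the histogram-based measures of Eqs. (3.1)-(3.9).
import numpy as np

def first_order_features(image, n_levels=256):
    hist, _ = np.histogram(image, bins=n_levels, range=(0, n_levels))
    p = hist / hist.sum()                      # Eq. 3.1: approximate density p(L)
    L = np.arange(n_levels)
    mu = np.sum(L * p)                         # Eq. 3.2: mean
    var = np.sum((L - mu) ** 2 * p)            # Eq. 3.3: variance
    sigma = np.sqrt(var) + 1e-12
    return {
        "mean": mu,
        "variance": var,
        "smoothness": 1.0 - 1.0 / (1.0 + var),                 # Eq. 3.4
        "skewness": np.sum((L - mu) ** 3 * p) / sigma ** 3,    # Eq. 3.6
        "kurtosis": np.sum((L - mu) ** 4 * p) / sigma ** 4,    # Eq. 3.7
        "energy": np.sum(p ** 2),                              # Eq. 3.8
        "entropy": -np.sum(p[p > 0] * np.log2(p[p > 0])),      # Eq. 3.9
    }

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    patch = rng.integers(0, 256, size=(64, 64))   # stand-in texture patch
    print(first_order_features(patch))
```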

3.4.4 Second-Order Statistics with Co-Occurrence Matrix

The characterization of the texture based on the second-order statistics derived in Sect. 3.4.2 has the advantage of computational simplicity but does not include spatial pixel information. More robust texture measurements can be derived according to Julesz's studies (Sect. 3.2) on the human visual perception of texture. From these studies, it emerges that two textures having identical second-order statistics are not easily distinguishable at a first observation level. Although this conjecture has been both refuted and reaffirmed over time, the importance of second-order statistics is well established. Therefore, the main statistical method used for texture analysis is based on the joint probability of the distribution of pixel pairs, defined as the likelihood of observing a pair of gray levels measured at the ends of a segment randomly positioned and randomly oriented in the image. This definition leads to the co-occurrence matrix used to extract the texture features proposed by Haralick [1], known as GLCM (Gray-Level Co-occurrence Matrix). The spatial relation of the gray levels is expressed by the co-occurrence matrix P_R(L₁, L₂), which represents the two-dimensional histogram of the gray levels of the image, considered as an estimate of the joint probability that a pair of pixels has gray levels L₁ and L₂ and satisfies a spatial relation of distance R, for example, that the two pixels are at a distance d = (d_x, d_y) from each other, expressed in pixels. Each element of this two-dimensional histogram indicates the joint frequency (co-occurrence) in the image of pairs of gray levels L₁ = I(x, y) and L₂ = I(x + d_x, y + d_y), in which the two pixels are separated by the displacement d defined by the shift along the rows d_x (downward, vertical) and the shift along the columns d_y (to the right, horizontal), expressed in pixels. Given an image I of size M × M, the co-occurrence matrix P can be formulated as follows:

P_{(d_x,d_y)}(L₁, L₂) = Σ_{x=1}^{M} Σ_{y=1}^{M} { 1 if I(x, y) = L₁ and I(x + d_x, y + d_y) = L₂; 0 otherwise }     (3.11)

Fig. 3.4 Directions and distances of the pairs of pixels considered for the calculation of the co-occurrence matrix

The size of P depends on the number of levels in the image I. A binary image generates a 2 × 2 matrix, while an RGB color image would require a matrix of 2²⁴ × 2²⁴; typically, grayscale images with at most 256 levels are used. The spatial relationship between pixel pairs defined in terms of the increments (d_x, d_y) generates co-occurrence matrices that are sensitive to image rotation. Except for rotations of 180°, any other rotation would generate a different distribution of P. To obtain invariance to rotation, the co-occurrence matrices for texture analysis are calculated considering rotations of 0°, 45°, 90°, and 135°. For this purpose, it is useful to define co-occurrence matrices of the type P_R(L₁, L₂), where the spatial relation R = (θ, d) indicates the co-occurrence of pixel pairs at distance d in the direction θ (see Fig. 3.4). These matrices can be calculated to characterize texture microstructures by considering the unit distance between pairs, that is, d = 1. The matrices considered for rotation invariance are then: P_{(d_x,d_y)} = P_{(0,1)} ⟺ P_{(θ,d)} = P_{(0°,1)} (horizontal direction); P_{(d_x,d_y)} = P_{(−1,1)} ⟺ P_{(θ,d)} = P_{(45°,1)} (right diagonal direction); P_{(d_x,d_y)} = P_{(−1,0)} ⟺ P_{(θ,d)} = P_{(90°,1)} (upward vertical direction); and P_{(d_x,d_y)} = P_{(−1,−1)} ⟺ P_{(θ,d)} = P_{(135°,1)} (left diagonal direction). Figure 3.5 shows the GLCM matrices calculated for the 4 angles indicated above, for a 4 × 4 test image with maximum level L_max = 4. Each matrix has size N × N with N = L_max + 1 = 5. Consider the element P_{(0°,1)}(2, 1), which has value 3: this means that in the test image there are 3 horizontally adjacent pairs of pixels with gray values L₁ = 2 (the pixel under consideration) and L₂ = 1 (the adjacent co-occurring pixel). Similarly, examining the opposite direction, there are adjacent pairs of pixels with values (1, 2) with frequency 3. It follows that this matrix is symmetric, like the other three calculated. In general, however, the co-occurrence matrix is not always symmetric, that is, it is not always true that P(L₁, L₂) = P(L₂, L₁). The symmetric co-occurrence matrix S_d(L₁, L₂) is defined as the sum of the matrix P_d(L₁, L₂) associated with the distance vector d and the matrix associated with the vector −d:

S_d(L₁, L₂) = P_d(L₁, L₂) + P_{−d}(L₁, L₂)     (3.12)

Fig. 3.5 Calculation of 4 co-occurrence matrices for the 4 × 4 test image with 5 gray levels, for the following directions: a 0°, b +45°, c +90°, and d +135°. The distance is d = 1 pixel. It is also shown how the element P_{(0°,1)}(2, 1) = 3 corresponds to the three pairs of pixels in the image with L₁ = 2, L₂ = 1 arranged spatially according to the relation (0°, 1)

By using symmetric matrices one has the advantage of operating on a smaller number of elements, N(N + 1)/2 instead of N², where N is the number of levels of the image. The texture structure can be observed directly in the co-occurrence matrix by analyzing the frequencies of the pixel pairs calculated for a given spatial relationship. For example, if the distribution of gray levels in the image is random, we will have a co-occurrence matrix with scattered frequency values. If the pairs of pixels in the image are instead highly correlated, the elements of P_d(L₁, L₂) will be concentrated around the main diagonal with high frequency values. The fine structures of the texture are identified by analyzing pairs of pixels with a small distance vector d, while coarse structures are characterized by large values of d. The quantitative texture information depends on the number of image levels. Normally the number of gray levels of an image is N = 256, but to reduce the computational load the image is often quantized into a smaller number of levels, accepting some loss of texture information. Figure 3.6 shows some co-occurrence matrices derived from images with different types of textures. We observe compact frequency distributions along the main diagonal for images with high spatial pixel correlation, compared with images with very complex textures and poor spatial correlation between pixels, whose frequencies are densely distributed over the entire matrix (the image of the circuit in the example).

3.4.5 Texture Parameters Based on the Co-Occurrence Matrix

The co-occurrence matrix captures some properties of a texture, but the texture is not normally characterized by using the elements of this matrix directly. From the co-occurrence matrix, some significant parameters T_i are derived to describe a texture more compactly. Before describing these parameters, it is convenient to normalize the co-occurrence matrix by dividing each of its elements P_{θ,d}(L₁, L₂) by the total sum of the frequencies of all pairs of pixels spatially related by R(θ, d). In this way, the


Fig. 3.6 Examples of co-occurrence matrices calculated on 3 different images with different textures: a a complex texture of an electronic board, with few homogeneous regions of small size; b a less complex texture with larger macrostructures and greater correlation between pixels; and c a more homogeneous texture with the distribution of frequencies concentrated along the main diagonal (strongly correlated pixels)

co-occurrence matrix P_{θ,d} becomes the estimated joint probability p_{θ,d}(L₁, L₂) of the occurrence in the image of the pair of pixels with levels L₁ and L₂, at distance d from each other in the direction indicated by θ. The normalized co-occurrence matrix is defined as follows:

p_{θ,d}(L₁, L₂) = P_{θ,d}(L₁, L₂) / Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} P_{θ,d}(L₁, L₂)     (3.13)

where the joint probabilities assume values between 0 and 1. From the normalized co-occurrence matrix given by (3.13), the following characteristic texture parameters T_i are derived.

3.4.5.1 Energy or Measure of the Degree of Homogeneity of the Texture

The energy is a parameter that corresponds to the second-order angular moment:

Energy = T₁ = Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} p²_{θ,d}(L₁, L₂)     (3.14)


Higher energy values correspond to very homogeneous textures, i.e., the differences in gray values are almost zero for most pixel pairs (for example, at a distance of 1 pixel). For low energy values, the gray-level differences are more evenly distributed spatially.

3.4.5.2 Entropy

The entropy is a parameter that measures the randomness of the distribution of the gray levels in the image:

Entropy = T₂ = − Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} p_{θ,d}(L₁, L₂) · log₂[p_{θ,d}(L₁, L₂)]     (3.15)

It is observed that entropy is high when each element of the co-occurrence matrix has an equal value, that is, when the probabilities p(L₁, L₂) are equidistributed. Entropy has low values if the co-occurrence matrix is diagonal, i.e., there are spatially dominant gray-level pairs for a certain direction and distance.

3.4.5.3 Maximum Probability

T₃ = max_{(L₁,L₂)} p_{θ,d}(L₁, L₂)     (3.16)

3.4.5.4 Contrast

The contrast is a parameter that measures the local variation of the gray levels of the image. It corresponds to the moment of inertia:

T₄ = Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} (L₁ − L₂)² p_{θ,d}(L₁, L₂)     (3.17)

A low value of the contrast is obtained if the image has almost constant gray levels; conversely, high values are obtained for images with strong local variations of intensity, that is, with a very pronounced texture.

3.4.5.5 Moment of the Inverse Difference

T₅ = Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} p_{θ,d}(L₁, L₂) / (1 + (L₁ − L₂)²)   with L₁ ≠ L₂     (3.18)

3.4.5.6 Absolute Value

T₆ = Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} |L₁ − L₂| p_{θ,d}(L₁, L₂)     (3.19)


3.4.5.7 Correlation

T₇ = ( Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} (L₁ − μ_x)(L₂ − μ_y) p_{θ,d}(L₁, L₂) ) / (σ_x σ_y)     (3.20)

where the means μ_x and μ_y and the standard deviations σ_x and σ_y are related to the marginal probabilities p_x(L₁) and p_y(L₂). The latter correspond, respectively, to the rows and columns of the co-occurrence matrix p_{θ,d}(L₁, L₂), and are defined as follows:

p_x(L₁) = Σ_{L₂=0}^{L_max} p_{θ,d}(L₁, L₂)    p_y(L₂) = Σ_{L₁=0}^{L_max} p_{θ,d}(L₁, L₂)

Therefore, the averages are calculated as follows:

μ_x = Σ_{L₁=0}^{L_max} L₁ p_x(L₁)    μ_y = Σ_{L₂=0}^{L_max} L₂ p_y(L₂)     (3.21)

and the standard deviations σ_x, σ_y are calculated as follows:

σ_x = √( Σ_{L₁=0}^{L_max} p_x(L₁)(L₁ − μ_x)² )    σ_y = √( Σ_{L₂=0}^{L_max} p_y(L₂)(L₂ − μ_y)² )     (3.22)

The texture characteristics T = {T₁, T₂, ..., T₇}, calculated from the gray-level co-occurrence matrix, can be effective in describing complex microstructures of limited size, sparse and variously oriented in the image. Figure 3.7 shows examples of different textures characterized by contrast, energy, homogeneity, and entropy measurements. These measurements are obtained by calculating the co-occurrence matrices with a distance of 1 pixel and a direction of 0°.

Fig. 3.7 Results of texture measurements obtained by calculating co-occurrence matrices on 7 × 7 windows of 3 test images. For each image the following measurements have been calculated: a Contrast, b Energy, c Homogeneity, and d Entropy. For all measurements the distance is 1 pixel and the direction is 0°


The resulting images for the 4 measurements are obtained by calculating the co-occurrence matrices on 7 × 7 windows centered on each pixel of the input image. For reasons of space, results for different window sizes are not reported; the window size depends on the resolution level at which the texture is to be analyzed and on the application context. The set of these measures T_i can be used to model particular textures to be considered as prototypes, to be compared with the unknown ones extracted from an image, evaluating the level of similarity with a metric (for example, the Euclidean distance). The texture measurements T_i can be weighted differently according to the level of correlation observed. Additional texture measurements can be derived from co-occurrence matrices [15,16]. An important aspect in the extraction of the texture characteristics from the co-occurrence matrix concerns the appropriate choice of the distance vector d. A possible solution [17] is the chi-square statistical test χ², choosing the values of d that best highlight the texture structures by maximizing the value

χ²(d) = Σ_{L₁=0}^{L_max} Σ_{L₂=0}^{L_max} p_d²(L₁, L₂) / (p_x(L₁) p_y(L₂)) − 1     (3.23)

From the implementation point of view, the extraction of texture characteristics based on co-occurrence matrices requires considerable memory and computation. However, solutions exist in the literature that quantize the image to a few gray levels, thus reducing the dimensionality of the co-occurrence matrix, while taking care to limit the resulting degradation of the structures. In addition, fast ad hoc algorithms have been proposed [18]. The complexity of the co-occurrence matrix, in terms of memory and computation, increases further when color images are handled.
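As an illustration of Eqs. (3.11)–(3.18), the following Python sketch (illustrative names, not the book's code) builds a co-occurrence matrix for a given displacement and derives a few of the parameters T_i; it is a didactic implementation, not optimized like the fast algorithms cited in [18].

```python
# A minimal sketch of a GLCM for a displacement (dx, dy) and a few of its features.
import numpy as np

def glcm(image, dx, dy, n_levels, symmetric=True, normalize=True):
    P = np.zeros((n_levels, n_levels), dtype=float)
    rows, cols = image.shape
    for x in range(rows):
        for y in range(cols):
            xs, ys = x + dx, y + dy
            if 0 <= xs < rows and 0 <= ys < cols:
                P[image[x, y], image[xs, ys]] += 1       # Eq. 3.11
    if symmetric:
        P = P + P.T                                      # Eq. 3.12
    if normalize:
        P = P / P.sum()                                  # Eq. 3.13
    return P

def glcm_features(p):
    L1, L2 = np.meshgrid(np.arange(p.shape[0]), np.arange(p.shape[1]), indexing="ij")
    nz = p[p > 0]
    return {
        "energy": np.sum(p ** 2),                                   # Eq. 3.14
        "entropy": -np.sum(nz * np.log2(nz)),                       # Eq. 3.15
        "max_probability": p.max(),                                 # Eq. 3.16
        "contrast": np.sum((L1 - L2) ** 2 * p),                     # Eq. 3.17
        "inverse_difference": np.sum(p / (1.0 + (L1 - L2) ** 2)),   # Eq. 3.18
    }

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    img = rng.integers(0, 8, size=(64, 64))      # image quantized to 8 levels
    p = glcm(img, dx=0, dy=1, n_levels=8)        # horizontal direction, d = 1
    print(glcm_features(p))
```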

3.5 Texture Features Based on Autocorrelation

A feature of the texture can be evaluated by spatial frequency analysis, which identifies the repetitive spatial structures of the texture. Texture primitives characterized by fine structures exhibit high spatial frequencies, while primitives with larger structures result in low spatial frequencies. The autocorrelation function of an image can be used to evaluate spatial frequencies, that is, to measure the level of homogeneity or roughness (fineness/coarseness) of the texture present in the image. With the autocorrelation function (see Sect. 6.10.2 Vol. I) we measure the level of spatial correlation between neighboring pixels seen as texture primitives (gray-level values). The spatial arrangement of the texture is described by the correlation coefficient, which measures the linear spatial relationship between pixels (primitives). For an image f(x, y) of size N × N with L gray levels, the autocorrelation function ρ_f(d_x, d_y) is given by:

Fig. 3.8 Autocorrelation function along the x axis for 4 different textures (Texture 1–4)

ρ_f(d_x, d_y) = ( Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f(r, c) · f(r + d_x, c + d_y) ) / ( N² · Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f²(r, c) )     (3.24)

where d_x, d_y = −N + 1, −N + 2, ..., 0, ..., N − 1 are the displacements, respectively, in the x and y directions with respect to the pixel f(x, y). The size of the autocorrelation function will be (2N − 1) × (2N − 1). In essence, with the autocorrelation function the inner product is calculated between the original image f(x, y) and the image translated to the position (x + d_x, y + d_y), for different displacements. The goal is the detection of repetitive pixel structures. If the primitives of the texture are rather large, the value of the autocorrelation function decreases slowly with increasing distance d = (d_x, d_y) (presence of coarse texture with low spatial frequencies); if it decreases quickly, we are in the presence of fine textures with high spatial frequencies (see Fig. 3.8). If instead the texture has primitives arranged periodically, the function reproduces this periodicity, increasing and decreasing (peaks and valleys) according to the spacing of the texture. The autocorrelation function can be seen as a particular case of convolution,³ that is, as the convolution of an input function with itself, but without spatially reversing the second function around the origin as in convolution. It follows that, to obtain the autocorrelation, one slides f(r − d_x, c − d_y) with respect to f(d_x, d_y) and sums their products. Furthermore, if the function f is real, the convolution and the autocorrelation differ only in the change of the sign of the argument.

³ Recall that the 2D convolution operator g(x, y) between two functions f(x, y) and h(x, y) is defined as: g(x, y) = f(x, y) ∗ h(x, y) = Σ_r Σ_c f(r, c) h(x − r, y − c).


In this context the autocorrelation ρ_f would be given by

ρ_f(d_x, d_y) = f(d_x, d_y) ∗ f(−d_x, −d_y) = Σ_{r=0}^{N−1} Σ_{c=0}^{N−1} f(r, c) f(r + d_x, c + d_y)     (3.25)

An immediate way to calculate the autocorrelation function is obtained by virtue of the convolution theorem (see Sect. 9.11.3 Vol. I), which states: the Fourier transform of a convolution is the product of the Fourier transforms of each function. In this case, applied to (3.25), we will have

F{ρ_f(d_x, d_y)} = F{f(d_x, d_y) ∗ f(−d_x, −d_y)} = F(u, v) · F(−u, −v) = F(u, v) · F*(u, v) = |F(u, v)|²     (3.26)

where the symbol F{•} indicates the Fourier transform operation and F indicates the Fourier transform of the image f. It is observed that the complex conjugate of a real function does not influence the function itself. In essence, the Fourier transform of the autocorrelation function F{ρ_f(d_x, d_y)}, defined by (3.26), represents the power spectrum P_f(u, v) = |F(u, v)|² of the function f(d_x, d_y), where |•| is the modulus of a complex number. Therefore, taking the inverse Fourier transform of (3.26), we obtain the original expression of the autocorrelation function (3.25):

ρ_f(d_x, d_y) = F⁻¹{|F(u, v)|²}     (3.27)

We can then say that the autocorrelation of a function is the inverse Fourier transform of the power spectrum. Moreover, if f is real, the autocorrelation is also real and symmetric: ρ_f(−d_x, −d_y) = ρ_f(d_x, d_y). Figure 3.8 shows the autocorrelation function calculated with (3.27) for four different types of textures.
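The following Python sketch illustrates Eq. (3.27): the autocorrelation is obtained as the inverse Fourier transform of the power spectrum. Note that the FFT yields the circular autocorrelation, so the normalization differs slightly from Eq. (3.24); all names are illustrative.

```python
# A minimal sketch of Eq. (3.27): autocorrelation via the power spectrum.
import numpy as np

def autocorrelation_fft(image):
    f = image.astype(float)
    f = f - f.mean()                      # optional: remove the DC component
    F = np.fft.fft2(f)
    power = np.abs(F) ** 2                # |F(u, v)|^2
    rho = np.fft.ifft2(power).real        # Eq. 3.27 (circular autocorrelation)
    rho = np.fft.fftshift(rho)            # put zero displacement at the center
    return rho / rho.max()                # normalize so that rho(0, 0) = 1

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # periodic test texture: vertical stripes plus noise
    x = np.arange(128)
    img = 128 + 100 * np.sin(2 * np.pi * x / 16)[None, :] + rng.normal(0, 5, (128, 128))
    rho = autocorrelation_fft(img)
    print(rho.shape, rho[64, 64])         # peak at zero displacement
```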

3.6 Texture Spectral Method

An alternative method for measuring the spatial frequencies of the texture is based on the Fourier transform. From Fourier theory, it is known that many real surfaces can be represented in terms of sinusoidal basis functions. In the Fourier spectral domain, it is possible to characterize the texture present in the image in terms of the energy distributed along the basis vectors. The analysis of the texture in the spectral domain is effective when it is composed of repetitive structures, however oriented. This approach has already been used to improve image quality (see Chap. 9 Vol. I) and for noise removal (Chap. 4 Vol. II) using filtering techniques in the frequency domain. In this context, the characterization of the texture in the spectral domain takes place


Fig. 3.9 Images with 4 different types of textures and their power spectra

by analyzing the peaks, which give the orientation information of the texture, and the location of the peaks, which provides the spatial periodicity information of the texture structures. Statistical texture measurements (described above) can be derived after filtering the periodic components. The first methods that made use of these spectral features divide the frequency domain into concentric rings (based on the frequency content) and into sectors (based on oriented structures). The spectral domain is, therefore, divided into regions, and the total energy of each region is taken as the feature characterizing the texture. Let us denote by F(u, v) the Fourier transform of the image f(i, j) whose texture is to be measured, and by |F(u, v)|² the power spectrum (the symbol |•| represents the modulus of a complex number), which we know coincides with the Fourier transform of the autocorrelation function ρ_f. Figure 3.9 shows 4 images with different types of textures and their power spectra. It can be observed how linear vertical and horizontal texture structures, and curved ones, are arranged in the spectral domain, respectively, horizontally, vertically, and circularly. The more texture information is present in the image, the more extended is the energy distribution. This shows that it is possible to derive the texture characteristics from the energy distribution in the power spectrum. In particular, the spectral characteristics of the texture are obtained by dividing the Fourier domain into concentric circular regions of radius r, whose energy characterizes the level of fineness/roughness of the texture (high energy at large values of r, that is, at high frequencies, implies the presence of fine structures, while high energy at small values of r, that is, at low frequencies, implies the presence of coarse structures). The energy evaluated in sectors of the spectral domain, identified by the angle θ, reflects the directionality characteristics of the texture. In fact, for the second and third images of Fig. 3.9, we have a localized energy distribution in the sectors in the ranges 40°–60° and 130°–150°, corresponding, respectively, to the texture of the spaces between the inclined bricks and to the inclined curved strips present in the third image. The rest of the energy is distributed across all sectors and corresponds to the variability of the gray levels of the bricks and streaks.


The functions that can, therefore, be extracted by ring (centered at the origin) are

t_{r₁r₂} = Σ_{θ=0}^{π} Σ_{r=r₁}^{r₂} |F(r, θ)|²     (3.28)

and for orientation

t_{θ₁θ₂} = Σ_{θ=θ₁}^{θ₂} Σ_{r=0}^{R} |F(r, θ)|²     (3.29)

where r and θ represent the polar coordinates of the power spectrum

r = √(u² + v²),   θ = arctan(v/u)     (3.30)

The power spectrum |F(r, θ)|² is expressed in polar coordinates (r, θ) and, considering its symmetry with respect to the origin (u, v) = (0, 0), only the upper half-plane of frequencies (above the u axis) is analyzed. It follows that the polar coordinates r and θ vary, respectively, in the range r = 0, ..., R, where R is the maximum radius of the outer ring, and from θ = 0° to 180°. From the functions t_{r_i,r_j} and t_{θ_l,θ_k}, n_a × n_s texture measures T_{m,n} (m = 1, ..., n_a; n = 1, ..., n_s) can be defined, sampling the entire spectrum in n_a rings and n_s radial sectors, as shown in Fig. 3.10. As an alternative to Fourier, other transforms can be used to characterize the texture. The choice must be made in relation to the better invariance of the texture characteristics with respect to noise. The most appropriate choice is to consider combined spatial and spectral characteristics to describe the texture.
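A minimal Python sketch of the ring and wedge energy features of Eqs. (3.28)–(3.29) is given below; the number of rings and sectors and all names are illustrative assumptions.

```python
# A minimal sketch of ring/sector spectral energy features (Eqs. 3.28-3.29),
# computed on a discrete power spectrum.
import numpy as np

def spectral_features(image, n_rings=4, n_sectors=4):
    f = image.astype(float) - float(image.mean())
    F = np.fft.fftshift(np.fft.fft2(f))
    power = np.abs(F) ** 2                                  # |F(u, v)|^2
    rows, cols = power.shape
    v, u = np.mgrid[-(rows // 2):rows - rows // 2, -(cols // 2):cols - cols // 2]
    r = np.sqrt(u ** 2 + v ** 2)                            # Eq. 3.30
    theta = np.mod(np.arctan2(v, u), np.pi)                 # upper half-plane only
    r_max = r.max()
    ring_energy = [power[(r >= r_max * k / n_rings) &
                         (r < r_max * (k + 1) / n_rings)].sum()
                   for k in range(n_rings)]                 # Eq. 3.28
    sector_energy = [power[(theta >= np.pi * k / n_sectors) &
                           (theta < np.pi * (k + 1) / n_sectors)].sum()
                     for k in range(n_sectors)]             # Eq. 3.29
    return np.array(ring_energy), np.array(sector_energy)

if __name__ == "__main__":
    x = np.arange(128)
    stripes = 128 + 100 * np.sin(2 * np.pi * x / 8)[None, :] * np.ones((128, 1))
    rings, sectors = spectral_features(stripes)
    print("ring energies:", rings)
    print("sector energies:", sectors)
```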


Fig. 3.10 Textural features from the power spectrum. On the left, the spectrum is partitioned into circular rings, each representing a frequency band (from zero to the maximum frequency); on the right, the subdivision of the spectrum into circular sectors used to obtain the direction information of the texture in terms of directional energy distribution


3.7 Texture Based on the Edge Metric

Texture measurements can be calculated by analyzing pixel contour elements, lines, and edges, which constitute the components of the local and global texture structures of an image. The edge metric can be calculated from the gradient g(i, j) estimated for each pixel of the image f(i, j), by selecting an appropriate edge operator with an appropriate kernel size W or distance d between adjacent pixels. Texture measurements can be characterized by the modulus g_m(i, j) and the direction g_θ of the gradient, in relation to the type of texture to be analyzed. A texture feature can be expressed in terms of the edge density present in a window (for example, 3 × 3). For this purpose it is necessary to apply to the input image f(i, j) one of the known algorithms for edge extraction (for example, the Laplace operator, or other operators described in Chap. 1 Vol. II) to produce a map of edges B(i, j), with B(i, j) = 1 if there is an edge and B(i, j) = 0 otherwise. Normally, the map B is binarized with very low threshold values to define the edge pixels. An edge density measurement is then given as follows:

T_D(i, j) = (1/W²) Σ_{l=−W}^{W} Σ_{k=−W}^{W} B(i + l, j + k)     (3.31)

where W is the size of the square window of interest. Another feature of the texture is the edge contrast, calculated as the local average of the edge magnitude of the image:

T_C(i, j) = mean_{(i,j)∈W} {B(i, j)}     (3.32)

where W indicates the image window on which the average value of the magnitude is calculated. A high texture contrast occurs at the maximum values of the magnitude. The contrast expressed by (3.32) can be normalized by dividing it by the maximum pixel value in the window. The edge density obtained with (3.31) has the problem of finding an adequate threshold to extract the edges. This is not always easy, considering that the threshold is applied to the entire image and is often chosen by trial and error. Instead of extracting the edges by first calculating the gradient magnitude with an edge operator, an alternative is to calculate the gradient g_d(i, j) as an approximation based on the distance d between adjacent pixels for a defined window. The procedure involves two steps:

1. Calculation of the texture description function g_d(i, j), depending on the distance d, for all pixels of the textured image. This is done by calculating directly from the input image f(i, j), as the distance d varies, the approximated gradient

   g_d(i, j) = |f(i, j) − f(i + d, j)| + |f(i, j) − f(i − d, j)| + |f(i, j) − f(i, j + d)| + |f(i, j) − f(i, j − d)|     (3.33)


2. The texture measure T(d), based on the density of the edges, is given by the mean value of the gradient g_d(i, j) for a given distance d (for example, d = 1):

   T(d) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} g_d(i, j)     (3.34)

where N × N is the pixel size of the image. The micro- and macrostructures of the texture are evaluated through the edge density T(d) as a function of the distance d. This implies that the dimensionality of the feature vector depends on the number of distances d considered. The microstructures of the image are detected for small values of d, while the macrostructures are determined for large values (normally d assumes values from 1 to 10, yielding from 1 to 10 edge density features) [19]. It can be verified that the function T(d) behaves like the negative of the autocorrelation function, with inverted peaks: its minimum corresponds to the maximum of the autocorrelation function, while its maximum corresponds to the minimum of the autocorrelation function.

A measure based on edge randomness is expressed as the Shannon entropy of the gradient magnitude:

T_Er = Σ_{i=1}^{N} Σ_{j=1}^{N} g_m(i, j) log₂ g_m(i, j)     (3.35)

A measure based on edge directionality is expressed as the Shannon entropy of the gradient direction:

T_Eθ = Σ_{i=1}^{N} Σ_{j=1}^{N} g_θ(i, j) log₂ g_θ(i, j)     (3.36)

Other measures of the periodicity and linearity of the edges are calculated using the direction of the gradient, respectively, through the co-occurrence of pairs of edges with identical orientation and the co-occurrence of pairs of collinear edges (for example, edge 1 and edge 2 both with direction ←, or with opposite directions ← →).
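The distance-dependent edge density measure of Eqs. (3.33)–(3.34) can be sketched in Python as follows (illustrative names; image borders are handled loosely). Coarse block-like textures produce lower T(d) at small d than fine, noisy textures, which is what makes the curve T(d) informative.

```python
# A minimal sketch of the distance-dependent edge density measure, Eqs. (3.33)-(3.34).
import numpy as np

def edge_density_features(image, distances=(1, 2, 3, 4, 5)):
    f = image.astype(float)
    feats = []
    for d in distances:
        g = np.zeros_like(f)
        # Eq. 3.33: sum of absolute differences with the 4 neighbors at distance d
        g[d:, :]  += np.abs(f[d:, :]  - f[:-d, :])   # |f(i, j) - f(i - d, j)|
        g[:-d, :] += np.abs(f[:-d, :] - f[d:, :])    # |f(i, j) - f(i + d, j)|
        g[:, d:]  += np.abs(f[:, d:]  - f[:, :-d])   # |f(i, j) - f(i, j - d)|
        g[:, :-d] += np.abs(f[:, :-d] - f[:, d:])    # |f(i, j) - f(i, j + d)|
        feats.append(g.mean())                       # Eq. 3.34: T(d)
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    fine = rng.integers(0, 256, (64, 64))                            # fine, noisy texture
    coarse = np.kron(rng.integers(0, 256, (8, 8)), np.ones((8, 8)))  # coarse blocks
    print("fine   T(d):", edge_density_features(fine))
    print("coarse T(d):", edge_density_features(coarse))
```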

3.8 Texture Based on the Run Length Primitives

This method characterizes texture information by detecting sequences of pixels in the same direction with the same gray level (primitives). The length of these primitives (run length) characterizes fine and coarse texture structures. The texture measurements are expressed in terms of gray level, length, and direction of the primitives.


Fig. 3.11 Example of calculating GLRLM matrices for horizontal direction and at 45◦ for a test image with gray levels between 0 and 3

These primitives in fact represent pixels belonging to oriented segments of a certain length and with the same gray level. In particular, this information is described by GLRLM (Gray Level Run Length Matrix) matrices, which report how many times sequences of consecutive pixels with identical gray level appear in a given direction [8,20]. In essence, any matrix defined for a given direction θ of the primitives (also called runs) can be seen as a two-dimensional histogram where each of its elements p_θ(z, r), identified by the gray level z and by the length r of the primitives, represents the frequency of these primitives in an image with a maximum of L gray levels and dimensions M × N. Therefore, a GLRLM matrix has dimensions L × R, where L is the number of gray levels and R is the maximum length of the primitives. Figure 3.11 shows an example of a GLRLM matrix calculated for a 5 × 5 image with only 4 gray levels. Normally, for an image, 4 GLRLM matrices are calculated for the directions 0°, 45°, 90°, and 135°. To obtain a rotation-invariant matrix p(z, r), the GLRLM matrices can be summed. Several texture measures are then extracted from the statistics of the primitives captured by the rotation-invariant p(z, r). The original 5 texture measurements [20] are derived from the following 5 statistics:

1. Short Run Emphasis (SRE)

   T_SRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/r²     (3.37)

   where N_r indicates the total number of primitives

   N_r = Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)

   This feature measure emphasizes short run lengths.


2. Long Run Emphasis (LRE)

   T_LRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r²     (3.38)

   This feature measure emphasizes long run lengths.

3. Gray-Level Nonuniformity (GLN)

   T_GLN = (1/N_r) Σ_{z=1}^{L} ( Σ_{r=1}^{R} p(z, r) )²     (3.39)

   This texture measure evaluates the distribution of the runs over the gray values. The value of the feature is low when the runs are evenly distributed among the gray levels.

4. Run Length Nonuniformity (RLN)

   T_RLN = (1/N_r) Σ_{r=1}^{R} ( Σ_{z=1}^{L} p(z, r) )²     (3.40)

   This texture measure evaluates the distribution of the primitives in relation to their length. The value of T_RLN is low when the primitives are equally distributed along their lengths.

5. Run Percentage (RP)

   T_RP = N_r / (M · N)     (3.41)

   This feature measure evaluates the ratio between the number of realized runs and the maximum number of potential runs.

The measures above mostly emphasize the length of the primitives (i.e., the vector p_r(r) = Σ_{z=1}^{L} p(z, r), which represents the distribution of the number of primitives having length r), without considering the gray-level information expressed by the vector p_z(z) = Σ_{r=1}^{R} p(z, r), which represents the distribution of the number of primitives having gray level z [21]. To consider also the gray-level information, two new measures have been proposed [22]:

6. Low Gray-Level Run Emphasis (LGRE)

   T_LGRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/z²     (3.42)

   This texture measure is the analogue of the SRE based on the gray level of the runs: instead of short primitives, those with low gray levels are emphasized.


7. High Gray-Level Run Emphasis (HGRE)

   T_HGRE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · z²     (3.43)

   This texture measure is the analogue of the LRE based on the gray level of the runs: instead of long primitives, those with high gray levels are emphasized.

Subsequently, by combining the statistics associated with the length of the primitives and the gray level, 4 further measures have been proposed [23]:

8. Short Run Low Gray-Level Emphasis (SRLGE)

   T_SRLGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r)/(z² · r²)     (3.44)

   This texture measure emphasizes the primitives that accumulate in the upper left part of the GLRLM matrix, where the primitives with short length and low gray levels are found.

9. Short Run High Gray-Level Emphasis (SRHGE)

   T_SRHGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · z²/r²     (3.45)

   This texture measure emphasizes the primitives that accumulate in the lower left part of the GLRLM matrix, where the primitives with short length and high gray levels are found.

10. Long Run Low Gray-Level Emphasis (LRLGE)

   T_LRLGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r²/z²     (3.46)

   This texture measure emphasizes the primitives that accumulate in the upper right part of the GLRLM matrix, where the primitives with long length and low gray levels are found.

11. Long Run High Gray-Level Emphasis (LRHGE)

   T_LRHGE = (1/N_r) Σ_{z=1}^{L} Σ_{r=1}^{R} p(z, r) · r² · z²     (3.47)

   This texture measure emphasizes the primitives that accumulate in the lower right part of the GLRLM matrix, where the primitives with long length and high gray levels are found.


Table 3.1 Texture measurements derived from the GLRLM matrices according to the 11 statistics given by Eqs. (3.37)–(3.47), calculated for the images of Fig. 3.12

Fig. 3.12 Images with natural textures that include fine and coarse structures

Table 3.1 reports the results of the 11 texture measures described above applied to the images of Fig. 3.12. The GLRLM matrices were calculated by rescaling the images to 16 levels, and the statistics were extracted from the matrix p(z, r) obtained by summing the 4 directional matrices.
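As an illustration of the run-length statistics, the following Python sketch (horizontal direction only; illustrative names) builds a GLRLM and computes the SRE, LRE, and RP measures of Eqs. (3.37), (3.38), and (3.41).

```python
# A minimal sketch of a gray-level run length matrix p(z, r) and three statistics.
import numpy as np

def glrlm_horizontal(image, n_levels):
    rows, cols = image.shape
    p = np.zeros((n_levels, cols), dtype=float)   # z = 0..n_levels-1, r = 1..cols
    for row in image:
        run_val, run_len = row[0], 1
        for v in row[1:]:
            if v == run_val:
                run_len += 1
            else:
                p[run_val, run_len - 1] += 1
                run_val, run_len = v, 1
        p[run_val, run_len - 1] += 1              # close the last run of the row
    return p

def run_length_stats(p, n_pixels):
    n_r = p.sum()                                 # total number of runs N_r
    r = np.arange(1, p.shape[1] + 1, dtype=float)
    pr = p.sum(axis=0)                            # distribution over run lengths
    return {
        "SRE": np.sum(pr / r**2) / n_r,           # Eq. 3.37
        "LRE": np.sum(pr * r**2) / n_r,           # Eq. 3.38
        "RP":  n_r / n_pixels,                    # Eq. 3.41
    }

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    img = rng.integers(0, 4, size=(16, 16))       # image quantized to 4 levels
    p = glrlm_horizontal(img, n_levels=4)
    print(run_length_stats(p, img.size))
```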

3.9 Texture Based on MRF, SAR, and Fractals Models

Let us now see some model-based methods that were originally developed for texture synthesis. When an analytical description of the texture is possible, it can be modeled by some characteristic parameters that are subsequently used for the analysis of the texture itself. If this is possible, these parameters are used to describe the texture and to obtain its representation (synthesis). The most widespread texture modeling is that of the discrete Markov Random Field (MRF), which is well suited to represent the local structural information of an image [24] and to classify the


texture [25]. These models are based on the hypothesis that the intensity of each pixel of the image depends only on the intensity of the pixels in its neighborhood, up to an additive noise term. With this model, each pixel of the image f(i, j) is modeled as a linear combination of the intensity values of neighboring pixels plus an additive noise n:

f(i, j) = Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j)     (3.48)

where W indicates the window, that is, the set of pixels in the neighborhood of the current pixel (i, j) (the centered window is almost always of size 3 × 3), and n(i, j) is normally considered random Gaussian noise with zero mean and variance σ². In this MRF model, the parameters are represented by the weights h(l, k) and by the noise n(l, k), which are calculated with the least-squares approach, i.e., they are estimated by minimizing the error E expressed by the following functional:

E = Σ_{(i,j)} [ f(i, j) − ( Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j) ) ]²     (3.49)

The texture of the model image is completely described by these parameters, which are subsequently compared with those estimated from the observed image to determine the texture class. A method similar to MRF is the Simultaneous Autoregressive (SAR) model [26], which also uses the spatial relationship between neighboring pixels to characterize the texture and classify it. The SAR model is expressed by the following relationship:

f(i, j) = μ + Σ_{(l,k)∈W} f(i + l, j + k) · h(l, k) + n(i, j)     (3.50)

where h and n, conditioned by W, are still the model parameters characterizing the spatial dependence of the pixel under examination with respect to the neighboring pixels, while \mu is a bias term given by the mean intensity of the input image. Also in this case, all the parameters of the model (\mu, \sigma, h(l, k), n(i, j), and the window W) can be estimated for a given image window using the least squares error (LSE) approach or the maximum likelihood estimation (MLE) approach. For both models, MRF and SAR, the texture characteristics are expressed by the parameters of the model (excluding \mu), used in the application contexts of segmentation and classification. A variant of the basic SAR model is reported in [27] to make the texture features (the parameters of the model) invariant to rotation and scale change.

The fractal models [13] are used when some local image structures remain similar to themselves (self-similarity) when observed at different scales. Mandelbrot [28] proposed fractal geometry to explain the structures of the natural world. Given a closed set A in the n-dimensional Euclidean space, it is said to have the


property of being self-similar when A is the union of N distinct (non-overlapping) copies of itself, each scaled down by a scale factor r. With this model, a texture is characterized by the fractal dimension D, which is given by the equation

D = \frac{\log N}{\log(1/r)}     (3.51)

The fractal dimension is useful for characterizing the texture, and D expresses a measure of surface roughness. Intuitively, the larger the fractal dimension, the rougher the surface. In [13] it is shown that images of various natural textures can be modeled with spatially isotropic fractals. Generally, the texture of many natural surfaces cannot be modeled with deterministic fractal models because of statistical variations. It follows that the estimation of the fractal dimension of an image is difficult. There are several methods for estimating the parameter D, one of which is described in [29] as follows. Given the closed set A, we consider windows of side L_max such as to cover the set A. A version of A scaled down by a factor r will result in N = 1/r^D similar sets. This new set can be enclosed by windows of side L = r L_max, and therefore their number is related to the fractal dimension D:

N(L) = \frac{1}{r^D} = \left( \frac{L_{max}}{L} \right)^D     (3.52)

The fractal dimension is, therefore, estimated from equation (3.52) as follows. For a given value of L, the n-dimensional space is divided into squares of side L and the number of squares covering A is counted. The procedure is repeated for different values of L, and the value of the fractal dimension D is then estimated from the slope of the line

\ln N(L) = -D \ln(L) + D \ln(L_{max})     (3.53)

which can be calculated using a linear least-squares fit of the available data, i.e., of the plot of ln(L) against ln(N(L)). An improvement of the previous method is suggested in [30], in which we assume to estimate the fractal dimension of the surface of the image A. Let p(m, L) be the probability that there are m intensity points within a square window of side L centered at a random position on the image A; we have

p(m, L) = \frac{n(m, L) \cdot m}{M}

where n(m, L) is the number of windows containing m points and M is the total number of pixels in the image. When overlapping windows of size L are placed on the image, the value (M/m) P(m, L) represents the expected number of windows with m points inside. The expected number of windows covering the entire image is given by

E[N(L)] = M \sum_{m=1}^{N} (1/m) P(m, L)     (3.54)

The expected value of N(L) is proportional to L^{-D} and, therefore, can be used to estimate the fractal dimension D. In [29] it has been shown that the fractal dimension is not sufficient to capture all the textural properties of an image. In fact, there may be textures that are visually different but have similar fractal dimensions. To obviate this drawback a measure was introduced, called lacunarity,4 which captures textural properties in accordance with human perception. The measure is defined by

\Lambda = E\left[ \left( \frac{M}{E(M)} - 1 \right)^2 \right]     (3.55)

where M is the mass (understood as the number of pixel entities) of the fractal set and E(M) its expected value. This quantity measures the discrepancy between the actual mass and the expected mass. Lacunarity values are small when the texture is fine, while large values correspond to coarse textures. The mass of the fractal set is related to the length L in the following way:

M(L) = K L^D     (3.56)

The probability distribution P(m, L) can be used to calculate the lacunarity as follows. Let

M(L) = \sum_{m=1}^{N} m P(m, L) \qquad M^2(L) = \sum_{m=1}^{N} m^2 P(m, L)

then the lacunarity is defined as:

\Lambda(L) = \frac{M^2(L) - (M(L))^2}{(M(L))^2}     (3.57)

It is highlighted that M(L) and M^2(L) are, respectively, the first and second moment of the probability distribution P(m, L). This lacunarity measurement of the image is used as a feature of the texture for segmentation and classification purposes.

4 Lacunarity, originally introduced by Mandelbrot [28], is a term in fractal geometry that refers to a measure of how patterns fill space. Geometric objects appear more lacunar if they contain a wide range of empty spaces or holes (gaps). Consequently, lacunarity can be thought of as a measure of the "gaps" present, for example, in an image. Note that high-lacunarity images that are heterogeneous at small scales can be quite homogeneous at larger scales, or vice versa. In other words, lacunarity is a scale-dependent measure of the spatial complexity of patterns. In the fractal context lacunarity, being also a measure of spatial heterogeneity, can be used to distinguish between images that have similar fractal dimensions but look different from each other.
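As a concrete illustration, the following minimal Python sketch estimates the fractal dimension of a binary set A by box counting, fitting the line (3.53), and computes a lacunarity value from the distribution of box masses. Note that, as a simplification, the plain (unweighted) distribution of box masses is used here, whereas the chapter's P(m, L) weights each mass by m, so the numbers are indicative only; function names and the list of box sizes are illustrative assumptions.

import numpy as np

def box_counting_dimension(A, sizes=(2, 4, 8, 16, 32)):
    # A: binary (0/1) image; assumed non-empty at every scale in `sizes`
    counts = []
    for L in sizes:
        h = (A.shape[0] // L) * L
        w = (A.shape[1] // L) * L
        blocks = A[:h, :w].reshape(h // L, L, w // L, L)
        counts.append(blocks.any(axis=(1, 3)).sum())   # boxes covering A
    # fit ln N(L) = -D ln L + const, Eq. (3.53)
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope

def lacunarity(A, L):
    # distribution of the mass m (pixels of A) over non-overlapping L x L boxes
    h = (A.shape[0] // L) * L
    w = (A.shape[1] // L) * L
    m = A[:h, :w].reshape(h // L, L, w // L, L).sum(axis=(1, 3)).ravel().astype(float)
    return (np.mean(m**2) - np.mean(m)**2) / (np.mean(m)**2)   # cf. Eq. (3.57)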


3.10 Texture by Spatial Filtering

The texture characteristics can be determined by spatial filtering (see Sect. 9.9.1 Vol. I) by choosing a filter impulse response that effectively accentuates the texture's microstructures. For this purpose Laws [31] proposed texture measurements based on the convolution of the image f(i, j) with filtering masks h(i, j) of dimensions 5 × 5, which represent the impulse responses of filters designed to detect the different characteristics of textures in terms of uniformity, density, granularity, disorder, directionality, linearity, roughness, frequency, and phase. From the results of the convolutions g_e = f(i, j) \ast h_e(i, j), with the various masks, the relative texture measurements T_e are calculated, which express the energy of the detected texture microstructures, such as edges, wrinkles, homogeneous, point-like, and spot structures. This diversity of texture structures is captured with different convolution operations using appropriate masks defined as follows. One starts with three simple 1D masks

L3 = [ 1  2  1 ]     L - Level
E3 = [-1  0  1 ]     E - Edge          (3.58)
S3 = [-1  2 -1 ]     S - Spot

where L3 represents a local mean filter, E3 an edge detector filter (first difference), and S3 a spot detector filter (second difference). Through the convolution of these masks with themselves and with each other, the following basic 1D masks of size 5 × 1 are obtained:

L5 = L3 \ast L3 = [ 1  2  1] \ast [ 1  2  1] = [ 1  4  6  4  1]     (3.59)
E5 = L3 \ast E3 = [ 1  2  1] \ast [-1  0  1] = [-1 -2  0  2  1]     (3.60)
S5 = L3 \ast S3 = [ 1  2  1] \ast [-1  2 -1] = [-1  0  2  0 -1]     (3.61)
R5 = S3 \ast S3 = [-1  2 -1] \ast [-1  2 -1] = [ 1 -4  6 -4  1]     (3.62)
W5 = E3 \ast (-S3) = [-1  0  1] \ast [ 1 -2  1] = [-1  2  0 -2  1]     (3.63)

These basic 5 × 1 masks represent, respectively, the smoothing filter (e.g., Gaussian) L5, the edge detector (e.g., gradient) E5, the spot detector (e.g., Laplacian of Gaussian, LoG) S5, the crest (ripple) detector R5, and the wave-structure detector W5. From these basic masks the two-dimensional 5 × 5 masks are derived through the outer product between the same 1D masks and between different pairs. For example, the masks E5L5 and L5E5 are obtained from the outer product, respectively, between E5 and L5 and between L5 and E5, as follows:

E5L5 = E5^T \times L5 = \begin{bmatrix} -1 \\ -2 \\ 0 \\ 2 \\ 1 \end{bmatrix} \times [1\;4\;6\;4\;1] = \begin{bmatrix} -1 & -4 & -6 & -4 & -1 \\ -2 & -8 & -12 & -8 & -2 \\ 0 & 0 & 0 & 0 & 0 \\ 2 & 8 & 12 & 8 & 2 \\ 1 & 4 & 6 & 4 & 1 \end{bmatrix}     (3.64)

L5E5 = L5^T \times E5 = \begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix} \times [-1\;-2\;0\;2\;1] = \begin{bmatrix} -1 & -2 & 0 & 2 & 1 \\ -4 & -8 & 0 & 8 & 4 \\ -6 & -12 & 0 & 12 & 6 \\ -4 & -8 & 0 & 8 & 4 \\ -1 & -2 & 0 & 2 & 1 \end{bmatrix}     (3.65)

The mask E5L5 detects the horizontal edges and simultaneously performs a local average in the same direction, while the mask L5E5 detects the vertical edges. The number of Laws 2D masks that can be obtained is 25, useful for extracting the different texture structures present in the image. The essential steps of the Laws algorithm for extracting texture characteristics based on local energy are the following:

1. Removal of lighting variations. Optional pre-processing step of the input image f(i, j) which removes the effects of lighting variations. The initial value of each pixel is replaced by subtracting from it the average of the local pixels included in a window of appropriate dimensions (for natural scenes normally 15 × 15) centered on it.

2. Filtering of the pre-processed image. The pre-processed image f(i, j) is filtered using the 25 convolution masks 5 × 5 previously calculated as outer products of the one-dimensional masks L5, E5, S5, R5, W5 given by the equations from (3.59) to (3.63). For example, considering the mask E5L5 given by (3.64), we get the filtered image g_{E5L5}(i, j) as follows:

g_{E5L5}(i, j) = f(i, j) \ast E5L5     (3.66)

In reality, of the 25 images filtered with the corresponding 25 2D masks, those useful for extracting the texture characteristics are 24, since the image g_{L5L5} filtered with the mask L5L5 (Gaussian smoothing filter) is not considered. Furthermore, to simplify, the one-dimensional basic mask W5 given by (3.63) can be excluded; in this case the 2D masks are reduced to 16 and consequently there would be 16 filtered images g.

3. Calculation of the texture energy images. From the filtered images g the texture energy images T(i, j) are calculated. One approach for calculating the value of each pixel T(i, j) is to consider the sum of the absolute values of the pixels close to the pixel under examination (i, j) in g, belonging to the window of dimensions (2W+1) × (2W+1). Therefore, a generic texture energy image is given by

T(i, j) = \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n)|     (3.67)


With (3.67) we will have the set of 25 texture energy images {T_{X5X5}(i, j)}_{X=L,E,S,R,W} if the 5 basic 1D masks are used, or 16 if only the first 4 are used, i.e., L5, E5, S5, R5. A second approach for the calculation of the texture energy images is to consider, instead of the sum of absolute values, the square root of the sum of the squared values of the neighboring pixels, as follows:

T(i, j) = \sqrt{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2(m, n) }     (3.68)

4. Normalization of the texture energy images. Optionally, the set of energy images T(i, j) can be normalized with respect to the image g_{L5L5} obtained with the Gaussian convolution filter L5L5, the only filter with nonzero sum (see the filter L5 given by (3.59)), while all the others have zero sum, thus avoiding amplifying or attenuating the energy of the system (see, for example, the filter E5L5 given by (3.64)). Therefore, the new set of texture energy images, indicated with \hat{T}(i, j), is given as follows:

\hat{T}(i, j) = \frac{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n)| }{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g_{L5L5}(m, n)| }
\qquad
\hat{T}(i, j) = \frac{ \sqrt{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2(m, n) } }{ \sqrt{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2_{L5L5}(m, n) } }     (3.69)

Alternatively, the normalized texture energy measurements can also be expressed in terms of standard deviation, calculated as

\hat{T}(i, j) = \frac{1}{(2W+1)^2} \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n) - \mu(m, n)|     (3.70)

\hat{T}(i, j) = \frac{1}{(2W+1)} \sqrt{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} [g(m, n) - \mu(m, n)]^2 }     (3.71)

where \mu(i, j) is the local average of the texture measure g(i, j), relative to the window (2W+1) × (2W+1) centered on the pixel being processed (i, j), estimated by

\mu(i, j) = \frac{1}{(2W+1)^2} \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g_W(m, n)     (3.72)


5. Significant texture energy images. From the original image f(i, j) we have the set {T_{X5X5}(i, j)}_{X=L,E,S,R,W} of 25 energy images, or the reduced set {T_{X5X5}(i, j)}_{X=L,E,S,R} of 16 energy images. The energy image T_{L5L5}(i, j) is not meaningful to characterize the texture, unless we want to consider the contrast of the texture. The remaining 24 or 15 energy images can be further reduced by combining some symmetrical pairs, replacing them with the average of their sum. For example, we know that T_{E5L5} and T_{L5E5} represent the energy of vertical and horizontal structures (rotation-variant), respectively. If they are added, the result is the energy image T_{E5L5/L5E5} corresponding to the edge magnitude (a rotation-invariant texture measurement). The other energy images T_{E5E5}, T_{S5S5}, T_{R5R5}, and T_{W5W5} are used directly (rotation-invariant measures). Therefore, using the 5 one-dimensional bases L5, E5, S5, R5, W5, after the combination we have the following 14 energy images:

T_{E5L5/L5E5}  T_{S5L5/L5S5}  T_{W5L5/L5W5}  T_{R5L5/L5R5}  T_{E5E5}
T_{S5E5/E5S5}  T_{W5E5/E5W5}  T_{R5E5/E5R5}  T_{S5S5}  T_{W5S5/S5W5}
T_{R5S5/S5R5}  T_{W5W5}  T_{R5W5/W5R5}  T_{R5R5}     (3.73)

while, using only the first 4 one-dimensional bases, we have the following 9 energy images:

T_{E5L5/L5E5}  T_{S5L5/L5S5}  T_{R5L5/L5R5}  T_{E5E5}  T_{S5E5/E5S5}
T_{R5E5/E5R5}  T_{S5S5}  T_{R5R5}  T_{R5S5/S5R5}     (3.74)

In summary, to characterize various types of texture with the Laws method, we have 14 or 9 energy images, i.e., for each pixel of the input image f(i, j) we have 14 or 9 texture measurements. These texture measurements are used in different applications for image segmentation and classification. In relation to the nature of the texture, to better characterize the microstructures present at various scales, it is useful to verify the impact of the size of the filtering masks on the discriminating power of the texture measurements T. In fact, the Laws method has also been experimented using the one-dimensional 3 × 1 masks given in (3.58), from which the two-dimensional 3 × 3 masks were derived through the outer product between the same 1D masks and between different pairs. In this case, the one-dimensional mask relating to the R3 crests is excluded, as it cannot be reproduced in a 3 × 3 mask, and the derivable 2D masks are the following:


L3L3 = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} \quad
L3E3 = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \quad
L3S3 = \begin{bmatrix} -1 & 2 & -1 \\ -2 & 4 & -2 \\ -1 & 2 & -1 \end{bmatrix}

E3L3 = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \quad
E3E3 = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{bmatrix} \quad
E3S3 = \begin{bmatrix} -1 & 2 & -1 \\ 0 & 0 & 0 \\ 1 & -2 & 1 \end{bmatrix}

S3L3 = \begin{bmatrix} -1 & -2 & -1 \\ 2 & 4 & 2 \\ -1 & -2 & -1 \end{bmatrix} \quad
S3E3 = \begin{bmatrix} -1 & 0 & 1 \\ 2 & 0 & -2 \\ -1 & 0 & 1 \end{bmatrix} \quad
S3S3 = \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}     (3.75)

With the 3 × 3 masks, after the combinations of the symmetrical masks, the following 5 energy images are available:

T_{E3L3/L3E3}  T_{S3L3/L3S3}  T_{E3E3}  T_{E3S3/S3E3}  T_{S3S3}     (3.76)
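The Laws pipeline sketched in the steps above can be prototyped in a few lines; the following Python sketch (NumPy and SciPy assumed) builds the 25 masks of size 5 × 5 as outer products of the 1D vectors (3.59)-(3.63), filters the image, and computes the local energy of (3.67) on a 7 × 7 window. Function and variable names are illustrative; a rotation-invariant set is then obtained by averaging symmetric pairs such as E5L5 and L5E5, as in step 5.

import numpy as np
from scipy.ndimage import convolve, uniform_filter

VECTORS = {
    'L5': np.array([ 1.,  4., 6.,  4.,  1.]),
    'E5': np.array([-1., -2., 0.,  2.,  1.]),
    'S5': np.array([-1.,  0., 2.,  0., -1.]),
    'R5': np.array([ 1., -4., 6., -4.,  1.]),
    'W5': np.array([-1.,  2., 0., -2.,  1.]),
}

def laws_energy_images(f, window=7):
    f = f.astype(float)
    f = f - uniform_filter(f, size=15)          # step 1: remove lighting variations
    energies = {}
    for a, va in VECTORS.items():
        for b, vb in VECTORS.items():
            mask = np.outer(va, vb)             # 5x5 2D mask, e.g. E5L5
            g = convolve(f, mask, mode='reflect')          # step 2: filtering
            # step 3: local sum of |g| over the window, Eq. (3.67)
            energies[a + b] = uniform_filter(np.abs(g), size=window) * window**2
    return energies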

Laws energy masks have been applied to the images in Fig. 3.12, and Table 3.2 reports for each image the texture measurements derived from the 9 significant energy images obtained by applying the process described above. The 9 energy images reported in (3.74) were used. A window of size 7 × 7 was used to estimate the local energy measurements with (3.68), which were then normalized with respect to the original image (smoothed with a mean filter). Using a larger window does not change the results and would result in a larger computational time. Laws tested the proposed method on a sample mosaic of Brodatz texture fields, obtaining about 90% correct identification. Laws' texture measurements have also been extended for the volumetric analysis of 3D textures [32]. In analogy with the Laws masks, Haralick proposed the masks used for edge extraction for the measurement of texture, derived from the following basic masks:

Table 3.2 Texture measurements related to the images in Fig. 3.12, derived from the 9 energy images of (3.74)

Image       Tex_1   Tex_2   Tex_3   Tex_4   Tex_5   Tex_6   Tex_7
L5E5/E5L5   1.3571  2.0250  0.5919  0.9760  0.8629  1.7940  1.2034
L5R5/R5L5   0.8004  1.2761  0.2993  0.6183  0.4703  0.7778  0.5594
E5S5/S5E5   0.1768  0.2347  0.0710  0.1418  0.1302  0.1585  0.1281
S5S5        0.0660  0.0844  0.0240  0.0453  0.0455  0.0561  0.0441
R5R5        0.1530  0.2131  0.0561  0.1040  0.0778  0.1659  0.1068
L5S5/S5L5   0.8414  1.1762  0.3698  0.6824  0.6406  0.9726  0.7321
E5E5        0.4756  0.6873  0.2208  0.4366  0.3791  0.4670  0.3986
E5R5/R5E5   0.2222  0.2913  0.0686  0.1497  0.1049  0.1582  0.1212
S5R5/R5S5   0.0903  0.1178  0.0285  0.0580  0.0445  0.0713  0.0523

h_1 = \frac{1}{3}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \qquad
h_2 = \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \qquad
h_3 = \frac{1}{2}\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}

and the related two-dimensional masks are:

\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad
\frac{1}{6}\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \quad
\frac{1}{6}\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}

\frac{1}{6}\begin{bmatrix} 1 & 1 & 1 \\ -2 & -2 & -2 \\ 1 & 1 & 1 \end{bmatrix} \quad
\frac{1}{4}\begin{bmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -1 \end{bmatrix} \quad
\frac{1}{4}\begin{bmatrix} 1 & -2 & 1 \\ 1 & -2 & 1 \\ 1 & -2 & 1 \end{bmatrix}

\frac{1}{4}\begin{bmatrix} 1 & 0 & -1 \\ -2 & 0 & 2 \\ 1 & 0 & -1 \end{bmatrix} \quad
\frac{1}{4}\begin{bmatrix} -1 & 2 & -1 \\ 0 & 0 & 0 \\ 1 & -2 & 1 \end{bmatrix} \quad
\frac{1}{4}\begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}     (3.77)

These masks too can be extended to dimensions 5 × 5, similarly to those previously calculated.

3.10.1 Spatial Filtering with Gabor Filters

Another spatial filtering approach to extract texture features is based on Gabor filters [33]. These filters are widely used for the analysis of texture, motivated by their spatial localization, orientation selectivity, and frequency characteristics. Gabor filters can be seen as precursors of wavelets (see Sect. 2.12 Vol. II), where each filter captures energy at a particular frequency and for a specific direction. Their diffusion is motivated by mathematical properties and by neurophysiological evidence. In 1946 Gabor showed that the specificity of a signal, simultaneously in time and frequency, is fundamentally limited by a lower bound given by the product of its bandwidth and duration. This limit is \Delta x \, \Delta \omega \geq \frac{1}{4\pi}. Furthermore, he found that signals of the form

s(t) = \exp\left( -\frac{t^2}{\alpha^2} + j\omega t \right)

reach this theoretical limit. The Gabor functions form a complete set of basis (non-orthogonal) functions and allow any function to be expanded in terms of these basis functions. Subsequently, Gabor's functions were generalized to two-dimensional space [34,35] to model the profile of the receptive fields of simple cells of the primary visual cortex (also known as striate cortex or V1)5.

5 Psychovisual redundancy studies indicate that the human visual system processes images at different scales. In the early stages of vision, the brain performs a sort of analysis in different spatial frequencies, and consequently, the visual cortex is composed of different cells that correspond to different frequencies and orientations. It has also been observed that the responses of these cells are similar to those of the Gabor functions. This multiscale process, which successfully takes place in human vision for texture perception, has motivated the development of texture analysis methods that mimic the mechanisms of human vision.

As we shall see, these functions are essentially bandpass filters that can be treated either in the 2D spatial domain or in the 2D Fourier domain. These specific properties of the 2D Gabor functions have motivated research to describe and discriminate the texture of images using the power spectrum calculated with Gabor filters [36]. In essence, it is verified that the texture characteristics found with this method are locally spatially invariant. Let us now see how it is possible to define a bank of 2D Gabor filters to capture the energy of the image and detect texture measurements at a particular frequency and in a specified direction. In the 2D spatial domain, the canonical elementary Gabor function h(x, y) is a complex harmonic function (i.e., composed of the sine and cosine functions) modulated by an oriented Gaussian function g(x_o, y_o), given in the form

h(x, y) = g(x_o, y_o) \cdot \exp\left[ 2\pi j (Ux + Vy) \right]     (3.78)

where j = \sqrt{-1}, (U, V) represents a particular 2D frequency in the frequency domain (u, v), and (x_o, y_o) = (x\cos\theta + y\sin\theta, \; -x\sin\theta + y\cos\theta) represents the geometric transformation of the coordinates that rotates the Gaussian g(x_o, y_o) by an angle \theta with respect to the x axis. The 2D oriented Gaussian function is given by

g(x_o, y_o) = \frac{1}{2\pi\gamma\sigma} \exp\left[ -\frac{(x_o/\gamma)^2 + y_o^2}{2\sigma^2} \right]     (3.79)

where \gamma is the spatial aspect ratio, which specifies the ellipticity of the 2D Gaussian (the support of the Gabor function), and \sigma is the standard deviation of the Gaussian, which characterizes the extension (scale) of the filter in the spatial domain and the band in the Fourier domain. If \gamma = 1, then the angle \theta is no longer relevant because the Gaussian (3.79) acquires circular symmetry, simplifying the filter (3.78). The Gabor filter h(x, y) in the Fourier domain is given by

H(u, v) = \exp\left\{ -2\pi^2\sigma^2 \left[ (u_o - U_o)^2\gamma^2 + (v_o - V_o)^2 \right] \right\}     (3.80)

where (u_o, v_o) = (u\cos\theta + v\sin\theta, \; -u\sin\theta + v\cos\theta) and (U_o, V_o) is produced by a similar rotation of \theta, in the frequency domain, with respect to the u axis. Furthermore, (3.80) indicates that H(u, v) is a Gaussian bandpass filter, rotated by an angle \theta



Fig. 3.13 Gabor filter in the frequency domain with the elliptical support centered in F(U, V )


with respect to the u axis, with aspect ratio 1/\gamma. The complex exponential represents a complex 2D harmonic with radial central frequency

F = \sqrt{U^2 + V^2}     (3.81)

and orientation given by

\phi = \tan^{-1}\frac{V}{U}     (3.82)

where \phi is the orientation angle of the sinusoidal harmonic, with respect to the frequency axis u, in the Fourier domain (u, v) (see Fig. 3.13). Figure 3.14 shows instead the 3D and 2D graphical representation of the real and imaginary components of a Gabor function. Although Gabor filters may have a modulating Gaussian support with arbitrary direction, in many applications it is useful that the modulating Gaussian function has the same orientation as the complex sinusoidal harmonic, i.e., \theta = \phi. In that case, (3.78) and (3.80) reduce, respectively, to

h(x, y) = g(x_o, y_o) \cdot \exp\left[ 2\pi j F x_o \right]     (3.83)

and

H(u, v) = \exp\left\{ -2\pi^2\sigma^2 \left[ (u_o - F)^2\gamma^2 + v_o^2 \right] \right\}     (3.84)

At this point, it is understood that the appropriate definition of the Gabor filters is expressed in terms of their spatial frequency and orientation bandwidths. It is observed from (3.78) that the Gabor function responds significantly to a limited range of signals that form a repetitive structure in some direction (usually coinciding with the orientation of the filter) and are associated with certain frequencies (i.e., the filter band). For (3.78) to be useful, the frequency domain must be covered by the filter bank in terms of radial frequencies and orientation bandwidths, so that the impulse responses of the filters characterize the texture present in the image. Figure 3.15 shows schematically, in the 2D Fourier domain, the arrangement of the responses of a Gabor filter bank, where each elliptical region represents the range of frequencies and orientations for which some filters respond with a strong signal.


Fig. 3.14 Perspective representation of the real component (cosine) and of the imaginary component (sine) of a Gabor function with a unitary aspect ratio


Fig. 3.15 Support in the frequency domain of the Gabor filter bank. Each elliptical region represents a range of frequencies for which some filters respond strongly. Regions that lie on the same ring correspond to filters with the same radial frequency, while regions at different distances from the origin but with identical direction correspond to different scales. In the example shown on the left, the filter bank has 3 scales and 3 directions. The figure on the right shows the frequency responses in the spectral domain H(u, v) of a filter bank with 5 scales and 8 directions

Regions included in a ring correspond to filters with the same radial frequency, while regions at different distances from the origin but with the same direction correspond to filters with different scales. The goal in defining the filter bank is to map the different textures of an image to the appropriate region that represents the filter's characteristics in terms of frequency and direction. Gabor's basic 2D functions are generally spatially localized, oriented, and with an octave bandwidth.6

6 We recall that it is customary to divide the bands with constant percentage amplitudes. Each band is characterized by a lower frequency f_i, a higher frequency f_s, and a central frequency f_c. The most frequently used bandwidths are the octave bands, where the lower and upper extremes are in the ratio 1:2, i.e., f_s = 2 f_i. The percentage bandwidth is given by (f_s - f_i)/f_c = constant, with f_c = \sqrt{f_i \cdot f_s}. There are also 1/3-octave bands, with f_s = \sqrt[3]{2} \cdot f_i, where the width of each band is narrower, equal to 23.2% of the nominal central frequency of each band.


The frequency bandwidth B and the orientation bandwidth \Omega (expressed, respectively, in octaves and radians) at half the peak response of the Gabor filter given by (3.83) are (see Fig. 3.16):

B = \log_2 \frac{\pi F \gamma \sigma + \alpha}{\pi F \gamma \sigma - \alpha}     (3.85)

\Omega = 2 \tan^{-1} \frac{\alpha}{\pi F \sigma}     (3.86)

where \alpha = \sqrt{(\ln 2)/2}. A bank of Gabor filters of arbitrary direction and bandwidth can be defined by varying the 4 free parameters \theta, F, \sigma, \gamma (or \Omega, B, \sigma, \gamma) and extending the elliptical regions of the spatial frequency domain with the major axis passing through the origin (see Fig. 3.15). In general, one tends to cover the frequency domain with a limited number of filters and to minimize the overlap of the filter support regions. From (3.83) we observe that, owing to the sinusoidal component, the Gabor function h(x, y) is a complex function with a real and an imaginary part. The sinusoidal component is given by

\exp(2\pi j F x_o) = \cos(2\pi F x_o) + j \sin(2\pi F x_o)     (3.87)

and the real (cosine) and imaginary (sine) components of h(x, y) are (see Fig. 3.14)

h_{c,F,\theta}(x, y) = g(x_o, y_o) \cos(2\pi F x_o)     (3.88)

h_{s,F,\theta}(x, y) = g(x_o, y_o) \sin(2\pi F x_o)     (3.89)

The functions h_{c,F,\theta} and h_{s,F,\theta} are, respectively, even (symmetric with respect to the x axis) and odd (symmetric with respect to the origin), and are oriented in the direction \theta. To obtain the Gabor texture measurements T_{F,\theta}, an image I(x, y) is filtered with the Gabor filters (3.88) and (3.89) through the convolution operation, as follows:

T_{c,F,\theta}(x, y) = I(x, y) \ast h_{c,F,\theta}(x, y) \qquad T_{s,F,\theta}(x, y) = I(x, y) \ast h_{s,F,\theta}(x, y)     (3.90)

The results of the two convolutions are almost identical, apart from a phase difference of \pi/2 in the \theta direction. From the obtained texture measures T_{c,F,\theta} and T_{s,F,\theta} it is useful to calculate the energy E_{F,\theta} and the amplitude A_{F,\theta}




Fig. 3.16 Detail of the bandwidth B and of the orientation bandwidth \Omega of the frequency domain support of a Gabor filter, expressed by (3.83), whose real and imaginary components in the spatial domain are also represented

E_{F,\theta}(x, y) = T_{c,F,\theta}^2(x, y) + T_{s,F,\theta}^2(x, y)     (3.91)

A_{F,\theta}(x, y) = \sqrt{E_{F,\theta}(x, y)}     (3.92)

while the average energy calculated over the entire image is given by

\hat{E}_{F,\theta} = \frac{1}{N} \sum_{x} \sum_{y} E_{F,\theta}(x, y)     (3.93)

where N is the number of pixels in the image.
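A minimal Python sketch (NumPy and SciPy assumed) of a single Gabor channel follows: it samples the real and imaginary kernels (3.88)-(3.89) on a discrete grid and returns the energy image of (3.91); a bank is obtained by repeating the call over several frequencies F (scales) and orientations \theta, as in Fig. 3.15. The kernel size and the omission of the normalization constant of the Gaussian envelope are assumptions of this sketch.

import numpy as np
from scipy.ndimage import convolve

def gabor_kernels(F, theta, sigma, gamma=1.0, half_size=15):
    # real (cosine) and imaginary (sine) kernels of Eqs. (3.88)-(3.89)
    y, x = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1].astype(float)
    xo =  x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yo = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope (cf. (3.79), constant factor omitted)
    g = np.exp(-((xo / gamma)**2 + yo**2) / (2 * sigma**2))
    return g * np.cos(2 * np.pi * F * xo), g * np.sin(2 * np.pi * F * xo)

def gabor_energy(image, F, theta, sigma, gamma=1.0):
    # energy image of Eq. (3.91) for one frequency/orientation channel
    hc, hs = gabor_kernels(F, theta, sigma, gamma)
    tc = convolve(image.astype(float), hc, mode='reflect')    # Eq. (3.90)
    ts = convolve(image.astype(float), hs, mode='reflect')
    return tc**2 + ts**2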

3.10.1.1 Application of Gabor Filters

The procedure used to extract texture measurements from an image based on a bank of Gabor filters depends on the type of application. If the textures to be described are already known, the goal is to select the best filters with which to calculate the energy images and to extract from these the texture measurements that best discriminate such texture models. If instead the image contains different textures, a set of texture measurements is extracted from the energy images, in relation to the number of filters defined on the basis of the scales and orientations used. This set of texture measurements is then used to segment the image with one of the algorithms described in Chap. 1, for example the K-means algorithm. The discriminating power of the texture measurements can be evaluated by calculating the Euclidean distance or using other metrics. The essential steps for calculating texture measurements with a Gabor filter bank can be the following:


1. Select the free parameters \theta, F, \sigma, \gamma, on the basis of which the number of filters is also defined.

2. Design the filter bank with the required scale and angular resolution characteristics.

3. Prepare the 2D input image I(x, y), which can be at gray levels or in color; in the latter case the RGB components are treated separately or combined into a single significant component through, for example, the principal component transform (PCA).

4. Decompose the input image using the filters with the convolution operator (3.90).

5. Extract the texture measurements from the energy images (3.91) or amplitude images (3.92). The input image and these texture measurements can be smoothed with low-pass filters to attenuate any noise. In this case, it should be remembered that the low-frequency components (representing contrast and intensity) remain unchanged while the high-frequency ones (details such as the edges) are attenuated, thus obtaining a blurred image.

6. In relation to the type of application (segmentation, classification, ...), characterize each pixel of the input image in the space of the extracted measures (features).

Figure 3.17 shows a simple application of Gabor filters to segment the texture of an image. The input image (a) has 5 types of natural textures, not completely uniform. The features of the textures are extracted with a bank of 32 Gabor filters with 4 scales and 8 directions (figure (d)). The number of available features is, therefore, 32 × 51 × 51 after subsampling the image of size 204 × 204 by a factor of 4. Figure (b) shows the result of the segmentation obtained by applying the K-means algorithm to the feature images extracted with the Gabor filter bank, while figure (c) shows the result of the segmentation obtained by applying the Gabor filters defined in (d) after reducing the features to 5 × 51 × 51 with principal component analysis (PCA) (see Sect. 2.10.1 Vol. II). Another approach to texture analysis is based on the wavelet transform, where the input image is decomposed at various levels of subsampling to extract different image


Fig. 3.17 Segmentation of 5 textures not completely uniform. a Input image; b segmented by applying K-means algorithm to the features extracted with the Gabor filter bank shown in (d); c segmented image after reducing the feature images to 5 with the PCA; d the bank of Gabor filters used with 4 scales and 8 directions


details [37,38]. Texture measurements are extracted from the energy and variance of the subsampled images. The main advantage of wavelet decomposition is that it provides a unified multiscale context analysis of the texture.
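As a sketch of the pipeline used for Fig. 3.17, the following Python fragment stacks the Gabor energy images produced by the gabor_energy helper sketched above, optionally reduces them with PCA, and clusters the pixels with K-means (scikit-learn assumed); the parameter values are illustrative and, in practice, the energy images are usually smoothed before clustering, as noted in step 5.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def segment_textures(image, frequencies, thetas, sigma=4.0,
                     n_classes=5, n_components=5):
    # one energy image per (scale, orientation) channel of the bank
    feats = [gabor_energy(image, F, t, sigma) for F in frequencies for t in thetas]
    X = np.stack([f.ravel() for f in feats], axis=1)        # pixels x features
    X = PCA(n_components=n_components).fit_transform(X)     # optional reduction
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    return labels.reshape(image.shape)

# example (illustrative values): 4 scales and 8 directions, as in Fig. 3.17
# labels = segment_textures(img, frequencies=(0.05, 0.1, 0.2, 0.4),
#                           thetas=np.arange(8) * np.pi / 8, n_classes=5)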

3.11 Syntactic Methods for Texture

The syntactic description of texture is based on the analogy between the spatial relation of texture primitives and the structure of a formal language [39]. The descriptions of the various classes of textures form a language that can be represented by a grammar, whose rules are built by analyzing the primitives of sample textures (training set) in the learning phase. The syntactic description of texture is based on the idea that the texture is composed of primitives repeated and arranged in a regular manner in the image. Syntactic methods, in order to fully describe a texture, must essentially determine the primitives and the rules by which these primitives are spatially arranged and repeated. A typical syntactic solution involves using a grammar whose rules generate the texture from primitives, using transformation rules over a limited number of symbols. The symbols represent in practice the various types of texture primitives, while the transformation rules represent the spatial relations between the primitives. The syntactic approach must, however, take into account that real-world textures are normally irregular, with errors in the repeated structures that occur in an unpredictable way and with considerable distortions. This means that the rules of the grammar may not describe real textures efficiently unless they are variable, and grammars of various types must be used (stochastic grammars).

Let us consider a simple grammar for the generation of the texture, starting with a start symbol S and applying the transformation rules, called shape rules. The texture is generated through various phases:

1. Activate the texture generation process by applying some transformation rules to the start symbol S.

2. Find a part of the texture generated in step 1 that is comparable with the left-hand side of one of the available transformation rules. A correct match must be verified between the terminal and nonterminal symbols that appear in the left-hand side of the chosen transformation rule and the corresponding terminal and nonterminal symbols of the part of the texture to which the rule is applied. If no such part of the texture is found, the algorithm ends.

3. Find an appropriate transformation that can be applied to the left-hand side of the chosen rule to make it coincide exactly with the considered texture.

4. Apply this geometric transformation to the right-hand side of the transformation rule.

5. Replace the specified part of the texture (the transformed portion that coincides with the left-hand side of the chosen rule) with the transformed right-hand side of the chosen rule.

6. Continue from step 2.

Fig. 3.18 Grammar G to generate a texture with hexagonal geometric structures

Fig. 3.19 Example of texture with hexagonal geometric structures: a texture recognized and b unrecognized texture

We explain the presented algorithm with an example for a grammar G = [V_n, V_t, R, S], where V_n is the set of nonterminal symbols, V_t the set of terminal symbols, R the set of rules, and S the start symbol. As an example of grammar, we consider the one shown in Fig. 3.18. This grammar is used to generate a texture with hexagonal structures, with hexagonal geometrical primitives that are replicated by applying the individual rules of R several times. With this grammar it is possible to analyze the image of Fig. 3.19 to recognize or reject the hexagonal texture represented. The recognition involves searching the image first for the hexagonal primitives of the texture and then checking whether each is comparable with some of those appearing on the right-hand side of the transformation rules R. In essence, the recognition process takes place by applying the rules to a given texture in the reverse direction, until the initial shape is reproduced.

3.12 Method for Describing Oriented Textures

In different applications, we are faced with so-called oriented textures, that is, primitives characterized by a local orientation selectivity that varies across the image. In other words, the texture shows a dominant local orientation, and in this case we speak of a texture with a high degree of local anisotropy. To describe and visualize this type of texture it is convenient to think of the gray-level image as representing a flow map, where each pixel represents a fluid element subjected to a motion in the dominant direction of the texture, that is, in the direction of maximum variation of the gray levels. In analogy with the study of fluid dynamics, where each particle is subject to a velocity vector composed of its magnitude and direction, also in the case of images with oriented textures we can define a texture orientation field, simply called Oriented Texture Field (OTF), which is actually composed of two images: the orientation image and the coherence image. The orientation image includes the local orientation information of the texture for each pixel, while the coherence image represents the degree of anisotropy


at each pixel of the image. The images of the oriented texture fields, as proposed by Rao [40], are calculated in the following five phases:

1. Gaussian filtering to attenuate the noise present in the image;
2. Calculation of the Gaussian gradient of the image;
3. Estimation of the local orientation angle using the inverse tangent function;
4. Calculation of the average of the local orientation estimates over a given window centered on the pixel being processed;
5. Calculation of a coherence estimate (texture flow information) for each image point.

The first two phases, as is known, are realized with standard edge extraction algorithms (see Chap. 1 Vol. II). Recall that the Gaussian gradient operator is an optimal solution for edge extraction. We also specify that the Gaussian filter is characterized by the standard deviation \sigma_1 of the Gaussian distribution, which defines the level of detail with which the geometric figures of the texture are extracted. This parameter, therefore, indicates the degree of detail (scale) of the texture to be extracted. In the third phase, the local orientation of the texture is calculated by means of the inverse tangent function, which requires only one argument as input and provides as output a unique result in the interval (-\pi/2, \pi/2). With edge extraction algorithms, the maximum gradient direction is normally calculated with the arctangent function that requires two arguments, which does not provide a unique result. In the fourth phase, the orientation estimates are smoothed by a Gaussian filter with standard deviation \sigma_2. This second filter must have a greater standard deviation than the previous one (\sigma_2 > \sigma_1) and must produce a significant leveling between the various orientation estimates. The value of \sigma_2 must, however, be smaller than the distance within which the orientation of the texture has its widest variations and, finally, it must not attenuate (blur) the details of the texture itself. The fifth phase calculates the texture coherence with respect to the dominant local orientation estimate, i.e., the normalized direction onto which most of the directional vectors of the neighboring pixels are projected. If the orientations are coherent, the normalized projections will have a value close to unity; in the opposite case, the projections tend to cancel each other, producing a result close to zero.

3.12.1 Estimation of the Dominant Local Orientation

Consider an area of the image with different segments whose orientations indicate the local arrangement of the texture. One could take as the dominant direction the one corresponding to the resultant vector sum of the single local directions. This approach would have the disadvantage of not being able to determine a single direction, since there would be the two angles \theta and \theta + \pi. Another drawback would occur if we considered oriented segments, as some of these, having opposite signs, would cancel each other out instead of contributing to the estimation of the dominant orientation.


Fig. 3.20 Calculation of the dominant local orientation θ considering the orientation of the gradient of the neighboring pixels


Rao suggests the following solution (see Fig. 3.20). Let N be the number of local segments, and consider a line oriented at an angle \theta with respect to the horizontal axis x. Consider a segment j with angle \theta_j, and denote by R_j its length. The sum of the absolute values of the projections of all the segments onto this line is given by

S_1 = \sum_{j=1}^{N} |R_j \cdot \cos(\theta_j - \theta)|     (3.94)

where S_1 varies with the orientation \theta of the considered line. The dominant orientation is obtained for the value of \theta where S_1 is maximum. In this case, \theta is calculated by setting the derivative of the function S_1 with respect to \theta to zero. To eliminate the problem of differentiating the absolute value function (not differentiable), it is convenient to consider and differentiate the following sum S_2:

S_2 = \sum_{j=1}^{N} R_j^2 \cdot \cos^2(\theta_j - \theta)     (3.95)

which, differentiated with respect to \theta, gives

\frac{dS_2}{d\theta} = -\sum_{j=1}^{N} 2 R_j^2 \cdot \cos(\theta_j - \theta) \sin(\theta_j - \theta)

Recalling the trigonometric double-angle formulas and then the sine addition formulas, and setting the derivative equal to zero, we obtain the following equations:

-\sum_{j=1}^{N} R_j^2 \cdot \sin 2(\theta_j - \theta) = 0 \quad \Longrightarrow \quad \sum_{j=1}^{N} R_j^2 \cdot \sin 2\theta_j \cos 2\theta = \sum_{j=1}^{N} R_j^2 \cdot \cos 2\theta_j \sin 2\theta

from which

\tan 2\theta = \frac{\sum_{j=1}^{N} R_j^2 \cdot \sin 2\theta_j}{\sum_{j=1}^{N} R_j^2 \cdot \cos 2\theta_j}     (3.96)

If we denote by \theta' the value of \theta for which the maximum value of S_2 is obtained, this coincides with the best estimate of the local dominant orientation.

Fig. 3.21 Coherence calculation of texture flow fields

Let us now see how the previous equation (3.96) is used for the calculation of the dominant orientation at each pixel of the image. Let us denote by g_x and g_y the horizontal and vertical components of the gradient at each point of the image, and consider the complex quantity g_x + i g_y, which constitutes the representation of that pixel in the complex plane. The gradient vector at a point (m, n) of the image can be represented in polar coordinates as R_{m,n} e^{i\theta_{m,n}}. At this point we can calculate the dominant local orientation angle \theta for a neighborhood of (m, n) defined by an N × N pixel window as follows:

\theta = \frac{1}{2} \tan^{-1} \frac{\sum_{m=1}^{N} \sum_{n=1}^{N} R_{m,n}^2 \cdot \sin 2\theta_{m,n}}{\sum_{m=1}^{N} \sum_{n=1}^{N} R_{m,n}^2 \cdot \cos 2\theta_{m,n}}     (3.97)

The dominant orientation θ in the point (m, n) is given by θ + π/2 because the gradient vector is perpendicular to the direction of anisotropy.

3.12.2 Texture Coherence

Let G(x, y) be the magnitude of the gradient calculated in phase 2 (see Sect. 3.12) at the point (x, y) of the image plane. The measure of coherence at the point (x_0, y_0) is calculated by considering a window of size W × W centered at this point (see Fig. 3.21). For each point (x_i, y_j) of the window, the gradient vector G(x_i, y_j), considered in the direction \theta(x_i, y_j), is projected onto the unit vector in the direction \theta(x_0, y_0). In other words, the projected gradient vector is given by

G(x_i, y_i) \cdot \cos[\theta(x_0, y_0) - \theta(x_i, y_i)]

The normalized sum of the absolute values of these projections of the gradient vectors included in the window is considered as an estimate \kappa of the coherence measure:

\kappa = \frac{\sum_{(i,j) \in W} |G(x_i, y_i) \cdot \cos[\theta(x_0, y_0) - \theta(x_i, y_i)]|}{\sum_{(i,j) \in W} G(x_i, y_i)}     (3.98)


Fig. 3.22 Calculation of the orientation map and coherence measurement for two images with vertical and circular dominant textures

This measure is correlated with the dispersion of the directional data. A better coherence measure \rho is obtained by weighting the value of the estimate \kappa, given by the previous (3.98), with the magnitude of the gradient at the point (x_0, y_0):

\rho = G(x_0, y_0) \frac{\sum_{(i,j) \in W} |G(x_i, y_i) \cdot \cos[\theta(x_0, y_0) - \theta(x_i, y_i)]|}{\sum_{(i,j) \in W} G(x_i, y_i)}     (3.99)

In this way the coherence presents high values in correspondence with high values of the gradient, i.e., where there are strong local variations of intensity in the image (see Fig. 3.22).
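A minimal Python sketch of the five phases (NumPy and SciPy assumed) follows: Gaussian derivatives give the gradient, Eq. (3.97) gives the dominant orientation, and the coherence of (3.99) is evaluated pixel by pixel over the window; the window size, \sigma_1, and the function name are illustrative assumptions, and no special treatment of the image borders is included.

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def oriented_texture_field(f, sigma1=1.0, window=15):
    f = f.astype(float)
    gx = gaussian_filter(f, sigma1, order=(0, 1))   # horizontal Gaussian derivative
    gy = gaussian_filter(f, sigma1, order=(1, 0))   # vertical Gaussian derivative
    G = np.hypot(gx, gy)                            # gradient magnitude
    theta_g = np.arctan2(gy, gx)                    # gradient direction
    R2 = G**2

    # Eq. (3.97): dominant orientation from the R^2-weighted doubled angles
    s = uniform_filter(R2 * np.sin(2 * theta_g), size=window)
    c = uniform_filter(R2 * np.cos(2 * theta_g), size=window)
    theta = 0.5 * np.arctan2(s, c) + np.pi / 2      # perpendicular to the gradient

    # Eq. (3.99): coherence, evaluated window by window
    rho = np.zeros_like(f)
    half = window // 2
    for i in range(half, f.shape[0] - half):
        for j in range(half, f.shape[1] - half):
            Gw = G[i - half:i + half + 1, j - half:j + half + 1]
            tw = theta_g[i - half:i + half + 1, j - half:j + half + 1]
            num = np.abs(Gw * np.cos(theta_g[i, j] - tw)).sum()
            rho[i, j] = G[i, j] * num / (Gw.sum() + 1e-12)
    return theta, rho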

3.12.3 Intrinsic Images with Oriented Texture

The coherence and dominant orientation images calculated above can be considered intrinsic images (primal sketch) according to the paradigm of Marr [41], which we will describe in another chapter.7 These images are obtained with an approach independent of the application domain. They are also independent of the lighting conditions. Certain conditions can be imposed, in relation to the type of application, to produce appropriate intrinsic images. These intrinsic images find use in the inspection of defects in the industrial automation sector (wood

7 Primal sketch indicates the first information that the human visual system extracts from the scene; in the context of image processing, it corresponds to the first features extracted, such as edges, corners, homogeneous regions, etc. A primal sketch image can be thought of as equivalent to the significant strokes that an artist draws as his expression of the scene.


defects, skins, textiles, etc.). From these intrinsic images, it is possible to model primitives of oriented textures (spirals, ellipses, radial structures) to facilitate image segmentation and interpretation.

3.13 Tamura’s Texture Features Tamura et al. in [42] described an approach based on psychological experiments to extract the texture features that correspond to human visual perception. The set of perceptive features of texture proposed are six: coarseness, contrast, directionality, linelikeness, regularity, roughness. Coarseness: has a direct relation to the scale and repetition frequency of primitives (textels), i.e., it is related to the distance of high spatial variations of gray level of textural structures. The coarseness is referred to in [42] as the fundamental characteristic of the texture. The extremes of coarseness property are coar se and f ine. These properties help to identify texture macrostructures and microstructures, respectively. Basically, the measure of coarseness is calculated using local operators with windows of various sizes. A local operator with a large window can be used for coarse textures while operators with small windows are adequate for fine textures. The measures of coarseness are calculated as follows: 1.

1. A sort of moving average is calculated on windows of variable size (see Fig. 3.23). The size of these windows is chosen as a power of two, i.e., 2^k × 2^k for k = 0, 1, ..., 5. The average is calculated by centering each window on each pixel (x, y) of the input image f(x, y) as follows:

A_k(x, y) = \frac{1}{2^{2k}} \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \; \sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} f(i, j)     (3.100)

thus obtaining for each pixel six average values as k = 0, 1, ..., 5 varies.

2. For each pixel, the absolute differences E_k(x, y) are calculated between pairs of averages obtained from windows that do not overlap (see Fig. 3.24), both in the horizontal direction (E_{k,h}) and in the vertical direction (E_{k,v}), respectively given by

E_{k,h}(x, y) = |A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y)|     (3.101)

E_{k,v}(x, y) = |A_k(x, y + 2^{k-1}) - A_k(x, y - 2^{k-1})|     (3.102)

3. For each pixel, choose the value of k that maximizes the difference E_k in both directions, so as to obtain the highest difference value:

S_{best}(x, y) = 2^k \quad \text{with} \quad k = \arg\max_{k=1,...,5} \; \max_{d=h,v} E_{k,d}(x, y)     (3.103)

3.13 Tamura’s Texture Features

4.

309

The final measure of coarseness Tcr s is calculated by averaging Sbest over the entire image M N 1  Tcr s = Sbest (i, j) (3.104) M·N i=1 j=1
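A minimal Python sketch of steps 1-4 (NumPy and SciPy assumed) is given below; the local averages A_k are computed with a uniform filter, the shifted differences approximate (3.101)-(3.102) (with a wrap-around simplification at the borders), and the global mean of S_best gives T_crs as in (3.104).

import numpy as np
from scipy.ndimage import uniform_filter

def tamura_coarseness(f, kmax=5):
    f = f.astype(float)
    E = []
    for k in range(1, kmax + 1):
        A = uniform_filter(f, size=2**k)            # local averages, Eq. (3.100)
        d = 2**(k - 1)
        Eh = np.abs(np.roll(A, -d, axis=1) - np.roll(A, d, axis=1))   # Eq. (3.101)
        Ev = np.abs(np.roll(A, -d, axis=0) - np.roll(A, d, axis=0))   # Eq. (3.102)
        E.append(np.maximum(Eh, Ev))
    E = np.stack(E)                                 # shape (kmax, rows, cols)
    k_best = np.argmax(E, axis=0) + 1               # Eq. (3.103)
    return (2.0 ** k_best).mean()                   # T_crs, Eq. (3.104)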

Contrast: considered by Tamura another important feature, according to the experimental evaluations of the perceived visual contrast. By definition, the contrast generically indicates the visual quality of an image. More in detail, as proposed by Tamura, the contrast is influenced by the following factors:

1. the dynamics of the gray levels;
2. the accumulation of the gray-level distribution z toward the low values (black) or toward the high values (white) of the relative histogram;
3. the sharpness of the edges;
4. the repetition period of the structures (enlarged structures appear with low contrast, while reduced-scale structures are more contrasted).

These 4 factors are considered separately to develop the estimation of the contrast measure. In particular, to evaluate the polarization of the distribution of gray levels, the statistical index of kurtosis is considered, which detects how flat or peaked a distribution is with respect to the normal distribution.8 To consider the dynamics of the gray levels (factor 1), Tamura includes the variance \sigma^2 in the contrast calculation. The contrast measure T_{con} is defined as follows:

T_{con} = \frac{\sigma}{\alpha_4^m}     (3.105)

with

\alpha_4 = \frac{\mu_4}{\sigma^4}     (3.106)

8 Normally a distribution is evaluated with respect to the normal distribution by considering two indexes, the asymmetry (or skewness) index \gamma_1 = \mu_3 / \mu_2^{3/2} and the kurtosis index \gamma_2 = \mu_4 / \mu_2^2 - 3, where \mu_n indicates the central moment of order n. From the analysis of the two indexes we detect the deviation of a distribution from the normal one:

• \gamma_1 < 0: negative asymmetry, i.e., the left tail of the distribution is very long;
• \gamma_1 > 0: positive asymmetry, i.e., the right tail of the distribution is very long;
• \gamma_2 < 0: the distribution is platykurtic, i.e., very flat compared to the normal;
• \gamma_2 > 0: the distribution is leptokurtic, i.e., much more peaked than the normal;
• \gamma_2 = 0: the distribution is mesokurtic, meaning that it has a kurtosis statistic similar to that of the normal distribution.


where \mu_n indicates the central moment of order n (see Sect. 8.3.2 Vol. I) and m = 1/4 was determined experimentally by Tamura in 1978 as the value that produces the best results. It is pointed out that the contrast measure (3.105) is based on the kurtosis index \alpha_4 defined by Tamura with (3.106), which is different from the one currently used, as reported in footnote 8. The contrast measure expressed by (3.105) does not include factors 3 and 4 above.

Directionality: intended not as an orientation in itself, but as the presence of orientation in the texture. This is because it is not always easy to describe the orientation of a texture, while it is easier to assess whether two textures differ only in orientation, in which case their directionality property can be considered the same. The directionality measure is evaluated taking into consideration the magnitude and the direction of the edges. Using the Prewitt operator (for edge extraction, see Sect. 1.7 Vol. II), the directionality is estimated with the horizontal derivatives \Delta_x(x, y) and the vertical derivatives \Delta_y(x, y), calculated from the convolution of the image f(x, y) with the 3 × 3 Prewitt kernels, and then evaluating, for each pixel (x, y), the magnitude |\Delta| and the direction \theta of the edge:

|\Delta| = \sqrt{\Delta_x^2 + \Delta_y^2} \cong |\Delta_x| + |\Delta_y| \qquad \theta = \tan^{-1}\frac{\Delta_y}{\Delta_x} + \frac{\pi}{2}     (3.107)

Subsequently, a histogram H_{dir}(\theta) is constructed from the quantized direction values (normally 16 directions), evaluating the frequency of the edge pixels whose magnitude exceeds a certain threshold. The histogram is relatively uniform for images without strong orientations, and presents distinct peaks for images with oriented texture. The directionality measure T_{dir} proposed by Tamura considers the sum of the second-order moments of H_{dir} relative only to the values around the peaks of the histogram, between adjacent valleys, given by

T_{dir} = 1 - r \cdot n_p \sum_{p=1}^{n_p} \sum_{\theta \in w_p} (\theta - \theta_p)^2 H_{dir}(\theta)     (3.108)

where n_p indicates the number of peaks, \theta_p is the position of the p-th peak, w_p indicates the range of angles included in the p-th peak (i.e., the interval between the valleys adjacent to the peak), r is a normalization factor associated with the quantized values of the angles \theta, and \theta is the quantized angle. Alternatively, instead of considering only the values around the peaks, we can take as the directionality measure T_{dir} the sum of the second-order moments of all the values of the histogram.

Line-likeness: defines a local texture structure composed of lines, such that when an edge and the direction of the nearby edges are almost equal, they are considered similar linear structures. The line-likeness measure is calculated similarly to the co-occurrence matrix GLCM (described in Sect. 3.4.4), except that in this case the frequency of the directional co-occurrence is calculated between edge pixels that are at a distance d and have similar direction (more precisely, if the orientation of

3.13 Tamura’s Texture Features

311

the relative edges is kept within an orientation interval). In the computation of the directional co-occurrence matrix P_{Dd}, only edges with magnitude higher than a predefined threshold are considered, filtering out the weak ones. The directional co-occurrence is weighted with the cosine of the difference of the angles of the edge pair; in this way, the co-occurrences in the same direction are measured with +1 and those in perpendicular directions with -1. The line-likeness measure T_{lin} is given by

T_{lin} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} P_{Dd}(i, j) \cos\left[(i - j)\frac{2\pi}{n}\right]}{\sum_{i=1}^{n} \sum_{j=1}^{n} P_{Dd}(i, j)}     (3.109)

where the directional co-occurrence matrix P_{Dd}(i, j) has dimensions n × n and was calculated by Tamura using the distance d = 4 pixels.

Regularity: intended as a measure that captures information on the spatial regularity of the texture structures. A texture without repetitive spatial variations is considered regular, unlike a texture that has strong spatial variations, observed as irregular. The regularity measure proposed by Tamura is derived from the combination of the previous texture measures of coarseness, contrast, directionality, and line-likeness. These 4 measures are calculated by partitioning the image into regions of equal size, obtaining a vector of measures for each region. The regularity measure is thought of as a measure of the variability of the 4 measures over the entire image (i.e., over all regions); a small variation in the first 4 measures indicates a regular texture. Therefore, the regularity measure T_{reg} is defined as follows:

T_{reg} = 1 - r(\sigma_{crs} + \sigma_{con} + \sigma_{dir} + \sigma_{lin})     (3.110)

where \sigma_{xxx} is the standard deviation of each of the previous 4 measures and r is a normalization factor useful to compensate for the different image sizes. The standard deviations of the 4 measures are calculated using the values of the measures derived from each region (sub-image).

Roughness: is a property that recalls touch rather than visual perception. Nevertheless, it can be a useful property to describe the visual texture. Tamura's experiments motivate the measure of roughness not so much by the visual perception of the variation of the gray levels of the image as by the tactile imagination, i.e., the sensation of physically touching the texture. From Tamura's psychological experiments no physical-mathematical models emerged from which to derive roughness measures. A rough approximate measure of the roughness T_{rgh} is proposed, based on the combination of the coarseness and contrast measures:

T_{rgh} = T_{crs} + T_{con}     (3.111)
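The contrast and directionality measures can be sketched in Python as follows (NumPy and SciPy assumed). For the directionality, the simpler variant mentioned in the text is used: the second-order moment of the whole orientation histogram about its dominant peak is returned, leaving out the peak/valley analysis and the normalization r · n_p of (3.108), which would combine such terms over all the detected peaks. The magnitude threshold and the number of bins are illustrative assumptions.

import numpy as np
from scipy.ndimage import prewitt

def tamura_contrast(f):
    # Eqs. (3.105)-(3.106) with m = 1/4
    f = f.astype(float)
    sigma2 = f.var()
    alpha4 = ((f - f.mean())**4).mean() / sigma2**2      # kurtosis index
    return np.sqrt(sigma2) / alpha4**0.25

def tamura_directionality_moment(f, n_bins=16, threshold=12.0):
    # second-order moment of the edge-direction histogram H_dir about its
    # dominant peak (simplified variant of Eq. (3.108))
    f = f.astype(float)
    dh = prewitt(f, axis=1)                              # horizontal derivative
    dv = prewitt(f, axis=0)                              # vertical derivative
    mag = np.abs(dh) + np.abs(dv)                        # Eq. (3.107), approximate
    theta = np.arctan2(dv, dh) % np.pi                   # directions folded in [0, pi)
    bins = np.minimum((theta * n_bins / np.pi).astype(int), n_bins - 1)
    hist = np.bincount(bins[mag > threshold], minlength=n_bins).astype(float)
    hist /= hist.sum() + 1e-12
    centers = (np.arange(n_bins) + 0.5) * np.pi / n_bins
    peak = centers[np.argmax(hist)]
    return np.sum((centers - peak)**2 * hist)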

Figure 3.25 shows the first 3 Tamura texture measurements estimated for some sample images. Tamura texture measurements are widely used in image retrieval applications based on the visual attributes contained in the image (known in the literature as



Fig. 3.23 Generation of the set {A_k}_{k=0,1,...,5} of average images at different scales for each pixel of the input image


Fig. 3.24 From the set {A_k}_{k=0,1,...,5} of average images at different scales k, for each pixel P(x, y) the absolute differences E_{k,h} and E_{k,v} of the averages are calculated between non-overlapping windows on opposite sides, respectively in the horizontal direction h and in the vertical direction v, as shown in the figure

T_crs = 6.2593 T_con = 0.2459 T_dir = 0.0045

T_crs = 13.257 T_con = 0.3923 T_dir = 1.5493

T_crs = 31.076 T_con = 0.3754 T_dir = 0.0068

T_crs = 14.987 T_con = 0.6981 T_dir = 1.898

Fig. 3.25 The first 3 Tamura texture measures (coarseness, contrast, and directionality) calculated for some types of images

3.13 Tamura’s Texture Features

313

CBIR, Content-Based Image Retrieval) [43]. They have, however, a limited ability to discriminate fine textures. Often the first three Tamura measurements are used, treated as a 3D image whose three components, Coarseness-coNtrast-Directionality (CND), are considered in analogy with the RGB components. One-dimensional and multidimensional histograms can be calculated from the CND image. More accurate measurements can be calculated using other edge extraction operators (for example, Sobel). Tamura measurements have also been extended to deal with 3D images [44].

References

1. R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification. IEEE Trans. Syst. Man Cybern. 3(6), 610–621 (1973)
2. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
3. B. Julesz, Textons, the elements of texture perception, and their interactions. Nature 290, 91–97 (1981)
4. R. Rosenholtz, Texture perception, in Oxford Handbook of Perceptual Organization, ed. by J. Wagemans (Oxford University Press, 2015), pp. 167–186. ISBN 9780199686858
5. T. Caelli, B. Julesz, E.N. Gilbert, On perceptual analyzers underlying visual texture discrimination: Part II. Biol. Cybern. 29(4), 201–214 (1978)
6. J.R. Bergen, E.H. Adelson, Early vision and texture perception. Nature 333(6171), 363–364 (1988)
7. R. Rosenholtz, Computational modeling of visual texture segregation, in Computational Models of Visual Processing, ed. by M. Landy, J.A. Movshon (MIT Press, Cambridge, MA, 1991), pp. 253–271
8. R. Haralick, Statistical and structural approaches to texture. Proc. IEEE 67(5), 786–804 (1979)
9. Y. Chen, E. Dougherty, Grey-scale morphological granulometric texture classification. Opt. Eng. 33(8), 2713–2722 (1994)
10. C. Lu, P. Chung, C. Chen, Unsupervised texture segmentation via wavelet transform. Pattern Recognit. 30(5), 729–742 (1997)
11. A.K. Jain, F. Farrokhnia, Unsupervised texture segmentation using Gabor filters. Pattern Recognit. 24(12), 1167–1186 (1991)
12. A. Bovik, M. Clark, W. Geisler, Multichannel texture analysis using localised spatial filters. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 55–73 (1990)
13. A. Pentland, Fractal-based description of natural scenes. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 661–674 (1984)
14. G. Lowitz, Can a local histogram really map texture information? Pattern Recognit. 16(2), 141–147 (1983)
15. R. Lerski, K. Straughan, L. Schad, D. Boyce, S. Blüml, I. Zuna, MR image texture analysis: an approach to tissue characterisation. Magn. Reson. Imaging 11, 873–887 (1993)
16. W.K. Pratt, Digital Image Processing, 2nd edn. (Wiley, 1991). ISBN 0-471-85766-1
17. S.W. Zucker, D. Terzopoulos, Finding structure in co-occurrence matrices for texture analysis. Comput. Graphics Image Process. 12, 286–308 (1980)
18. L. Alparone, F. Argenti, G. Benelli, Fast calculation of co-occurrence matrix parameters for image segmentation. Electron. Lett. 26(1), 23–24 (1990)
19. L.S. Davis, A. Mitiche, Edge detection in textures. Comput. Graphics Image Process. 12, 25–39 (1980)
20. M.M. Galloway, Texture classification using grey level run lengths. Comput. Graphics Image Process. 4, 172–179 (1975)
21. X. Tang, Texture information in run-length matrices. IEEE Trans. Image Process. 7(11), 1602–1609 (1998)
22. A. Chu, C.M. Sehgal, J.F. Greenleaf, Use of gray value distribution of run lengths for texture analysis. Pattern Recognit. Lett. 11, 415–420 (1990)
23. B.R. Dasarathy, E.B. Holder, Image characterizations based on joint gray-level run-length distributions. Pattern Recognit. Lett. 12, 497–502 (1991)
24. G.C. Cross, A.K. Jain, Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell. 5, 25–39 (1983)
25. R. Chellappa, S. Chatterjee, Classification of textures using Gaussian Markov random fields. IEEE Trans. Acoust. Speech Signal Process. 33, 959–963 (1985)
26. J.C. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models. Pattern Recognit. 25, 173–188 (1992)
27. S. Sharma, D. Rao, R. Mohan, Classification of image at different resolution using rotation invariant model. Int. J. Innovative Res. Adv. Eng. 1(4), 109–113 (2014)
28. B.B. Mandelbrot, The Fractal Geometry of Nature (Freeman, San Francisco, 1983)
29. J.M. Keller, S. Chen, R.M. Crownover, Texture description and segmentation through fractal geometry. Comput. Vis. Graphics Image Process. 45(2), 150–166 (1989)
30. R.F. Voss, Random fractals: characterization and measurement, in Scaling Phenomena in Disordered Systems, ed. by R. Pynn, A. Skjeltorp (Plenum, New York, 1985), pp. 1–11
31. K.I. Laws, Texture energy measures, in Proceedings of the Image Understanding Workshop (1979), pp. 41–51
32. M.T. Suzuki, Y. Yaginuma, A solid texture analysis based on three-dimensional convolution kernels, in Proceedings of the SPIE, vol. 6491 (2007), pp. 1–8
33. D. Gabor, Theory of communication. IEEE Proc. 93(26), 429–441 (1946)
34. J.G. Daugman, Uncertainty relation for resolution, spatial frequency, and orientation optimized by 2D visual cortical filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985)
35. J. Malik, P. Perona, Preattentive texture discrimination with early vision mechanisms. J. Opt. Soc. Am. A 5, 923–932 (1990)
36. I. Fogel, D. Sagi, Gabor filters as texture discriminator. Biol. Cybern. 61, 102–113 (1989)
37. T. Chang, C.C.J. Kuo, Texture analysis and classification with tree-structured wavelet transform. IEEE Trans. Image Process. 2(4), 429–441 (1993)
38. J.L. Chen, A. Kundu, Rotation and gray scale transform invariant texture identification using wavelet decomposition and hidden Markov model. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 208–214 (1994)
39. M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision, 3rd edn. (CL Engineering, 2007). ISBN 978-0495082521
40. A.R. Rao, R.C. Jain, Computerized flow field analysis: oriented texture fields. IEEE Trans. Pattern Anal. Mach. Intell. 14(7), 693–709 (1992)
41. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing, in Proceedings of the Royal Society of London, Series B, Biological Sciences, vol. 211 (1981), pp. 151–180
42. H. Tamura, S. Mori, T. Yamawaki, Textural features corresponding to visual perception. IEEE Trans. Syst. Man Cybern. SMC-8(6), 460–473 (1978)
43. S.H. Shirazi, A.I. Umar, S. Naz, N. ul Amin Khan, M.I. Razzak, B. AlHaqbani, Content-based image retrieval using texture color shape and region. Int. J. Adv. Comput. Sci. Appl. 7(1), 418–426 (2016)
44. T. Majtner, D. Svoboda, Extension of Tamura texture features for 3D fluorescence microscopy, in Proceedings of 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT) (IEEE, 2012), pp. 301–307. ISBN 978-1-4673-4470-8

4 Paradigms for 3D Vision

4.1 Introduction to 3D Vision

In the previous chapters, we analyzed 2D images and developed algorithms to represent, describe, and recognize the objects of the scene. In this context, the intrinsic 3D nature of the objects was not strictly considered and, consequently, the description of the objects did not include formal 3D information. In several applications, for example, in remote sensing (image classification), in the visual inspection of defects in the industrial sector, in the analysis of microscope images, and in the recognition of characters and shapes, a 3D vision system is not necessarily required even though 3D scenes are observed. In other applications, a 3D vision system is required, i.e., a system capable of analyzing 2D images to correctly reconstruct and understand a scene typically made of 3D objects. Imagine, for example, a mobile robot that must move freely in a typically 3D environment (industrial, domestic, etc.): its vision system must be able to recognize the objects of the scene, identify unexpected obstacles and avoid them, and calculate its position and orientation with respect to fixed points of the environment visible in the scene. Another example is the vision system of robotized cells for industrial automation, where a mechanical arm, guided by the vision system, is used to pick up and release objects from a bin (the bin-picking problem); the 3D vision system must be able to locate the candidate object to be grasped and calculate its pose, even in the context of overlapping objects (picked from a stack of objects). A 3D vision system faces the fundamental difficulty typical of inverse problems: from single 2D images, which are only a two-dimensional projection of the 3D world (partial acquisition), it must be able to reconstruct the 3D structure of the observed scene and eventually define the relationships between the objects. In other words, regardless of the complexity of the algorithms, the 3D reconstruction must take place starting from 2D images that contain only partial information of the 3D world (loss of information in the 3D → 2D projection), possibly using the geometric and radiometric calibration parameters of the acquisition system (for example, a camera).


The human visual system addresses the problems of 3D vision using a binocular visual system, a remarkable richness of elementary processors (neurons), and a reconstruction model also based on prediction and a priori knowledge of the world. In the field of artificial vision, the current trend is to develop 3D systems oriented to specific domains but with characteristics that go in the direction of imitating some functions of the human visual system: for example, using systems with multiple cameras, analyzing time-varying image sequences, observing the scene from multiple points of view, and making the most of prior knowledge about the specific application. With 3D vision systems based on these features, it is possible to try to optimize the 2D to 3D inversion process, obtaining results that are as unambiguous as possible.

4.2 Toward an Optimal 3D Vision Strategy

Once the scene is reconstructed, the vision system performs the perception phase, trying to formulate hypotheses that are verified against the predicted model, evaluating their validity. If a hypothesis cannot be accepted, a new hypothesis of description of the scene is formulated until the comparison with the model is acceptable. For the formation of the hypothesis and the verification with the model, different processing steps are required for the acquired data (2D images) and for the data known a priori that represent the models of the world (the a priori knowledge for a certain domain). A 3D vision system must be incremental, in the sense that its elementary process components (tasks) can be extended to include new descriptions to represent the model and to extract new features from the images of the scene. In a vision system, the 3D reconstruction of the scene and the understanding of the scene (perception) are the highest level tasks, which are based on the results achieved by the lowest level tasks (acquisition, pre-processing, feature extraction, ...) and the intermediate ones (segmentation, clustering, etc.). The understanding of the scene can be achieved only through the cooperation of the various elementary calculation processes and through an appropriate control strategy for the execution of these processes. In fact, biological visual systems have different control strategies, including significant parallel computing capabilities, a complex computational model based on learning, remarkable adaptive capacity, and high incremental capacity in knowledge learning. An artificial vision system that imitates some functions of the biological system should include the following features:

Control of elementary processes: in the sequential and parallel context. It is not always possible to implement as a parallel process, with the available hardware, some typical algorithms of the first stages of vision (early vision).

Hierarchical control strategies: Bottom-up (Data Driven). Starting from the acquired 2D image, the vision system can perform the reconstruction and recognition of the scene through the following phases:

1. Several image processing algorithms are applied to the raw data (preprocessing and data transformation) to make them available to higher-level processes (for example, to the segmentation process).
2. Extraction of higher-level information, such as homogeneous regions corresponding to parts of objects and to objects of the scene.
3. Reconstruction and understanding of the scene based on the results of point (2) and on a priori knowledge.

Hierarchical control strategies: Top-down (Model Driven). In relation to the type of application, some assumptions and properties of the scene to be reconstructed and recognized are defined. In the various phases of the process, in a top-down way, these expected assumptions and properties are verified against the assumed model, down to the elementary data of the image. The vision system essentially verifies the internal information of the model by accepting or rejecting it. This type of approach is a goal-oriented process: a problem is decomposed into subproblems that require lower-level processes, which are recursively decomposed into other subproblems until they are accepted or rejected. The top-down approach generates hypotheses and verifies them. For a vision system based on the top-down approach, its ability consists of updating the internal model during the reconstruction/recognition phase, according to the results obtained in the hypothesis verification phase. This last phase (verification) is based on the information acquired by the low-level algorithms. The top-down approach applied to vision systems can significantly reduce computational complexity, with the additional possibility of using appropriate parallel architectures for image processing.

Hybrid control strategies: Data and Model Driven. They offer better results than the individual strategies. All high-level information is used to simplify the low-level processes, which on their own are not sufficient to solve the problem. For example, suppose we want to recognize planes at an airport from an image taken from a satellite or an airplane. Starting with the data-driven approach, it is necessary to identify the aircraft first, but at the same time high-level information can be used, since the planes will appear with a predefined shape and there is a high probability of observing them on the runways, on the connecting taxiways, and in the parking areas.

Nonhierarchical control strategies: In these cases, different experts or processes can compete or cooperate at the same level. The problem is broken down into several subproblems, for each of which some knowledge and experience is required. A nonhierarchical control requires the following steps:

1. Make the best choice based on the knowledge and current status achieved.
2. Use the results achieved with the last choice to improve and increase the available information about the problem.
3. If the goal has not been reached, return to the first step; otherwise the procedure ends.


4.3 Toward the Marr's Paradigm

Inspired by the visual systems found in nature, two aspects should be understood. The first aspect regards the nature of the input information to the biological vision system, that is, the time-varying spatial patterns associated with the light projected on the retina and the ways in which the information of the external environment, typically 3D, is encoded. The second aspect concerns the output of the biological visual system, i.e., how the external environment (objects, surfaces of objects, and events) is represented so that a living being can organize its activities. Responding adequately to these two aspects means characterizing a biological vision system. In other words, it is necessary to know what information a living being needs and how this information is encoded in the patterns of light projected onto the retina. For this purpose, it is necessary to develop a theory that explains how information is extracted from the retina (input) and, ultimately, how this information is propagated to the neurons of the brain. Marr [1] was the first to propose a paradigm that explains how the biological visual system processes the complex input and output information for visual perception. The perception process includes three levels:

1. The level of computational theory. The relationship between the quantities to be computed (data) and the observations (images) must be developed through a rigorous physical-mathematical approach. Once this computational theory is developed, it must be understood whether or not the problem has at least one solution. The aim is to replace ad hoc solutions and heuristic approaches with methodologies based on a solid and rigorous (physical-mathematical) formulation of the problems.
2. The level of algorithms and data structures. After the computational theory is completed, the appropriate algorithms are designed which, when applied to the input images, will produce the expected results.
3. The implementation level. After the first two levels are developed, the algorithms can be implemented with adequate hardware (using serial or parallel computing architectures), thus obtaining an operational vision machine.

Some considerations on the Marr paradigm:

(a) The level of algorithms in the Marr model tacitly includes the level of robustness and stability.
(b) In developing a component of the vision system, the three levels (computational theory, algorithms, and implementation) are often considered, but when activating the vision process on real data (images) it is possible to obtain absurd results for having neglected (or not well modeled), for example, at the computational level, the noise present in the input images.
(c) It is therefore necessary to introduce stability criteria for the algorithms, assuming, for example, a noise model.


(d) Another source of noise is given by the use of uncertain intermediate data (such as points, edges, lines, contours, etc.); in this case, stability can be achieved using statistical analysis.
(e) The analysis of the stability of the results is a current and future research theme, which will avoid the problem of algorithms that have an elegant mathematical basis but do not operate properly on real data.

The bottom-up approaches are potentially more general, since they operate considering only the information extracted from the 2D images, together with the calibration data of the acquisition system, to interpret the 3D objects of the scene. Bottom-up vision systems are therefore oriented toward more general applications. The top-down approaches assume the presence of particular objects or classes of objects that are localized in the 2D images, and the problems are solved in a more deterministic way. Top-down vision systems are oriented to solving more specific applications rather than providing a more general vision theory. Marr observes that the complexity of vision processes imposes a sequence of elementary processes to progressively improve the geometric description of the visible surface. From the pixels, it is necessary to delineate the surface and derive some of its characteristics, for example, the orientation and the depth with respect to the observer, and finally arrive at the complete 3D description of the object.

4.4 The Fundamentals of Marr's Theory

The input to the biological visual system is the image formed on the retina, seen as a matrix of intensity values of the light reflected by the physical structures of the observed external environment. The goal of the first stages of vision (early vision) is to create, from the 2D image, a description of the physical structures: the shape of the surfaces and of the objects, their orientation, and their distance from the observer. This goal is achieved by constructing a distinct number of representations, starting from the variations in light intensity observed in the image. The first of these representations is called the Primal Sketch. This primary information describes the variations in intensity present in the image and makes some global structures explicit. This first stage of the vision process locates the discontinuities of light intensity in correspondence with the edge points, which often coincide with the geometric discontinuities of the physical structures of the observed scene. The primal sketches correspond to the edges and small homogeneous areas present in the image, including their location, orientation, and whatever else can be determined. From this primary information, by applying adequate grouping algorithms, more complex primary structures (contours, regions, and texture) can be derived, forming the so-called full primal sketch. The ultimate goal of the early vision processes is to describe the surface and shape of objects with respect to the observer, i.e., to produce a representation of the world observed in
the reference system centered with respect to the observer (viewer-centered). In other words, the early vision process produces, in this viewer-centered reference system, a representation of the world called the 2.5D sketch. This information is obtained by analyzing the information on depth, movement, and derived shape, starting from the primal sketch structures. The extracted 2.5D structures describe the structures of the world with respect to the observation point. A vision system, however, must fully recognize an object. In other words, it is necessary that the 2.5D viewer-centered structures be expressed in the object reference system (object-centered) and not referred to the observer. Marr indicates this level of representation of the world as the 3D model representation. In this process of forming the 3D model of the world, all the primary information extracted in the early stages of vision is used, proceeding according to a bottom-up model based on general constraints of 3D reconstruction of objects, rather than on specific hypotheses about the object. The Marr paradigm foresees a distinct number of levels of representation of the world, each of which is a symbolic representation of some aspects of the information derived from the retinal image. Marr's paradigm sees the vision process as based on a computational model consisting of a set of symbolic descriptions of the input image. The process of recognizing an object, for example, can be considered achieved when one, among the many descriptions derived from the image, is comparable with one of those memorized, which constitutes the representation of a particular class of known objects. Different computational models have been developed for the recognition of objects, and their diversity is based on how concepts are represented as activities distributed over different elementary processing units. Some algorithms have been implemented, based on the neural computational model, to solve the problem of depth perception and object recognition.
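Purely as an illustration of the hierarchy just described, the sketch below encodes the three levels of representation as simple data structures. The field names are hypothetical and are not part of Marr's formulation; they only make explicit which information is image-centered, which is viewer-centered, and which is object-centered.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class PrimalSketch:
    """Image-centered: intensity discontinuities and elementary tokens."""
    edges: np.ndarray             # binary map of edge / zero-crossing points
    blobs: List[Tuple[int, int]]  # positions of small homogeneous areas

@dataclass
class Sketch2_5D:
    """Viewer-centered: visible surface described relative to the observer."""
    depth: np.ndarray    # distance of each visible point from the viewer
    normals: np.ndarray  # local surface orientation, shape (H, W, 3)

@dataclass
class Model3D:
    """Object-centered: volumetric description independent of the viewpoint."""
    parts: List[str] = field(default_factory=list)  # e.g., names of volumetric primitives
```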

4.4.1 Primal Sketch From the 2D image, the primary information is extracted or primal sketch which can be any elementary structure such as edges, straight and right angle edges (corner), texture, and other discontinuities present in the image. These elementary structures are then grouped to represent higher-level physical structures (contours, parts of an object) that can be used later to provide 3D information of the object, for example, the superficial orientation with respect to the observer. These primal sketch structures can be extracted from the image at different geometric resolutions just to verify their physical consistency in the scene. The primal sketch structures, derived by analyzing the image, are based on the assumption that there is a relationship between zones in the image where the light intensity and the spectral composition varies, and the areas of the environment where the surface or objects are delimited (border between different surfaces or different objects). Let us immediately point out that this relationship is not univocal, and it is not simple. There are reasonable considerations for not assuming that any variation in luminous or spectral intensity in the image corresponds to the boundary or edge of an object or a surface of the scene. For example, consider an environment consisting

Fig. 4.1 Brightness fluctuations even in homogeneous areas of the image, as can be seen from the graph of the intensity profile relative to line 320 of the image

of objects with matte surfaces, i.e., such that the light reflected in all directions from every point on the surface has the same intensity and spectral composition (Lambertian model, described in Chap. 2 Vol. I). In these conditions, the boundaries of an object or the edges of a surface will emerge in the image in correspondence with the variations of intensity. In reality, these discontinuities in the image also emerge for other causes, for example, due to the edges of a shadow that falls on an observed surface. The luminous intensity varies even in the absence of geometric discontinuities of the surface, as a consequence of the fact that the intensity of the reflected light is a function of the orientation of the surface with respect to the direction of the incident light. The intensity of the reflected light has its maximum value if the surface is perpendicular to the incident light, and decreases as the surface rotates toward other directions. The luminous intensity also changes in the image in the presence of a curved surface and in the presence of texture, especially in natural scenes. In the latter case, the intensity and spectral composition of the light reflected in a particular direction by a textured surface varies locally and consequently generates a spatial variation of luminous intensity within the particular region of the image corresponding to that surface. To get an idea of the complex relationship between a natural surface and the reflected light intensity resulting in the image, see Fig. 4.1, which shows how the brightness varies along a line of the image of a natural scene. Brightness variations are observed in correspondence with variations of the physical structures (contours and texture) but also in correspondence with the background and with homogeneous physical structures. Brightness fluctuations are also due to the texture present in some areas and to the change in orientation of the surface with respect to the direction of the light source. To address the problem of the complex relationship between the structures of natural scenes and the structures present in the image, Marr proposes a two-step approach.


Fig. 4.2 Results of the LoG filter applied to the Koala image. The first line shows the extracted contours with the increasing scale of the filter (from left to right) while in the second line the images of the zero crossing are shown (closed contours)

In the first step, the image is processed making explicit the significant variations in brightness, thus obtaining what Marr calls a representation of the image in terms of raw primal sketch. In the second step, the edges are identified with particular algorithms that process the raw primal sketch information of the previous step to describe information and structures of a higher level, called Perceptual Chunks. Marr's approach has the advantage of being able to use the raw primal sketch information for other perceptual processes that operate in parallel, for example, to calculate depth or motion information. The first stages of vision are influenced by the noise of the image acquisition system. In Fig. 4.1 it can be observed how brightness variations are present in the image also in correspondence with uniform surfaces. These variations at different scales are partly caused by noise. In Chap. 4 Vol. II we have seen how it is possible to reduce the noise present in the image by applying an adequate smoothing filter that does not alter the significant structures of the image (attenuation of the high frequencies corresponding to the edges). Marr and Hildreth [2] proposed an original algorithm to extract the raw primal sketch information by processing images of natural scenes. The algorithm, described in Sect. 1.13 Vol. II, is also called the Laplacian of Gaussian (LoG) filter operator, used for edge extraction and zero crossing. In the context of extracting the raw primal sketch information, the LoG filter is used with Gaussians of different width to obtain raw primal sketches at different scales for the same image, as shown in Fig. 4.2. It can be observed how the various maps of the raw primal sketch (zero crossings in this case) represent the physical structures at different scales of representation. In particular, a very narrow Gaussian filter highlights the noise together with the significant variations in brightness (small variations in brightness are due to noise and physical structures), while, as a wider filter is used, only the zero crossings corresponding to significant variations remain (although they are not accurately localized), to be associated with the real structures of the scene, with the almost complete elimination of noise. Marr and Hildreth proposed to combine the various raw primal sketch maps extracted at different scales to obtain a primal sketch that is more robust than the original image, with contours, edges, and homogeneous areas (see Fig. 4.3).

Fig. 4.3 Laplacian filter applied to the image of Fig. 4.1

Marr and Hildreth assert that at least in the early stages of biological vision the LoG filter is implemented for the extraction of the zero crossings at different filtering scales. The biological evidence for the theory of Marr and Hildreth has been demonstrated by several researchers. In 1953 Kuffler discovered the spatial organization of the receptive fields of retinal ganglion cells (see Sect. 3.2 Vol. I). In particular, Kuffler [3] discovered the effect on ganglion cells of a luminous spot and observed concentric receptive fields with circular symmetry, with a central excitatory region (sign +) and a surrounding inhibitory one (see Fig. 4.4). Some ganglion cells instead presented receptive fields with concentric regions excited with opposite sign. In 1966 Enroth-Cugell and Robson [4] discovered, in relation to temporal response properties, the existence of two types of ganglion cells, called X and Y cells. The X cells have a linear response, proportional to the difference between the intensity of light that affects the two areas, and this response is maintained over time. The Y cells do not have a linear response and are transient. This cellular distinction is also maintained up to the lateral geniculate nucleus of the visual cortex. Enroth-Cugell and Robson showed that the intensity contribution for both areas is weighted according to a Gaussian distribution, and the resulting receptive field is described as the difference of two Gaussians (called Difference of Gaussians (DoG) filter, see Sect. 1.14 Vol. II).
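The sketch below makes the above operational: it computes ∇²G ∗ I at a few scales with SciPy, marks the zero crossings as sign changes between neighboring pixels, and also builds the classical DoG approximation of the LoG using two Gaussians whose widths differ by a factor of about 1.6. It is only an illustrative sketch of the idea, not the original Marr–Hildreth implementation; the threshold and the chosen scales are arbitrary.

```python
import numpy as np
from scipy import ndimage

def zero_crossings(response, eps=1e-6):
    """Mark pixels where the filtered image changes sign between
    4-connected neighbors (a simple zero-crossing detector)."""
    pos = response > eps
    neg = response < -eps
    zc = np.zeros(response.shape, dtype=bool)
    zc[:, :-1] |= pos[:, :-1] & neg[:, 1:]   # sign change along rows
    zc[:, 1:]  |= pos[:, 1:]  & neg[:, :-1]
    zc[:-1, :] |= pos[:-1, :] & neg[1:, :]   # sign change along columns
    zc[1:, :]  |= pos[1:, :]  & neg[:-1, :]
    return zc

def raw_primal_sketch(image, sigmas=(1.0, 2.0, 4.0)):
    """Zero-crossing maps of the LoG-filtered image at several scales."""
    image = image.astype(float)
    return {s: zero_crossings(ndimage.gaussian_laplace(image, sigma=s))
            for s in sigmas}

def dog(image, sigma, k=1.6):
    """Difference-of-Gaussians approximation of the LoG at scale sigma."""
    image = image.astype(float)
    return ndimage.gaussian_filter(image, sigma) - ndimage.gaussian_filter(image, k * sigma)
```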


Fig. 4.4 Receptive fields of the center-ON and center-OFF ganglion cells. They have the characteristic of an almost circular receptive field divided into two parts, an internal area called center and an external one called periphery. Both respond well to changes in lighting between the center and the periphery of their receptive fields. They are divided into two classes, center-ON and center-OFF ganglion cells, based on their different responses when excited by a light beam. As shown in the figure, the former (center-ON) respond with excitement when the light is directed to the center of the field (with a spotlight or with light that illuminates the entire center), while the latter (center-OFF) behave in the opposite way, that is, they are very little excited. Conversely, if the light beam affects the peripheral part of both, it is the center-OFF cells that respond very well (they generate a short excitable membrane electrical signal) while the center-ON cells are inhibited. Ganglion cells respond primarily to differences in brightness, making our visual system sensitive to local spatial variations rather than to the absolute magnitude of the light affecting the retina

From this, it follows that the LoG operator can be seen as a function equivalent to the DoG, and the output of the ∇²G ∗ I operator is analogous to that of the retinal X cells and of the cells of the lateral geniculate nucleus (LGN). Positive values of ∇²G ∗ I correspond to the central zone of the X cells and negative values to the surrounding concentric zone. With this hypothesis the problem arises that both positive and negative values must be available for the determination of the zero crossings in the image ∇²G ∗ I. This would not be possible, since nerve cells cannot operate with negative values in their responses to calculate the zero crossings. Marr and Hildreth explained, for this reason, the existence of cells in the visual cortex that are excited with opposite signs, as shown in Fig. 4.4. This hypothesis is weak if we consider the inadequacy of the concentric areas of the receptive fields of the X cells for the accurate calculation of the function ∇²G ∗ I. With the interpretation of Marr and Hildreth, cells with concentric receptive fields cannot determine the presence of edges in the classical way, as done by all the other edge extraction algorithms (see Chap. 1 Vol. II). A plausible biological rule for extracting the zero crossings from the X cells would be to find adjacent active cells that operate with positive and negative values in the central receptive area, respectively, as shown in Fig. 4.5. The zero crossings are determined with the logical AND of two cells, center-ON and center-OFF (see Fig. 4.5a). With this idea it is also possible to extract segments of zero crossings by organizing two ordered columns of cells with receptive fields of opposite sign (see Fig. 4.5b). Two X cells of opposite sign are connected through a logical AND connection, producing an output signal only if the two cells are active, indicating the presence of a zero crossing between the cells. The biological evidence for the LoG operator explains some features of the visual cortex and how it works, at least in the early stages of visual perception, for the detection of segments. It is not easy to explain how the nervous system combines this elementary information (zero crossings), generated by the ganglion cells, to obtain the information called primal sketch. Another important feature of the theory of Marr and Hildreth is that it provides a plausible explanation for the existence of cells, in the visual cortex, that operate at different spatial frequencies, in a similar way to the filter ∇²G when the width of the filter is varied through the parameter σ of the Gaussian. Campbell and Robson [5] in 1968 discovered with their experiments that the visual input is processed in multiple independent channels, each of which analyzes a different band of spatial frequencies.

Fig. 4.5 Functional scheme proposed by Marr and Hildreth for the detection of zero crossings by cells of the visual cortex. a Overlapping receptive fields of two cells, center-ON and center-OFF, of the LGN; if both are active, a zero crossing ZC is located between the two cells and is detected by them if they are connected with a logical AND conjunction. b Several AND logic circuits associated with pairs of center-ON and center-OFF cells (operating in parallel) are shown, which detect an oriented segment of zero crossings
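The ON/OFF scheme of Fig. 4.5 can be mimicked numerically, purely as a loose computational analogy of the biological circuit: the LoG response is split into a rectified center-ON channel (positive part) and a center-OFF channel (negative part), and a zero crossing is asserted wherever an active ON unit is adjacent to an active OFF unit, i.e., a logical AND of the two channels. The activation threshold below is arbitrary.

```python
import numpy as np
from scipy import ndimage

def on_off_zero_crossings(image, sigma=2.0, threshold=0.5):
    """Zero crossings as the logical AND of adjacent center-ON and
    center-OFF responses (border wrap-around ignored for simplicity)."""
    response = ndimage.gaussian_laplace(image.astype(float), sigma=sigma)
    on = response > threshold        # center-ON units: strong positive response
    off = response < -threshold      # center-OFF units: strong negative response
    # an ON unit with at least one 4-connected OFF neighbour marks a zero crossing
    off_near = (np.roll(off, 1, axis=0) | np.roll(off, -1, axis=0) |
                np.roll(off, 1, axis=1) | np.roll(off, -1, axis=1))
    return on & off_near
```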

4.4.2 Toward a Perceptive Organization

Following Marr's theory, in the early stages of vision the first elementary information, called primal sketch, is extracted. In reality, the visual system uses this elementary information and organizes it at a higher level to generate more important perceptual structures called chunks. This leads to a perception of the world made not of elementary structures (borders and homogeneous areas), but of 3D objects whose visible surface is well reconstructed. How this higher-level perceptual organization is realized has been studied by several researchers in the nineteenth and twentieth centuries. We can immediately affirm that, while the retina has been studied in all its characteristics, along with how the signals are transmitted through the optic nerve to the visual cortex, for the latter, called Area 17, the mechanisms of perception underlying the reconstruction of objects and of their motion are not yet clear. Hubel and Wiesel [6] have shown that the signals coming from the retina through the fibers of the optic nerve arrive in the fourth layer of Area 17, passing through the lateral geniculate nucleus (see Fig. 3.12 of Vol. I). In this area of the visual cortex it is hypothesized that the
retinal image is reconstructed, maintaining the information from the first stages of vision (first and second derivatives of the luminous intensity). From this area of the visual cortex, different information is transmitted to the various layers of the visual cortex in relation to the tasks of each layer (motor control of the eyes, perception of motion, perception of depth, integration of different primal sketches to generate chunks, etc.).

4.4.3 The Gestalt Theory

The in-depth study of human perception began in the nineteenth century, when psychology was established as a modern autonomous discipline detached from philosophy [7]. The first psychologists (von Helmholtz [8], influenced by J. Stuart Mill) studied perception based on associationism. It was assumed that the perception of an object can be conceived in terms of a set of sensations that emerge with past experience, and that the sensations that make up the object have always presented themselves associated to the perceiving subject. Helmholtz asserted that the past perceptive experience of the observer implies that an unconscious inference automatically links the dimensional aspects of the perception of the object, taking into account its distance. In the early twentieth century, the association theory was opposed by a group of psychologists (M. Wertheimer 1923, W. Köhler 1947, and K. Koffka 1935) who founded the school of Gestalt psychology (i.e., the psychology of form), which gave the most important contributions to the study of perception. The Gestaltists argued that it is wrong to say that perception can be seen as a sum of stimuli linked by associative laws based on past experience (as the associationists thought). At the base of the Gestalt theory is this assumption: we do not perceive sums of stimuli, but forms, and the whole is much more than the sum of the components that compose it. In Fig. 4.6 it can be observed how each of the represented forms is perceived as three squares, regardless of the fact that its components are completely different (stars, lines, circles). The Gestalt idea can be stated as: the observed whole is greater than the sum of its elementary components. The natural world is perceived as composed of discrete objects of various sizes that appear well highlighted with respect to the background.

Fig. 4.6 We perceive three square shapes despite their being constructed from completely different graphic components


Fig. 4.7 Ambiguous figures that produce different perceptions. a Figure with two possible interpretations: two human face profiles or a single black vase; b perception of a young or old woman, designed by William Ely Hill 1915 and reported in a paper by the psychologist Edwin Boring in 1930

Even if the surfaces of the objects have a texture, there is no difficulty in perceiving the contours of the objects, unless they are somehow camouflaged, and generally, the homogeneous areas belonging to the objects (foreground ) with respect to the background. Some graphic and pictorial drawings, made by man can present some ambiguity when they are interpreted and their perception can lead to errors. For example, this happens to distinguish from the background the objects present in famous figures (see Fig. 4.7 that highlight the perceptual ambiguity indicated above). In fact, figure (a), conceived by Gestalt psychologist Edgar Rubin, can be interpreted by perceiving the profiles of two human figures or perceived as a single black vase. It is impossible to perceive both human figures and the vase simultaneously. Figure (b) on the other hand can be interpreted by perceiving an old woman or a young woman. Some artists have produced paintings or engravings with ambiguities between the background and the figure represented, based on the principles of perceptive reversibility. Examples of reversible perception are given by the Necker cube in 1832 (see Fig. 4.8), which consists of a two-dimensional representation of a threedimensional wire-frame cube. The intersections between two lines do not show which line is above the other and which is below, so the representation is ambiguous. In other words, it is not possible to indicate which face is facing the observer and which is behind the cube. Looking at the cube (figure a) or corner (figure d) for a long time they appear alternately concave and convex in relation to the perceptive reactivity of a person. Some first perceive the lower left face of the cube as the front face (figure (b)) facing the observer or alternatively it is perceived as further back as the lower rear face of the cube (figure (c)). In a similar way, the same perceptive reversibility

Fig. 4.8 a The Necker cube is a wire-frame drawing of a cube with no visual cues (like depth or orientation). b One possible interpretation of the Necker cube. It is often claimed to be the most common interpretation because people view objects from above (seeing the lower left face as being in front) more often than from below. c Another possible interpretation. d The same perceptive reversibility occurs by observing a corner of the cube that appears alternately concave and convex

Fig. 4.9 Examples of stable real figures: a hexagonal figure which also corresponds to the two-dimensional projection of a cube seen from a corner; b predominantly stable perception of overlapping circular disks

occurs by observing a corner of the cube that appears alternately concave and convex (figure (d)). Several psychologists have tried to study what are the principles that determine the ambiguities in the perception of these figures through continuous perceptual change. In the examples shown, the perceptible data (the objects) are invariant, that is, they remain the same, while, during the perception, only the interpretation between background and objects varies. It would seem that the perceptual organization was of the top-down type. High-level structures of perceptive interpretation seem to condition and guide continuously, low-level structures extracted from image analysis. Normally in natural scenes and in many artificial scenes, no perceptual ambiguity is presented. In these cases, there is a stable interpretation of the components of the scene and its organization. As an example of stable perception we have Fig. 4.9a, which seen individually by several people is perceived as a hexagon, while if we remember the 3D cube of Fig. 4.8a the Fig. 4.9 also represents the figure of the cube seen by an observer positioned on a corner of the cube itself. Essentially, perceptual ambiguity does not occur as the hexagonal Fig. 4.9 is a correct and real two-dimensional projection of the 3D cube seen from one of its corners. Also Fig. 4.9b gives a stable perception of a set of overlapping circular disks, instead of being interpreted, alternatively, as disks interlocked with each other (for example, thinking that two disks have a circular notch for having removed a circular portion).

Fig. 4.10 The elements of visual stimulation, close to each other, tend to be perceived as components of the same object: a columns; b lines; c lattice


Gestalt psychologists have formulated some ideas about the perceptive organization to explain why different perceptions can arise when observing the same figure. Some principles of the Gestalt theory were based mainly on the idea of grouping elementary regions of the figures to be interpreted, and other principles were based on the segregation of the extracted objects with respect to the background. In the first case, it is fundamental to group together the elementary regions of a figure (object) to obtain a larger region which constitutes, as a whole, the figure to be perceived. Some of the perceptive principles of the Gestalt theory are briefly described below.

4.4.3.1 Proximity A basic principle of the perceptive organization of a scene is the proximity of its elementary structures. Elementary structures of the scene, which are close to each other, are perceived grouped. In Fig. 4.10a we observe vertical and horizontal linear structures since the vertical spacing of the elementary structures (points) is smaller, compared to the horizontal one (vertical structures), while the horizontal spacing is smaller than the vertical one when observing the horizontal structures (see Fig. 4.10b). If they are equally spaced horizontally and vertically (see Fig. 4.10c) the perceived aggregation is a regular grid of points. An important example of the principle of proximity is in the visual perception of depth as we will describe later in the paragraph on binocular vision.

4.4.3.2 Similarity Elementary structures of the scene that appear similar tend to be grouped together from the perceptive point of view (see Fig. 4.11). The regions appear distinct because, the visual stimulus elements that are similar, are perceived grouped and components of the same grouping (figure (a)). In Fig. 4.11b vertical linear structures are perceived although, due to the proximity principle, they should be perceived as horizontal structures. This shows that the principle of similarity can prevail over information perceived by the principle of proximity. There is no strict formulation of how structures are aggregated with the principle of similarity.


Fig. 4.11 Similar elements of visual stimulation tend to be perceived grouped as components of the same object: a two distinct perceived regions consisting of groups of similar elements; b perceive the vertical structures because the dominant visual stimulus is the similarity, between the visual elements, rather than the proximity

4.4.3.3 Common Fate Structures that appear to move together (in the same direction) are perceived as grouped and are seen as a single region. In essence, perception associates the common movement with a single stimulus. In nature, many animals camouflage their apparent texture against the background, but as soon as they move their structure emerges and they become easily visible through the perception of motion. For example, birds can be distinguished from the background as a single swarm because they are moving in the same direction and at the same speed, even when each is viewed from a distance as a single moving point (see Fig. 4.12). The set of moving points seems to be part of a single entity. Similarly, two flocks of birds can cross in the observer's field of view, and the observer will continue to perceive them as separate swarms because each bird has a direction of motion common to its own swarm. The perception of motion becomes strategic in environmental contexts where the visual stimuli linked to the color or the contour of the objects are missing (insufficient lighting). In animals, this ability has probably developed greatly from the evolutionary need to distinguish a camouflaged predator from its background.

Fig. 4.12 The elements of visual stimulation that move, in the same direction and speed, are perceived as components of the same whole: in this case a swarm of birds


Fig. 4.13 Looking at figure a we have the perception of two curves that intersect at X, by the principle of continuity, and not of two separate structures. The same happens for figure b, where in this case the mechanisms of perception combine perceptual stimuli of continuity and proximity

Fig. 4.14 Closure principle. We tend to perceive a form with closed margins even if physically nonexistent

4.4.3.4 Continuity As shown in Fig. 4.13a, an observer tends to perceive two intersecting curves at the point X , instead of perceiving two separate irregular graphic structures that touch at the point X . The Gestalt theory justifies this type of perception in the sense that we tend to preserve the continuity of a curvilinear structure instead of producing structures with strong discontinuities and graphic interruptions. Some mechanisms of perception combine the Gestalt principles of proximity and continuity. This explains why completely dissimilar elementary structures can be perceived as belonging to the same graphic structure (see Fig. 4.13b).

4.4.3.5 Closure The perceptive organization of different geometric structures, such as shapes, letters, pictures, etc., can generate closed figures as shown in Fig. 4.14 (physically nonexistent) thanks to inference with other forms. This happens even when the figures are partially overlapping or incomplete. If the closure law did not exist, the image would represent an assortment of different geometric structures with different lengths, rotations, and curvatures, but with the law of closure, we perceptually combine the elementary structures into whole forms.


Fig. 4.15 Principle of symmetry. a Symmetrical elements (in this case with respect to the central vertical axis) are perceived combined and appear as a coherent object. b This does not happen in a context of asymmetric elements where we tend to perceive them as separate

4.4.3.6 Symmetry The principle of symmetry indicates the tendency to perceive symmetrical elementary geometric structures in preference to asymmetric ones. Visual perception favors the connection of symmetrical geometric structures, forming a region around the point or axis of symmetry. In Fig. 4.15a it can be observed how symmetrical elements are easily connected perceptively, favoring a symmetrical form rather than being considered separate, as happens instead for figure b. Symmetry plays an important role in the perceptual organization by combining visual stimuli in the most regular and simple way possible. Therefore, the similarities between symmetrical elements increase the probability that these elements are grouped together, forming a single symmetrical object.

4.4.3.7 Relative Dimension and Adjacency When objects are not equal, perception tends to bring out the smaller object (smallness) contrasting it with the background. The object appears with a defined extension while the background does not. When the object is easily recognizable as a known object, the rest becomes all background. As highlighted in Fig. 4.16a, the first object is smaller in size than the rest (the background seen as a white cross or a helix, delimited by the circular contour) and is perceived as a cross or a black helix. This effect is amplified, as shown in Fig. 4.16b, since the white area surrounding (surroundedness) the black area tends to be perceived as a figure for the principle of surroundedness but has a smaller dimension. If we orient Fig. 4.16a in such a way that the white area is arranged along the horizontal and vertical axes, as highlighted in Fig. 4.16c, then it is easier to perceive it despite being larger than the black one. There seems to be a preference to perceive figures arranged in horizontal and vertical positions over oblique positions. As a further illustration of the Gestalt principle smallness in Fig. 4.16d, Rubin’s figure is shown, highlighting that due to the small relative dimension principle the ambiguity in perceiving the vase rather than the faces or vice versa can be eliminated. We also observe how the dominant perception is the black region that also emerges from the combination of perceptive principles of relative dimension, orientation, and symmetry.


Fig. 4.16 Principle of relative dimension and adjacency. a The dominant perception, aided by the contour circumference, is that of the black cross (or helix) because it is smaller than the possible white cross. b The perception of the black cross becomes even more evident as it becomes even smaller. c After the rotation of the figure the situation is reversed: the perception of the white cross is dominant because the vertical and horizontal parts are better perceived than the oblique ones. d The principle of relative size eliminates the ambiguity of the vase/face figure shown in Fig. 4.7; in fact, in the two vertical images on the left the face becomes background and the vase is easily perceived, while, vice versa, in the two vertical images on the right the two faces are perceived and the vase is background

Fig. 4.17 Example of a figure perceived as E, based on a priori knowledge, instead of perceiving three broken lines

4.4.3.8 Past Experience A further heuristic in perceptual organization, suggested by Wertheimer, concerns the role of past experience or of the meaning of the object under observation. Past experience implies that under some circumstances visual stimuli are categorized according to what has already been seen. This can facilitate the visual perception of known objects with respect to objects never seen before, all other principles being equal. For example, past experience can facilitate the perception of the object represented in Fig. 4.16a as a cross or a helix. From the point of view of meaning, knowing the letters of the alphabet, the perception of Fig. 4.17 leads to the letter E rather than to single broken lines.

4.4.3.9 Good Gestalt or Prägnanz All the Gestalt principles of perceptual organization lead to the principle of good form, or Prägnanz, even if there is no strict definition of it. Prägnanz is a German word that has the meaning of pithiness and in this context implies the perceptive ideas of salience, simplicity, and orderliness. Humans perceive and interpret ambiguous or complex objects as the simplest forms possible. The form that is perceived is as good as the conditions allow. In other words, what is perceived, or what determines the perceptive stimuli that make a form appear, is intrinsically the characteristic of Prägnanz, or good form, that it possesses in terms of regularity, symmetry, coherence, homogeneity, simplicity, conciseness, and compactness. These characteristics increase the probability that the various stimuli are favored in the perception process. Thanks to these properties the forms take on a good shape, and certain irregularities or asymmetries are attenuated or eliminated, also due to the stimuli deriving from the spatial relations of the elements of the field of view.

4.4.3.10 Comments on the Gestalt Principles The dominant idea behind the Gestalt perceptive organization was expressed in terms of certain force fields that were thought to operate inside the brain. In the Gestalt idea, the doctrine of isomorphism was considered, according to which, for every perceived sensory experience, there corresponds a brain event that is structurally similar to that perceptive experience. For example, when a circle is perceived, a perceptual structure with circular traces is activated in the brain. Force fields came into play to produce results that were as stable as possible. Unfortunately, no rigorous demonstration of the evidence for force fields was given, and the Gestalt theory was abandoned, leaving only some perceptive principles but without providing a valid model of the perceptual process. Today the Gestalt principles are considered vague and inadequate to explain perceptual processes. Subsequently, some researchers (Julesz, Olson and Attneave, Beck) reconsidered some of the Gestalt principles in order to give them a measure of validity. They addressed the problem of grouping perceived elementary structures. In other words, the problem consists in quantifying the similarity of the elementary structures before they are grouped together. One strategy is to evaluate which significant variables, characterizing the elementary structures, determine their grouping by similarity. This can be explored by observing how two different regions of elementary structures (patterns) are perceptually segregated (isolated) from each other. The logic of this reasoning leads us to think that if the patterns of the two regions are very coherent with each other, due to the perceptual similarity between them, the two regions will be less distinguishable, without presenting a significant separation boundary. The problem of grouping by similarity has already been highlighted in Fig. 4.11 in the Gestalt context. In [9], experimental results on human subjects are reported to assess the perceptive capacity of distinguishing homogeneous regions where the visual stimuli are influenced by different types of features and by their different orientations. Figure 4.18a shows a circular field of view where three different regions are perceived, characterized only by the different orientation of the patterns (segments of fixed length). In figure b, some curved segments are randomly located and oriented in the field of view, and one can just see parts of homogeneous regions. In figure c the patterns are more complex but do not present a significantly different orientation


Fig. 4.18 Perception of homogeneous and nonhomogeneous regions. a In the circular field of view three different homogeneous regions are easily visible characterized by linear segments differently oriented and with a fixed length. b In this field of view some curved segments are randomly located and oriented, and one can just see parts of homogeneous regions. c Homogeneous regions are distinguished, with greater difficulty, not by the different orientation of the patterns but by the different types of the same patterns

between the regions. In this case, it is more difficult to perceive different regions and the process of grouping by similarity is harder. It can be concluded that the orientation of the patterns is a significant parameter for the perception of the elementary structures of different regions. Julesz [10] (1965) emphasized that in the process of grouping by similarity, vision occurs spontaneously through a pre-attentive perceptual process, which precedes the identification of elementary structures and objects in the scene. In this context, a region with a particular texture is perceived by carefully examining and comparing the individual elementary structures, in analogy to what happens when one wants to identify a camouflaged animal through a careful inspection of the scene. The hypothesis is that the pre-attentive grouping mechanism is implemented in the early stages of the vision process (early stage), and there is biological evidence that the brain responds when texture patterns differ significantly in orientation between the central and external areas of their receptive fields. Julesz extended the study of grouping by similarity to other significant variables characterizing the patterns, such as brightness and color, which are indispensable for the perception of complex natural textures. He showed that two regions are perceived as separate if they have a significant difference in brightness or color. These new parameters, color and brightness, used for grouping, seem to influence the perceptive process by operating on mean values rather than on pointwise differences of brightness and color. A region whose one half is dominated by patterns of black squares and dark gray levels, and whose other half is dominated by patterns of white squares and light gray levels, is perceived as two different regions thanks to the perception of the boundary between its subregions.
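The role of orientation and mean brightness as grouping variables can be illustrated numerically. The sketch below is our own illustrative code, not the procedure used in the cited experiments: it builds two synthetic patches of short oriented segments with the same dot density and compares their mean brightness with a histogram of local gradient orientations. The patches match in average brightness but differ sharply in the orientation statistic, which is the kind of difference that supports perceptual segregation.

```python
import numpy as np

def oriented_segment_patch(size, angle_deg, n_segments=40, seg_len=7, seed=None):
    """Render a square patch filled with short segments at a given orientation."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size))
    a = np.deg2rad(angle_deg)
    dx, dy = np.cos(a), np.sin(a)
    for _ in range(n_segments):
        x0, y0 = rng.uniform(seg_len, size - seg_len, 2)
        for t in np.linspace(-seg_len / 2, seg_len / 2, 2 * seg_len):
            r, c = int(round(y0 + t * dy)), int(round(x0 + t * dx))
            img[r, c] = 1.0
    return img

def mean_brightness(img):
    """First-order statistic: nearly identical for the two patches below."""
    return img.mean()

def orientation_histogram(img, bins=8):
    """Coarse orientation statistic: histogram of gradient orientations (mod 180 degrees)."""
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

# Two regions with the same dot density but different dominant orientation
region_a = oriented_segment_patch(128, 20, seed=0)
region_b = oriented_segment_patch(128, 110, seed=1)

print("mean brightness:", mean_brightness(region_a), mean_brightness(region_b))
print("orientation histogram difference (L1):",
      np.abs(orientation_histogram(region_a) - orientation_histogram(region_b)).sum())
```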

Fig. 4.19 Perception of regions with identical brightness, but separable due to the different spatial density of the patterns between regions (a) and (b)


The perceptual process discriminates the regions based on the average brightness of the patterns in the two subregions, not on the details of how these average brightness values are composed. The same holds for separating regions with differently colored patterns. Julesz also used parameters based on the granularity, i.e., the spatial distribution of the patterns, for the perception of different regions. Regions that have an identical average brightness but a different spatial arrangement of the patterns exhibit an evident separation boundary and are perceived as different regions (see Fig. 4.19). In analogy with what had been pointed out by Beck and Attneave, Julesz also found the orientation of the patterns to be important for the perception of different regions. In particular, he studied the perception process based on grouping and evaluated in formal terms the statistical properties of the patterns to be perceived. At first, Julesz argued that two regions cannot be perceptually separated if their first- and second-order statistics are identical. The statistical properties were derived mathematically from Markovian processes in the one-dimensional case, or by applying random geometry techniques to generate two-dimensional images. Differences in the first-order statistical properties of the patterns reveal differences in their global brightness (for example, a difference in the average brightness of the regions). Differences in the second-order statistical properties instead reveal differences in the granularity and orientation of the patterns. Julesz's initial approach, which attempted to model in mathematical terms the significant variables that determine grouping by similarity, was challenged when applied to concrete artificial examples, and Julesz himself revised it into the theory of textons (the elementary structures of texture). He believed that only a difference


in the textons, or in their density, can be detected in pre-attentive mode.¹ No positional information on neighboring textons is available without focused attention. He stated that the pre-attentive process operates in parallel, while the attentive process (i.e., focused attention) operates serially. The elementary structures of texture (textons) can be elementary homogeneous areas (blobs) or rectilinear segments (characterized by parameters such as aspect ratio and orientation), and they correspond to the primal sketch representation proposed by Marr (studies on the perception of texture would in fact influence Marr himself). Figure 4.20 shows an image with a texture generated by two types of texton. Although the two textons (Fig. 4.20a and b) are perceived as different when viewed in isolation, they are actually structurally equivalent, having the same size, consisting of the same number of segments, and each having two segment terminations. The image in Fig. 4.20c is made of textons of type (b) representing the background and of textons of type (a) in the central area representing the region of attention. Since both textons are randomly oriented, and despite their structural equivalence, the contours of the two textures are not pre-attentively perceptible except with focused attention. It would seem that only the number of textons together with their shape characteristics are important, whereas their spatial orientation, together with closure and continuity characteristics, does not seem to matter. Julesz states that pre-attentive vision does not determine the position of the symbols but evaluates their number, density, or first-order statistics. In other words, Julesz's studies emphasize that the elementary structures of texture do not necessarily have to be identical to be grouped together. The grouping processes seem to operate between symbols of the same brightness, orientation, color, and granularity. These properties correspond to those that we know to be extracted in the first stages of vision, and

¹ Various studies have highlighted mechanisms of joint processing of perceptive stimuli, leading to the distinction between automatic attentive processes and controlled attentive processes. In the first case, the automatic processing (also called pre-attentive or unconscious) of the stimuli takes place in parallel, simultaneously processing different stimuli (relating to color, shape, movement, ...) without the intervention of attention. In the second case, the focused attention process requires a sequential analysis of the perceptive stimuli and the combination of similar ones. It is hypothesized that the pre-attentive process is used spontaneously as a first approach (hence the name pre-attentive) in the perception of familiar or undemanding things, while focused attention intervenes subsequently, in situations of uncertainty or in suboptimal conditions, sequentially analyzing the stimuli. In other words, the perception mechanism would initially tend toward a pre-attentive process, processing in parallel all the stimuli of the perceptual field, and, when faced with an unknown situation, would if necessary activate a focused attention process to adequately integrate the information of the individual stimuli. The complexity of the perception mechanisms has not yet been rigorously elucidated.


Fig. 4.20 Example of two non-separable textures made with structurally similar elements (textons) (a and b), equivalent in size, number of segments, and terminations. The region of interest, located at the center of the image (c) and consisting of textons of type (b), is difficult to perceive against the background (distractor component), made of textons of type (a), when both textons are randomly oriented

correspond to the properties extracted with the raw primal sketch description initially proposed by Marr. So far we have examined quantitatively some Gestalt principles, such as similarity and goodness of form, to justify perception based on grouping. All these studies used only artificial images. It remains to be shown whether this perceptive organization is still valid when applied to natural scenes. Several studies have demonstrated the applicability of the Gestalt principles to explain how some animals, with their particular texture, are able to camouflage themselves in their habitat, making it difficult for their predatory antagonists to perceive them. Various animals, prey and predators, have developed perceptual systems that differ from one another and that are also based on grouping by similarity or grouping by proximity.

4.4.4 From the Gestalt Laws to the Marr Theory We have seen how the Gestalt principles are useful descriptions for analyzing perceptual organization in the real world, but we are still far from having found an adequate theory that explains why these principles are valid and how perceptual organization is realized. The Gestalt psychologists themselves tried to answer these questions by imagining a model of the brain that provides force fields to characterize objects. Marr's theory attempts to give a plausible explanation of the process of perception, emphasizing how such principles can be incorporated into a computational model that detects structures hidden in uncertain data obtained from natural images.


Fig. 4.21 Example of a raw primal sketch map derived from the zero crossings calculated at different scales by applying the LoG filter to the input image. The primitives extracted (edges, blobs, bars, segments, ...) are characterized by attributes such as position, orientation, contrast, and dimension
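To make the construction of a raw primal sketch map like the one in Fig. 4.21 concrete, the following sketch applies the LoG filter at several scales and marks the zero crossings of the response. It is a minimal approximation of the Marr-Hildreth operator (no contrast thresholding, no grouping of crossings into edges, bars, or blobs), assuming the input image is available as a NumPy array; all function names are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(response, min_slope=0.0):
    """Mark pixels where the LoG response changes sign along rows or columns."""
    zc = np.zeros(response.shape, dtype=bool)
    sr = np.sign(response)
    # Sign change with respect to the right and bottom neighbours
    zc[:, :-1] |= (sr[:, :-1] * sr[:, 1:] < 0) & \
                  (np.abs(response[:, :-1] - response[:, 1:]) > min_slope)
    zc[:-1, :] |= (sr[:-1, :] * sr[1:, :] < 0) & \
                  (np.abs(response[:-1, :] - response[1:, :]) > min_slope)
    return zc

def raw_primal_sketch_maps(image, sigmas=(1.0, 2.0, 4.0)):
    """Zero-crossing maps of the LoG response at several scales."""
    image = image.astype(float)
    return {s: zero_crossings(gaussian_laplace(image, sigma=s)) for s in sigmas}

# Example with a synthetic image: a bright rectangle on a dark background
img = np.zeros((128, 128))
img[40:90, 30:100] = 1.0
for sigma, zc in raw_primal_sketch_maps(img).items():
    print(f"sigma={sigma}: {zc.sum()} zero-crossing pixels")
```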

Marr’s theory, with the first stages of vision, they are not understood as processes that directly solve the problem of segmentation or to extract objects as is traditionally understood (generally complex and ambiguous activity). The goal of the early vision process seen by Marr is to describe the surface of objects by analyzing the real image, even in the presence of noise, of complex elementary structures (textures) and of shaded areas. We have already described in Sect. 1.13 Vol. II the biological evidence for the extraction of zero crossings from the early vision process proposed by Marr-Hildreth, to extract the edges of the scene, called raw primal sketch. The primitive structures found in the raw primal sketch maps are edges, homogeneous elementary areas (blob), bars, terminations in general, which can be characterized with attributes such as orientation, brightness, length, width, and position (see Fig. 4.21). The map of raw primal sketch is normally very complex, from which it is necessary to extract global structures of higher level such as the surface of objects and the various textures present. This is achieved with the successive stages of the processes of early vision by recursively assigning place tokens (elementary structures such as bars, oriented segments, contours, points of discontinuity, etc.) to small structures of the visible surface or aggregation of structures, generating primal sketch maps. These structures place tokens are then aggregated together to form larger structures and if possible this is repeated cyclically. The place tokens structures can be characterized (at different scales of representation) by the position of the elementary structures represented (elementary homogeneous areas, edges or short straight segments or curved lines), or by the termination of a longer edge, a line or area elongated elementary, or from a small aggregation of symbols. The process of aggregation of the place tokens structures can proceed by clustering those close to each other on the basis of changes in spatial density (see Fig. 4.22) and proximity, or by curvilinear aggregation that produces contours joining aligned structures that they are close to each other (see Fig. 4.23) or through the aggregation


Fig. 4.22 Full primal sketch maps derived from raw primal sketch maps by organizing elementary structures (place tokens) into larger structures, such as regions, lines, curves, etc., according to continuity and proximity constraints


Fig. 4.23 Full primal sketch maps derived from raw primal sketch maps by aggregating elementary structures (place tokens), such as boundary elements or segment terminations, into larger structures (for example, contours), according to spatial alignment and continuity constraints. a The shading, generated by illumination in the indicated direction, creates local variations of brightness; by applying closure principles, contours related to the shadows are also detected, in addition to the edges of the object itself; b The circular contour is extracted from the curvilinear aggregation of the terminations of elementary radial segments (place tokens); c Curved edge detection through curvilinear aggregation and alignment of small elementary structures

of texture-oriented structures (see Fig. 4.24). This latter type of aggregation implies grouping similar structures oriented in a given direction and other groups of similar structures oriented in other directions. In this context, it is easy to extract rectilinear or curved structures by aligning terminations or discontinuities, as shown in the figures. The grouping of place token structures is based on local proximity (adjacent elementary structures are combined) and on similarity (similarly oriented elementary structures are combined), but the resulting structures of the visible surface can be influenced by more global considerations. For example, in the context of curvilinear aggregation, a closure principle can allow two segments, which are edge elements, to be joined even if the change in brightness across the segments is considerable, due to the effect of the lighting conditions (see Fig. 4.23a). Marr's approach combines many of the Gestalt principles discussed above. From the proposed grouping processes, primal sketch structures are derived at different scales to physically locate the significant contours in the images. It is important, as Marr claims, to derive different types of visible surface properties at different scales. The contours due to a change in the reflectance of the visible surface (for example, due to the presence of two overlapping objects), or due to a discontinuity in the orientation or depth of the surface, can be detected in two ways.


Fig. 4.24 Full primal sketch maps derived from raw primal sketch maps by organizing elementary oriented structures (place tokens) into vertical edges, according to constraints of directional discontinuity of the oriented elementary structures

In the first way, the contours can be identified with place token structures. The circular contour perceived in Fig. 4.23b can be derived through the curvilinear aggregation of place token structures assigned to the termination of each radial segment. In the second way, the contours can be detected from discontinuities in the parameters that describe the spatial organization of the structures present in the image. In this context, changes in the local spatial density of the place token structures, in their spacing, or in their dominant orientation could be used together to determine the contours. The contour that separates the two regions in Fig. 4.24b is not detected by analyzing the place token structures themselves, but is determined by the discontinuity in the dominant orientation of the elementary structures present in the image. This example shows how Marr's theory is applicable to the problem of texture and of region separation, in analogy to what Julesz achieved (see Sect. 3.2). As already noted, Julesz attempted to produce a general mathematical formulation to explain why some texture contours are clearly perceived while others are not easily visible without a detailed analysis. Marr instead provides a computational theory with a more sophisticated explanation. Julesz's explanation is purely descriptive, while Marr has shown how a set of descriptive principles (for example, some Gestalt principles) can be used to recover textures and higher-level structures from images. The success of Marr's theory is demonstrated by the ability of the early vision procedures to determine contours even in occlusion zones. The derived geometric structures are based only on the initial early vision procedures, without any high-level knowledge. For example, the early vision procedures have no knowledge of the possible shape of a teddy bear's head and do not find the contours of the eyes, since they have no prior knowledge with which to find them. We can say that Marr's theory, based on the processes of early vision, contrasts strongly with the theories of perceptive vision that are based on expectations and hypotheses about objects that drive each stage of perceptual analysis. These


latter theories are based on knowledge of the objects of the world, for which a representation model is stored in the computer.

The early vision procedures proposed by Marr also operate correctly on natural images, since they include grouping principles that reflect general properties of the observed world. Marr's procedures use higher-level knowledge only in case of ambiguity. For instance, Fig. 4.23a shows the edges of the object but also those caused by shadows that do not belong to the object itself. In this case, the segmentation procedure may fail to separate the object's contour from the background if it has no a priori information about the lighting conditions. In general, such ambiguities cannot be solved by analyzing a single image, but by considering additional information such as that deriving from stereo vision (observation of the world from two different points of view) or from the analysis of motion (observing time-varying image sequences), in order to extract the depth and movement information present in the scene, or by knowing the lighting conditions.
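The grouping of place tokens by proximity discussed above can be sketched, under strong simplifications, as a single-linkage clustering of token positions: tokens connected by chains of neighbors closer than a distance threshold end up in the same group. This is only an illustrative stand-in for the aggregation process (tokens are reduced to 2D positions and assumed already extracted); names and thresholds are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def group_place_tokens(positions, max_gap):
    """
    Single-linkage clustering of place-token positions: two tokens end up in the
    same group if they are connected by a chain of neighbours, each pair of which
    is closer than `max_gap` (grouping by proximity).
    """
    if len(positions) < 2:
        return np.ones(len(positions), dtype=int)
    z = linkage(pdist(positions), method="single")
    return fcluster(z, t=max_gap, criterion="distance")

# Two clouds of tokens with different spatial density, in the spirit of Fig. 4.22
rng = np.random.default_rng(0)
dense = rng.uniform(0, 10, size=(60, 2))        # closely spaced tokens
sparse = rng.uniform(20, 50, size=(20, 2))      # widely spaced tokens
tokens = np.vstack([dense, sparse])

labels = group_place_tokens(tokens, max_gap=3.0)
print("number of groups found:", len(np.unique(labels)))
```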

4.4.5 2.5D Sketch of Marr's Theory In Marr's theory, the goal of the first stages of vision is to produce a description of the visible surface of the observed objects, together with information indicating the structure of the objects with respect to the observer's reference system. In other words, all the information extracted from the visible surface is referred to the observer. The primal sketch data are analyzed to perform the first level of reconstruction of the visible surface of the observed scene. These data (extracted with a bottom-up approach), together with the information provided by some early vision modules, such as depth information (between scene and observer) and the orientation of the visible surface (with respect to the observer), form the basis for the first 3D reconstruction of the scene. The result of this first reconstruction is called the 2.5D sketch, in the sense that the result obtained, generally orientation maps (also called needle maps) and depth maps, is something more than 2D information but cannot yet be considered a 3D reconstruction. The contour, texture, depth, orientation, and movement information (called by Marr the full primal sketch information), extracted by the early vision processes such as stereo vision, motion analysis, and the analysis of texture and color, all contribute to the production of 2.5D sketch maps, seen as intermediate information, temporarily stored, which gives a partial solution waiting to be processed by the perceptual process for the reconstruction of the observed visible surface. Figure 4.25 shows an example of a 2.5D sketch map representing the visible surface of a cylindrical object in terms of the orientation (with respect to the observer) of elementary portions (patches) of the visible surface of the object. To render this representation effective, the orientation information is represented with oriented


Fig. 4.25 2.5D sketch map derived from the full primal sketch map and from the orientation map. The latter adds the orientation information (with respect to the observer) for each point of the visible surface. The orientation of each element (patch) of the visible surface is represented by a vector (seen as a little needle) whose length indicates how much the patch is inclined with respect to the observer (maximum length means a normal perpendicular to the viewing direction, zero length means a normal pointing toward the observer), while its direction coincides with the normal to the patch

needles, whose inclination indicates how the patches are oriented with respect to the observer. These 2.5D sketch maps are called needle maps or orientation maps of the observed visible surface. The length of each oriented needle describes the degree of inclination of the patch with respect to the observer. A needle with zero length indicates that the patch is perpendicular to the vector that joins the center of the patch with the observer. As the length of the needle increases, so does the inclination of the patch with respect to the observer; the maximum length is reached when the inclination angle of the patch reaches 90°. In a similar way, a depth map can be produced, indicating for each patch its distance from the observer. It may happen that the early vision processes do not produce information (orientation, depth, etc.) at some points of the visual field, and this results in a 2.5D sketch map with some areas lacking information. In this case, interpolation processes can be applied to produce the missing information on the map. This happens with the stereo vision process in occlusion areas, where no depth information is produced, while in the shape from shading process² (for the extraction of orientation maps) it occurs in areas of strong light (or depth) discontinuity, in which the orientation information is not produced.

² Shape-from-shading, as we will see later, is a method to reconstruct the surface of a 3D object from a 2D image based on shading information, i.e., it uses shadows and the light direction as reference points to interpolate the 3D surface. This method is used to obtain a 3D surface orientation map.
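The needle-map convention just described can be reproduced numerically: given a field of unit surface normals expressed in the observer's frame (z axis pointing toward the observer), each needle is the projection of the normal onto the image plane, so its length is zero when the patch faces the observer and maximal when the patch is seen edge-on. The normal field below is synthetic (a cylinder-like surface), not the output of any particular early vision module.

```python
import numpy as np

def needle_map(normals):
    """
    normals: (H, W, 3) array of unit surface normals in the viewer's frame,
             with the z axis pointing toward the observer.
    Returns (H, W, 2) needle vectors: the projection of each normal onto the
    image plane. Length 0 -> the patch faces the observer; length 1 -> the
    patch is inclined 90 degrees (seen edge-on).
    """
    return normals[..., :2]

# Synthetic normals for a vertical cylinder of unit radius seen frontally
H, W = 16, 32
x = np.linspace(-0.99, 0.99, W)                  # horizontal position on the cylinder
nx = np.tile(x, (H, 1))                          # x component of the outward normal
nz = np.sqrt(1.0 - nx ** 2)                      # z component (toward the observer)
normals = np.stack([nx, np.zeros_like(nx), nz], axis=-1)

needles = needle_map(normals)
lengths = np.linalg.norm(needles, axis=-1)
print("needle length near the axis (patch facing the observer):", lengths[H // 2, W // 2])
print("needle length at the silhouette (patch seen edge-on):   ", lengths[H // 2, 0])
```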


4.5 Toward 3D Reconstruction of Objects The final goal of vision is the recognition of the observed objects. Marr's vision processes, through 2.5D sketch maps, produce a representation, with respect to the observer, of the visible surface of 3D objects. This implies that the representation of the observed object varies with the point of view, thus making the recognition process more complex. To discriminate between the observed objects, the recognition process must operate by representing the objects with respect to their center of mass (or with respect to an absolute reference system) and not with respect to the observer. This third level of representation is called object centered. Recall that the shape that characterizes the observed object expresses the geometric information of the physical surface of the object. A shape representation is a formal scheme for describing a shape, or some aspects of a shape, together with rules that define how the scheme is applied to any particular shape. Depending on the type of representation chosen (i.e., the chosen scheme), a description can define a shape with a rough or detailed approximation. There are different models of 3D representation of objects. The choice between a viewer-centered and an object-centered representation of the objects becomes fundamental, since it characterizes the object recognition process itself. If the vision system uses a viewer-centered representation, then to fully describe the object it is necessary to represent it through different views of the object itself. This obviously requires more memory to maintain an adequate representation of all the observed views of the objects. Minsky [11] proposed to optimize the multi-view representation of objects by choosing significant primitives (for example, 2.5D sketch information) representing the visible surface of the object observed from appropriate different points of view. An alternative to the multi-view representation is the object-centered representation, which certainly requires less memory, since it maintains a single description of the spatial structures of the object (i.e., it uses a single 3D representation model of the object). The recognition process must then recognize the object whatever its spatial arrangement. We can conclude that an object-centered description presents more difficulties for the reconstruction of the object, since a single coordinate system is used for each object and this coordinate system must be identified from the image before the shape description can be reconstructed. In other words, the shape is described not with respect to the observer but relative to the object-centered coordinate system, which is based on the shape itself. For the recognition process, the viewer-centered description is easier to produce but more complex to use than the object-centered one, because the viewer-centered description depends strongly on the observation points of the objects. Once the coordinate system (viewer centered or object centered) is defined, that is, the way of organizing the representation that imposes a description of the objects, the fundamental role of the vision system is to extract, in a stable and unique way (invariance of the information associated with the primitives), the primitives that are the basic information used for the representation of the shape. We have previously mentioned the 2.5D sketch information, extracted from the process of early vision,


Fig. 4.26 A representation of objects from elementary blocks or 3D primitives
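The distinction between viewer-centered and object-centered descriptions discussed above can be made concrete with a rigid coordinate transformation: 3D points measured in the observer's frame (for example, from a 2.5D sketch) become an object-centered description once the pose of the object (rotation R and translation t with respect to the observer) is known. The pose values below are arbitrary illustrative numbers, not taken from the text.

```python
import numpy as np

def viewer_to_object_centered(points_viewer, R_obj_to_viewer, t_obj_in_viewer):
    """
    points_viewer: (N, 3) points in the observer's (viewer-centered) frame.
    R_obj_to_viewer, t_obj_in_viewer: pose of the object frame expressed in the
    viewer frame, i.e. p_viewer = R @ p_object + t.
    Returns the same points in the object-centered frame, which no longer
    depends on the point of view.
    """
    return (points_viewer - t_obj_in_viewer) @ R_obj_to_viewer  # equals R.T @ (p - t)

# Illustrative pose: object rotated 30 degrees about the vertical axis, 2 m away
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), 0, np.sin(a)],
              [0,         1, 0        ],
              [-np.sin(a), 0, np.cos(a)]])
t = np.array([0.0, 0.0, 2.0])

corners_obj = np.array([[0.1, 0.1, 0.1], [-0.1, 0.1, 0.1], [0.1, -0.1, 0.1]])
corners_viewer = corners_obj @ R.T + t          # what the observer would measure
recovered = viewer_to_object_centered(corners_viewer, R, t)
print(np.allclose(recovered, corners_obj))      # True: description independent of viewpoint
```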

which is essentially the orientation and depth (calculated with respect to the observer) of the visible surface observed in the field of view, computed for each point of the image. In essence, primitives contain 3D information about the visible surface or 3D volumetric information; the latter includes spatial information about the shape. The complexity of the primitives, considered as the first representation of objects, is closely linked to the type of information that the vision system can derive. While it is possible to define even complex primitives by choosing a sophisticated model of representation of the world, we are nevertheless limited by the capabilities of the vision processes, which may not be able to extract consistent primitives. An adequate trade-off must therefore be found between the model of representation of the objects and the type of primitives, which must guarantee stability and uniqueness [12]. The choice of the most appropriate representation will depend on the application context. A limited representation of the world, made of blocks (cubic primitives), can be adequate to represent, for example, the objects present in an industrial warehouse that handles packed products (see Fig. 4.26). A plausible representation for people and animals is to consider as solid primitives suitably organized cylindrical or conical figures [12] (see Fig. 4.27), known as generalized cones.³ With this 3D model, according to the approach of Marr and Nishihara [12], the object recognition process involves the comparison between the 3D reconstruction obtained from the vision system and the 3D representation of the object model using generalized cones, previously stored in memory. In the following paragraphs, we will analyze the problems related to the recognition of objects. In conclusion, the choice of the 3D model of the objects to be recognized, that is, of the primitives, must be adequate with respect to the application context that

³ A generalized cone can be defined as the solid created by the motion of an arbitrarily shaped cross-section (perpendicular to the direction of motion), of constant shape but variable in size, along the symmetry axis of the generalized cone. More precisely, a generalized cone is defined by sweeping a 2D cross-section curve, at a fixed angle called the eccentricity of the generalized cone, along a space curve called the spine of the generalized cone, expanding the cross-section according to a sweeping rule function. Although the spine, the cross-section, and the sweeping rule can be arbitrary analytic functions, in practice only simple functions are chosen: the spine is a straight or circular line, the sweeping rule is constant or linear, and the cross-section is generally rectangular or circular. The eccentricity is usually chosen as a right angle, so that the spine is normal to the cross-section.

Fig. 4.27 3D multiscale hierarchical representation of a human figure by using object-centered generalized cones


characterizes the model of representation of the objects, the modules of the vision system (which must reconstruct the primitives), and the recognition process (which must compare the reconstructed primitives with those of the model). With this approach, invariance is achieved thanks to the coordinate system centered on the object whose 3D components are modeled. Therefore, regardless of the point of view and of most viewing conditions, the same viewpoint-invariant 3D structural description can be retrieved by identifying the appropriate characteristics of the image (for example, the skeleton or principal axes of the object), recovering a canonical set of 3D components, and comparing the resulting 3D representation with similar representations obtained from the vision system. Following Marr's approach, in the next chapter we will describe the various vision methods that can extract all possible information in terms of 2.5D sketch from one or more 2D images, in order to represent the shape of the visible surface of the objects of the observed scene. Subsequently, we will describe the various models of object representation, analyzing the most adequate 2.5D sketch primitives, also in relation to the type of recognition process. A model very similar to the theory of Marr and Nishihara is Biederman's model [13,14], known as Recognition By Components (RBC). The limited set of volumetric primitives used in this model is called geons. RBC assumes that the geons are retrieved from the images on the basis of some key features, called non-accidental properties since they do not change when the object is viewed from different angles, i.e., shape configurations that are unlikely to occur by pure chance. Examples of non-accidental properties are parallel lines, symmetry, or Y-junctions (three edges that meet at a single point). These qualitative descriptions may be sufficient to distinguish different classes of objects, but not to discriminate within a class of objects having the same basic components. Moreover, this model is inadequate for differentiating dissimilar forms, since perceptual similarities between shapes can lead to recognizing as similar objects that are actually composed of different geons.
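Following the definition of a generalized cone given in the footnote above, a minimal construction is sketched below with the simple choices mentioned there: a straight spine, a circular cross-section, and a linear sweeping rule. Function names and parameter values are illustrative and are not taken from [12].

```python
import numpy as np

def generalized_cone(length=1.0, r0=0.2, r1=0.05, n_axis=30, n_circle=24):
    """
    Sample the surface of a generalized cone with:
      - a straight spine along z (length `length`),
      - a circular cross-section normal to the spine,
      - a linear sweeping rule: radius goes from r0 at the base to r1 at the tip.
    Returns an (n_axis, n_circle, 3) array of surface points.
    """
    z = np.linspace(0.0, length, n_axis)                  # position along the spine
    radius = r0 + (r1 - r0) * (z / length)                # linear sweeping rule
    theta = np.linspace(0.0, 2 * np.pi, n_circle, endpoint=False)
    x = radius[:, None] * np.cos(theta)[None, :]
    y = radius[:, None] * np.sin(theta)[None, :]
    zz = np.repeat(z[:, None], n_circle, axis=1)
    return np.stack([x, y, zz], axis=-1)

# A "forearm"-like primitive: a slightly tapered cylinder
forearm = generalized_cone(length=0.25, r0=0.045, r1=0.035)
print(forearm.shape)   # (30, 24, 3) surface samples, to be posed in an object-centered frame
```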


4.5.1 From Image to Surface: 2.5D Sketch Map Extraction This section describes how a vision system can describe the observed surface of objects by processing the intensity values of the acquired image. According to Marr's theory, the vision system can include different procedures, corresponding to the different processes of early vision, whose output gives a first description of the surface as seen by the observer. In the previous paragraphs, we have seen that this first description extracted from the image is expressed in terms of elementary structures (primal sketch) or in terms of higher-level structures (full primal sketch). Such structures essentially describe the 2D image rather than the surface of the observed world. To describe the 3D surface of the world, it is instead necessary to develop further vision modules that, given as input the image and/or the structures extracted from the image, extract the information associated with the physical structures of the 3D surface. This information can be the distance of different points of the surface from the observer, the shape of the solid surface at different distances and different inclinations with respect to the observer, the motion of the objects relative to each other and to the observer, the texture, etc. The human vision system is considered the vision machine par excellence from which to draw inspiration, studying how the nervous system, by processing the images of each retina (retinal images), correctly describes the 3D surface of the world. The goal is to study the various modules of the vision system considering:

(a) The type of information to be represented;
(b) The appropriate algorithm for obtaining this information;
(c) The type of information representation;
(d) The computational model adequate with respect to the algorithm and the chosen representation.

In other words, the modules of the vision system will be studied not on the basis of ad hoc heuristics but according to Marr's theory, introduced in the previous paragraphs, which provides for each module a methodological approach, known as the Marr paradigm, divided into three levels of analysis: computational, algorithmic, and implementation. The first level, the computational model, deals with the physical and mathematical modeling with which we intend to derive information about the physical structures of the observable surface, considering the aspects of the image acquisition system and the aspects that link the physical properties of the structures of the world (geometry and reflectance of the surface) with the structures extracted from the images (zero crossings, homogeneous areas, etc.). The second level, the algorithmic level, addresses how to realize a procedure that implements the computational model analyzed at the first level, and identifies the input and output information. The procedure chosen must be robust, i.e., it must guarantee acceptable results even with noisy input data.


The third level, the implementation level, deals with how the algorithms and the information representation are actually implemented, using hardware with an adequate architecture (for example, a neural network) and software (programs). Usually, a vision system operates in real-time conditions. The human vision system implements vision algorithms using a neural architecture. An artificial vision system is normally implemented on standard digital computers or on specialized hardware that efficiently implements vision algorithms requiring considerable computational complexity. The three levels are not completely independent: the choice of a particular hardware may influence the algorithm, or vice versa. Given the complexity of a vision system, it is plausible that for the complete perception of the world (or to reduce the uncertainty of perception) it is necessary to implement different modules, each specialized in recovering a particular piece of information such as depth, orientation of the elementary surface, texture, color, etc. These modules cooperate for the reconstruction of the visible 3D surface (of which one or more 2D images are available), according to Marr's theory inspired by the human visual system. In the following paragraphs, we will describe some methodologies used for extracting the shape of objects from the acquired 2D images. The shape is intended as the orientation and geometry information of the observed 3D surface. Since there are different methods for extracting shape information, the generic methodology can be indicated with X and a given methodology can be called Shape from X. By this expression we mean the extraction of the visible surface shape based on methodology X, using information derived from the acquired 2D image. The following methodologies will be studied:

Shape from Shading
Shape from Stereo
Shape from Stereo photometry
Shape from Contour
Shape from Motion
Shape from Line drawing
Shape from (de)Focus
Shape from Structured light
Shape from Texture

Basically, each of the above methods produces, according to Marr's theory, a 2.5D primal sketch map: for example, the depth map (with the Shape from Stereo and Structured light methodologies), the surface orientation map (with the Shape from Shading methodology), and the orientation of the light source (with the Shape from Shading and Shape from Stereo photometry methodologies). The human visual system perceives the 3D world by integrating (it is not yet known how) the 2.5D sketch information, deriving the primal sketch structures from the pair of 2D images formed on the retinas. The various shading and shape cues are skillfully exploited, for example, by an artist when representing in a painting a two-dimensional projection of the 3D world.


Fig. 4.28 Painting by Canaletto, known for his effectiveness in creating wide-angle perspective views of Venice with particular care for the lighting conditions

Figure 4.28, a painting by Canaletto, shows a remarkable perspective view of Venice and effectively highlights how the observer is stimulated to have a 3D perception of the world while observing its 2D projection. This 3D perception is stimulated by the shading information (i.e., by the nuances of the color levels) and by the perspective representation that the artist was able to capture in painting the work. The Marr paradigm is a plausible theoretical infrastructure for realizing an artificial vision system, but unfortunately it is not sufficient, by itself, to realize vision machines for automatic object recognition. In fact, many of the vision system modules are based on ill-posed computational models, that is, they do not yield a unique solution. Many of the Shape from X methods are formulated as the solution of inverse problems, in this case the recovery of the 3D shape of the scene starting from limited information, i.e., the 2D image, normally affected by noise whose model is unknown. In other words, vision is the inverse problem of the image formation process. As we shall see in the description of some vision modules, there are methods for realizing well-posed modules, producing efficient and unambiguous results, through a process of regularization [15]. This is possible by imposing constraints (for example, the local continuity of the intensity or of the geometry of the visible surface) to transform an ill-posed inverse problem into a well-posed one.
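The idea of turning an ill-posed inverse problem into a well-posed one by adding a smoothness constraint can be illustrated with a generic Tikhonov-regularized least-squares problem. This is only a schematic example of the regularization principle referred to in [15], applied to a toy deblurring problem, not a specific Shape from X module; all names and values are ours.

```python
import numpy as np

def tikhonov_solve(A, b, lam):
    """
    Solve min_x ||A x - b||^2 + lam * ||D x||^2, where D penalises differences
    between neighbouring unknowns (a smoothness constraint). The extra term
    makes the problem well posed: the solution exists, is unique, and depends
    continuously on the data even when A alone is severely ill conditioned.
    """
    n = A.shape[1]
    D = np.eye(n) - np.eye(n, k=1)                     # first-difference operator
    return np.linalg.solve(A.T @ A + lam * (D.T @ D), A.T @ b)

# Toy inverse problem: recover a signal from a blurred, noisy observation
rng = np.random.default_rng(0)
n = 100
t = np.linspace(0, 1, n)
x_true = np.sin(2 * np.pi * t) + (t > 0.5)             # smooth signal plus a step
i = np.arange(n)
kernel = np.exp(-0.5 * ((i[:, None] - i[None, :]) / 3.0) ** 2)
A = kernel / kernel.sum(axis=1, keepdims=True)         # row-normalised blur matrix
b = A @ x_true + 0.01 * rng.standard_normal(n)         # blurred data with noise

x_naive = np.linalg.solve(A, b)                        # direct inversion, numerically unstable here
x_reg = tikhonov_solve(A, b, lam=0.1)                  # regularised, stable reconstruction
print("naive inversion error:     ", np.linalg.norm(x_naive - x_true))
print("regularised solution error:", np.linalg.norm(x_reg - x_true))
```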


4.6 Stereo Vision A system based on stereo vision extracts distance information and the surface geometry of the 3D world by analyzing two or more images corresponding to two or more different observation points. The human visual system uses binocular vision for the perception of the 3D world. The brain processes the pair of images, 2D projections on the retinas of the 3D world, observed from two slightly different points of view lying on a horizontal line. The basic principle of binocular vision is the following: objects positioned at different distances from the observer are projected onto the retinal images at different locations. A direct demonstration of the principle of stereo vision can be obtained by looking at your own thumb at different distances, closing first one eye and then the other. You will observe that in the two retinal images the finger appears in two horizontally shifted locations. This relative difference in retinal location is called disparity. From Fig. 4.29 it can be understood how, in human binocular vision, the retinal images are slightly different in relation to the interocular distance of the eyes and to the distance and angle of the observed scene. Note that the image generated by the fixation point P is projected onto the foveal area thanks to the convergence movements of the eyes, that is, onto corresponding points of the two retinas. In principle, the distance of any fixated object could be determined from the viewing directions of the two eyes, through triangulation. Obviously, with this approach it would be essential to know precisely the orientation of each eye, which in humans is known only with limited accuracy, while binocular vision machines can be built with good precision of the fixation angles, as we will see in the following paragraphs. Given the binocular configuration, points closer to or farther from the point of fixation project their images at a certain distance from each fovea. The distance between the image of the fixation point and the image of each of the other points


Fig. 4.29 Binocular disparity. The fixation point is P, whose image is projected on the two foveae, stimulating corresponding main retinal zones. The other points V and L, not fixated, stimulate other zones of the two retinas that may be homologous, exhibiting binocular disparity, or may not be in correspondence, in which case the point is perceived as double


projected on the respective retinas is called retinal disparity. Therefore, for the fixation point P, projected onto each fovea, the horizontal disparity is zero, while for the points L and V the horizontal disparity is nonzero and can be calculated as a horizontal distance with respect to the fovea of each retina, or as the relative distance of the respective projections of L and V on the two retinas. We will see that it is possible to use the relative horizontal disparity to estimate the depth of an object instead of the horizontal retinal disparity (the distances between the projections LLPL and PLVL, and similarly for the right eye), the distance d from the fixation point P being unknown. The human binocular system is able to calculate this disparity and to perceive depth information for the points in the vicinity of the fixation point included in the visual field. The perception of depth tends to decrease when the distance of the fixation point exceeds 30–40 m. The brain uses the relative disparity information to derive depth perception, i.e., the distance between object and observer (see Fig. 4.29). The disparity is characterized not only by its value but also by its sign. The point V, closer to the observer, is projected onto the retina outside the fixation point P and generates positive disparity; conversely, the farther point L is projected inside P and generates negative disparity. We point out that if the retinal distance is too large, the images of the projected objects (very close to or very far from the observer) will not be attributed to the same object, thus resulting in double vision (the problem of diplopia). In Sect. 3.2 Vol. I we described the components of the human visual system, which in its evolution developed the ability to combine the two retinal images in the brain to produce a single 3D image of the scene. Aware of the complexity of the interaction of the various neuro-physical components of the human visual system, it is useful to know its functional mechanisms with a view to creating a binocular vision machine capable of imitating some of its functionalities. Schematically, according to the model proposed in 1915 by Claude Worth, human binocular vision can be articulated in three phases:

1. Simultaneous acquisition of image pairs.
2. Fusion of each pair.
3. Stereopsis, i.e., the ability of the brain to extract the 3D information of the scene and the position of objects in space starting from the horizontal visual disparity.

For a binocular vision machine, the simultaneous acquisition of image pairs is easily achieved by synchronizing the two monocular acquisition systems. The phases of fusion and stereopsis are much more complex.

4.6.1 Binocular Fusion Figure 4.29 shows the human binocular optical system, of which the interocular distance D is known, with the ocular axes assumed coplanar. The ocular convergence allows the focusing and fixation of a point P in space. Knowing the angle θ between


Fig. 4.30 The horopter curve is the set of points of the visual field that, projected on the two retinas, stimulate corresponding areas and are perceived as single elements. The fixation point P stimulates the main corresponding zones, while other points, such as Q, that lie on the horopter simultaneously stimulate corresponding zones and are seen as single elements. The point R, outside the horopter, is perceived as double (diplopia) since it stimulates unmatched retinal zones. The shape of the horopter curve depends on the distance of the fixation point

the ocular axes, a simple triangulation would make it possible to calculate the distance d = D/(2 tan(θ/2)) between the observer and the fixation point. Looking at the scene from two slightly different viewpoints, the two retinal images contain slightly shifted points of the scene. The problem of the fusion process is to find homologous points of the scene in the two retinas in order to produce a single image, as if the scene were observed by a cyclopean eye placed midway between the two. In human vision, this occurs at the retinal level if the visual field is spatially correlated with each fovea, where the disparity assumes a value of zero. In other words, the visual fields of the two eyes have a reciprocal bond between them, such that a retinal zone of one eye placed at a certain distance from the fovea finds in the other eye a corresponding homologous zone, positioned on the same side and at the same distance from its own fovea. Spatially scattered elementary structures of the scene projected onto the corresponding retinal areas stimulate them, propagating signals in the brain that allow the fusion of the retinal images and the perception of a single image. Therefore, when the eyes converge while fixating an object, this is seen as single, since the two main corresponding (homologous) foveal areas are stimulated. Simultaneously, other elementary structures of the object included in the visual field, although not fixated, can be perceived as single because they fall on, and stimulate, other corresponding (secondary) retinal zones. Points of space (seen as single, stimulating corresponding areas on the two retinas) that are at the same distance as the point of fixation form a circumference that passes through the fixation point and through the nodal points of the two eyes (see Fig. 4.30). The set of points that are at the same distance as the point of fixation and that induce corresponding positions on the retinas (zero disparity) form the


horopter. It is shown that these points subtend the same vergence angle and lie on the Vieth-Müller circumference, which contains the fixation point and the nodal points of the eyes. In reality, the shape of the horopter changes with the distance of the fixation point, the contrast, and the lighting conditions. When the fixation distance increases, the shape of the horopter tends to become a straight line, and eventually a curve with the convexity facing the observer. All the points of the scene included in the visual field but located outside the horopter stimulate non-corresponding retinal areas, and therefore will be perceived as blurred or tendentially double, thus giving rise to diplopia (see Fig. 4.30). Although all the points of the scene outside the horopter curve were defined as diplopic, in 1858 Panum showed that there is an area near the horopter curve within which these points, although stimulating retinal zones that are not perfectly corresponding, are still perceived as single. This area in the vicinity of the horopter is called the Panum area (see Fig. 4.31a). It is precisely these minimal differences between the retinal images, relative to the points of the scene located in the Panum area, that are used in the stereoscopic fusion process to perceive depth. Points at the same distance from the observer as the fixation point produce zero disparity. Points closer to and farther from the fixation point produce disparity measurements through a neurophysiological process. Retinal disparity is a condition in which the lines of sight of the two eyes do not intersect at the point of fixation, but in front of or behind it (Panum area). When processing a closer object, the visual axes converge, and the visual projection of an object in front of the fixation point leads to Crossed Retinal Disparity (CRD). On the other hand, when processing a distant object, the visual axes diverge, and the visual projection of an object behind the fixation point leads to Uncrossed Retinal Disparity (URD). In particular, as shown in Fig. 4.31b, the point V, closer than the fixation point P, induces CRD disparity and its image VL is shifted to the left in the left


Fig. 4.31 a Panum area: the set of points of the field of view, in the vicinity of the horopter curve, which, despite stimulating retinal zones that are not perfectly corresponding, are still perceived as single elements. b Objects within the horopter (closer to the observer than the fixation point P) induce crossed retinal disparity. c Objects outside the horopter (farther from the observer than the fixation point P) induce uncrossed retinal disparity


eye and to the right (VR) in the right eye. For the point L, farther than the fixation point P, the projection induces URD disparity, that is, the image LL is shifted to the right in the left eye and to the left (LR) in the right eye (see Fig. 4.31c). Note that the point V, nearest to the eyes, has the greater disparity.
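The triangulation relation recalled at the beginning of this section, d = D/(2 tan(θ/2)) for a symmetrically fixated point, can be checked numerically; the interocular distance and fixation distances below are illustrative values, and the function names are ours.

```python
import numpy as np

def fixation_distance(interocular_d, vergence_angle_rad):
    """Distance of a symmetrically fixated point from the midpoint of the baseline:
    d = D / (2 * tan(theta / 2)), theta being the angle between the ocular axes."""
    return interocular_d / (2.0 * np.tan(vergence_angle_rad / 2.0))

def vergence_angle(interocular_d, distance):
    """Inverse relation: vergence angle required to fixate a point at distance d."""
    return 2.0 * np.arctan(interocular_d / (2.0 * distance))

D = 0.065                                    # interocular distance, about 65 mm
for d in (0.4, 1.0, 10.0, 40.0):             # fixation distances in metres
    theta = vergence_angle(D, d)
    print(f"d = {d:5.1f} m -> vergence = {np.degrees(theta):6.3f} deg, "
          f"check: d = {fixation_distance(D, theta):.2f} m")
```

At a fixation distance of 40 m the vergence angle is already below 0.1°, consistent with the weakening of depth perception beyond 30–40 m noted earlier.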

4.6.2 Stereoscopic Vision The physiological evidence of depth perception through the estimated disparity value was demonstrated with the stereoscope invented in 1832 by the physicist C. Wheatstone (see Fig. 4.32). This tool allows the viewing of stereoscopic images. The first model consists of a reflection viewer composed of two mirrors positioned at an angle of 45° with respect to the respective figures placed at the ends of the viewer. The two figures represent the image of the same object (in the figure, a truncated rectangular pyramid) drawn from slightly different angles. The observer could approach the mirrors and move the supports of the two drawings until the two images reflected in the mirrors overlapped, perceiving a single 3D object. The principle of operation is based on the fact that to observe a nearby object the brain usually tends to converge the visual axes of the eyes, while in the stereoscopic view the visual axes must point separately and simultaneously at the left and right images to merge them into a single 3D object. The observer actually looks at the two sketch images IL and IR of the object through the mirrors SL and SR, so that the left eye sees the sketch IL of the object, simulating the acquisition made with the left monocular system, and the right eye similarly sees the sketch IR of the object, simulating the acquisition made by the right monocular system. In these conditions, the observer, looking at the images formed in the two mirrors SL and SR, has a clear perception of depth, with the impression of being in front of a solid figure. An improved and less cumbersome version of the reflective stereoscope was made by D. Brewster. The reflective stereoscope is still used today for photogrammetric surveys.


Fig. 4.32 The mirror stereoscope of Sir Charles Wheatstone (1832), which allows the viewing of stereoscopic images



Fig. 4.33 a Diagram of a simple stereoscope that allows the 3D recomposition of two stereograms with the help of (optional) magnifying lenses. The lenses make it easier for the eyes to remain parallel, as if looking at a distant object, and to focus simultaneously, at a distance of about 400 mm, the individual stereograms with the left and right eye, respectively. b Acquisition of the two stereograms

Figure 4.33 schematizes a common stereoscope, showing how it is possible to have the perception of depth (relief) AB by observing photos of the same object acquired from slightly different points of view (with disparity compatible with the human visual system). The perceived relief is given by the length of the object, represented by the segment AB perpendicular to the vertical plane passing through the two optical centers of the eyes. The disparity, that is, the difference in position on the two retinas of an observed point of the world, is measured in terms of pixels in the two images or as an angular discrepancy (angular disparity). The brain uses the disparity information to derive depth perception (see Fig. 4.29). The stereoscopic approach is based on the fact that the two images formed on the two eyes are slightly different, and the perception of depth arises from the cerebral fusion of the two images. The two representations of Fig. 4.33a must be stereo images that reproduce the object as if it were photographed with two monocular systems with parallel axes (see Fig. 4.33b), 65–70 mm apart (the average distance between the two eyes). Subsequently, alternative techniques were developed for viewing stereograms. For example, in stereoscopic cinema, each frame of the film consists of two images (the stereoscopic pair) in complementary colors. These films are viewed through glasses whose lenses have colors complementary to those of the images of the stereoscopic pair: the left or right lens has a color complementary to that of the leftmost, or rightmost, image in the frame. In essence, the two eyes see the two images separately, as if they were real, and the depth can be reconstructed. This technique has not been very successful in cinema. In photography, this method of viewing stereograms for the 3D perception of objects starting from two-dimensional images is called the anaglyph. In 1891 Louis Arthur Ducos du Hauron produced the first anaglyphic images, starting from the negatives of a pair of stereoscopic images taken with two side-by-side lenses and printing them both, almost overlapping, in complementary colors


(red and blue or green) on a single support, to be observed with glasses having, in place of lenses, two filters of the same complementary colors. The effect of these filters is to show each eye only one of the two stereo images. In this way, the perception of the 3D object is only an illusion, since the physical object does not exist: what the human visual system really acquires from the anaglyphs are the 2D images of the object projected on the two retinas, and the brain evaluates a disparity measure for each elementary structure present in the images (remember that the image pair makes up the stereogram). The 3D perception of the object realized with the stereoscope is identical to the sensation that the brain would have when seeing the 3D physical object directly. Today anaglyphs can be generated electronically by displaying the pair of images of a stereogram on a monitor and perceiving the 3D object by observing the monitor with glasses containing a red filter on one eye and a green filter on the other. The pairs of images can also be displayed superimposed on the monitor, with shades of red for the left image and shades of green for the right image. Since the two images are 2D projections of the object observed from slightly different points of view, when seen superimposed on the monitor the original 3D object would appear confused; with red and green filtered glasses, however, the brain fuses the two images, perceiving the 3D source object. A better visualization is obtained by displaying the pair of images on the monitor alternately, at a frequency compatible with the persistence of images on the retina.
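An electronic anaglyph of the kind just described can be produced by assigning the left image to the red channel and the right image to the complementary (green and blue) channels. The sketch below assumes a grayscale, already rectified stereo pair of equal size; the synthetic pair at the end only serves to exercise the function, and the function name is ours.

```python
import numpy as np

def make_anaglyph(left_gray, right_gray):
    """
    Build a red/cyan anaglyph from a grayscale stereo pair.
    The left view goes into the red channel, the right view into green and blue,
    so that red/cyan glasses deliver a different image to each eye.
    """
    left = left_gray.astype(float) / left_gray.max()
    right = right_gray.astype(float) / right_gray.max()
    h, w = left.shape
    anaglyph = np.zeros((h, w, 3))
    anaglyph[..., 0] = left          # red channel   -> left eye
    anaglyph[..., 1] = right         # green channel -> right eye
    anaglyph[..., 2] = right         # blue channel  -> right eye
    return anaglyph

# Synthetic stereo pair: a bright square shifted horizontally (disparity of 6 pixels)
left = np.zeros((100, 100)); left[40:60, 42:62] = 1.0
right = np.zeros((100, 100)); right[40:60, 36:56] = 1.0
rgb = make_anaglyph(left, right)
print(rgb.shape, rgb.min(), rgb.max())
```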

4.6.3 Stereopsis

Julesz [16] in 1971 showed how it can be relatively simple to evaluate disparity to obtain depth perception, and provided neurophysiological evidence that neural cells in the cortex are able to select elementary structures in pairs of retinal images and measure the disparity present. It is not yet clear, in biological terms, how the brain fuses the two images giving the perception of a single 3D object. He conceived the random-dot stereogram (see Fig. 4.34) as a tool to study the working mechanisms of the binocular vision process. Random-dot stereograms are generated by a computer producing two identical images of point-like structures randomly arranged with uniform density, essentially a texture of black and white dots. Next, a central window is selected in each stereogram (see Fig. 4.35a and b) and shifted horizontally by the same amount D, to the right in the left stereogram and to the left in the right one. After this shift of the central square (homonymous or uncrossed disparity), the area left without texture is filled with the same background texture (random black and white dots). In this way, the two central windows are immersed and camouflaged in the background, which has an identical texture. When each stereogram is seen individually, it appears (see Fig. 4.34) as a single texture of randomly arranged black and white dots.


Fig. 4.34 Random-dot stereogram


Fig. 4.35 Construction of the Julesz stereograms of Fig. 4.34. a The central window is shifted horizontally to the right by D pixels and the uncovered strip is filled with the same kind of texture (random white and black dots) as the background. b The same operation is repeated as in (a) but with a horizontal shift D to the left. c If the two stereograms thus constructed are observed simultaneously with a stereoscope, due to the disparity introduced, the central window is perceived as raised with respect to the background

When the pair of stereograms is instead viewed binocularly, that is, one image is seen by one eye and the other by the other eye, the brain performs the process of stereopsis: it fuses the two stereograms and perceives depth information, noting that the central square rises with respect to the background texture, toward the observer (see Fig. 4.35c). In this way, Julesz showed that the perception of the central square emerging at a different distance (as a function of D) from the background is to be attributed only to the disparity determined by the binocular visual system, which necessarily performs first the correspondence between points of the background (zero disparity) and between points of the central texture (disparity D). In other words, the perception of the central square raised from the background is due only to the disparity measure (no other information is used) that the brain computes through the process of stereogram fusion.


Fig. 4.36 Random-dot stereograms built with crossed disparity, that is, with the central window shifted horizontally outward in both stereograms. Stereo viewing of the pair, with the central windows superimposed in green and red, is obtained using glasses with filters of the same colors used in the stereograms

This demonstration makes it difficult to support any other theory of stereo vision based, for example, on a priori knowledge of what is being observed or on the fusion of particular structures of the monocular images (for example, the contours). If, in the construction of the stereograms, the central windows are shifted in the directions opposite to those of Fig. 4.35a and b, that is, to the left in one stereogram and to the right in the other (crossed disparity), with stereoscopic viewing the central window is perceived as emerging on the opposite side of the background, that is, it moves away from the observer (see Fig. 4.36). Within certain limits, the relief is perceived more easily the larger the disparity (homonymous or crossed).
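As a concrete illustration of the construction just described, the following minimal sketch generates a Julesz-style random-dot pair; the image size, window size, and shift D are arbitrary illustrative choices, and NumPy is assumed.

```python
# Minimal sketch of the random-dot stereogram construction of Fig. 4.35:
# a central window is shifted horizontally by D pixels (to the right in the
# left image, to the left in the right image) and the uncovered strip is
# refilled with fresh random texture so the window stays camouflaged.
import numpy as np

def random_dot_stereogram(size=256, square=100, d=8, seed=0):
    rng = np.random.default_rng(seed)
    base = rng.integers(0, 2, (size, size), dtype=np.uint8) * 255
    left, right = base.copy(), base.copy()

    r0 = c0 = (size - square) // 2
    window = base[r0:r0 + square, c0:c0 + square].copy()

    # Shift the central window in opposite directions in the two images.
    left[r0:r0 + square, c0 + d:c0 + square + d] = window
    right[r0:r0 + square, c0 - d:c0 + square - d] = window

    # Fill the strips left uncovered by the shift with new random dots.
    left[r0:r0 + square, c0:c0 + d] = rng.integers(0, 2, (square, d), dtype=np.uint8) * 255
    right[r0:r0 + square, c0 + square - d:c0 + square] = rng.integers(0, 2, (square, d), dtype=np.uint8) * 255
    return left, right
```

Viewed with a stereoscope (or as an anaglyph generated as in the previous sketch), the central square is perceived at a different depth than the background, exactly as discussed above.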

4.6.4 Neurophysiological Evidence of Stereopsis

Several biological systems use stereo vision, while others (rabbits, fish, etc.) observe the world with panoramic vision, that is, their eyes are placed so as to observe different parts of the world and, unlike stereo vision, the pairs of images do not share overlapping areas of the observed scene for depth perception. The studies of Wheatstone and Julesz have shown that binocular disparity is the key feature for stereopsis. Let us now look at some neurophysiological mechanisms related to retinal disparity. In Sect. 3.2 Vol. I it was shown how the visual system propagates light stimuli onto the retina and how impulses propagate from there to the brain components. The stereopsis process uses information from the striate cortex and other levels of the binocular visual system to represent the 3D world. The stimuli coming from the retina through the optic tract (containing fibers of both eyes) are transmitted up to the lateral geniculate nucleus (LGN), which functions as a thalamic relay station, subdivided into 6 laminae, for sorting the different information (see Fig. 4.37).


In fact, the fibers coming from each retina are composed of axons deriving from the large ganglion cells (type M), from the small ganglion cells (type P), and from small ganglion cells of type non-M and non-P, called koniocellular or K cells. The receptive fields of the ganglion cells are circular, of center-ON and center-OFF type (see Sect. 4.4.1). The M cells are connected, through the bipolar cells, with a large number of photoreceptors (cones and rods) and for this reason they are able to provide information on the movement of an object or on rapid changes in brightness. The P cells are connected with fewer receptors and are suited to providing information on the shape and color of an object. Some differences between the M and P cells should be highlighted. The former are not very sensitive to different wavelengths, are selective for low spatial frequencies, have a high temporal response and conduction velocity, and have wide dendritic branching. The P cells, on the other hand, are selective for different wavelengths (color) and for high spatial frequencies (useful for capturing details, having small receptive fields), and have low conduction velocity and temporal resolution. The K cells are very selective for different wavelengths and do not respond to orientation. Laminae 1 and 2 receive the signals of the M cells, while the remaining four laminae receive the signals of the P cells. The interlaminar layers receive the signals from the koniocellular cells K. The receptive fields of the K cells are also circular and of center-ON and center-OFF type. Like the P cells they are color-sensitive, with the specificity that their receptive fields are opponent for red-green and blue-yellow.4 As shown in the figure, the information of the two eyes is transmitted separately to the different LGN laminae in such a way that the nasal hemiretina covers the hemifield of view of the temporal side, while the temporal hemiretina of the opposite eye covers the hemifield of view of the nasal side. Only in the first case does the information of the two eyes cross. In particular, laminae 1, 4, and 6 receive information from the nasal retina of the opposite eye (contralateral), while laminae 2, 3, and 5 receive it from the temporal retina of the eye on the same side (ipsilateral).

4 In addition to the trichromatic theory (based on three types of cones sensitive to red, green, and blue, whose combination determines the perception of color in relation to the incident light spectrum), Hering (1834–1918) proposed the theory of opponent colors. According to this theory, we perceive colors by combining 3 pairs of opponent colors: red-green, blue-yellow, and an achromatic channel (white-black) used for brightness. This theory predicts the existence in the visual system of two classes of cells, one selective for the opponent colors (red-green and yellow-blue) and one for brightness (black-white opponency). In essence, downstream of the cones (sensitive to red, green, and blue), suitable connections with bipolar cells would produce ganglion cells with the typical properties of chromatic opponency, having a center-periphery organization. For example, if a red light affects 3 cones R, G, and B connected to two bipolar cells β1 and β2 with the cone-cell connection configuration β1 (+R, −G) and β2 (−R, −G, +B), we would have an excitation of the bipolar cell β1 stimulated by the cone R, which sends the signal +R − G to its ganglion cell. A green light, on the other hand, inhibits both bipolar cells. A green or red light inhibits the bipolar cell β2, while a blue light sends to its ganglion cell the signal +B − (G + R). Hubel and Wiesel demonstrated the presence of cells in the retina and in the lateral geniculate nucleus (the P and the K cells) that respond with chromatic opponency organized according to the properties of the center-ON and center-OFF receptive fields.


In this way, each lamina contains a representation of the contralateral visual hemifield (of the opposite side). With this organization of information, the spatial arrangement of the receptive fields associated with the ganglion cells is maintained in the LGN, and each lamina stores the complete map of the field of view of each hemiretina (see Fig. 4.37).

4.6.4.1 Structure of the Primary Visual Cortex

The signals of these types of ganglion cells are propagated in parallel by the LGN, through its neurons, toward different areas of the primary visual cortex, or V1, also known as striate cortex (see Fig. 4.37). At this point, the signals coming from the two retinas have already undergone a pre-processing and, on leaving the LGN layers, the topographic representation of the fields of view, although separated in the various LGN layers, is maintained; it will continue to be maintained as the signals propagate toward the cortex V1. The primary visual cortex is composed of a structure of 6 horizontal layers, about 2 mm thick overall (see Fig. 4.38), each of which contains different types of neural cells (central body, dendrites, and axons of various proportions), estimated at a total of 200 million. The organization of the cells in each layer is vertical, with the cells aligned in columns perpendicular to the layers. The layers in each column are connected through the axons, which make synapses along the way. This organization allows the cells to activate simultaneously for the same stimulus. The flow of information to some areas of the cortex can propagate up or down the layers. In the cortex V1, from the structural point of view, two types of neurons are distinguished: stellate cells and pyramidal cells. The former are small with spiny dendrites. The latter have a single large apical dendrite which, as we shall see, extends into all the layers of the cortex V1. Layer I receives the distal dendrites of the pyramidal neurons and the axons of the koniocellular pathways. These last cells make synapses in layers II and III, where there are small stellate and pyramidal cells. The cells of layers II and III make synapses with the other cortical areas. Layer IV of the cortex is divided into the substrates IV A, IV B, and IV C. The IV C layer is further subdivided into IV Cα and IV Cβ because of the different connectivity found between the cells of the upper and lower parts of this substrate. The propagation of information from the LGN to the primary visual cortex occurs through the P (parvocellular), M (magnocellular), and K (koniocellular) channels. In particular, the axons of the M cells end in the substrate IV Cα, while the axons of the P cells end in the substrate IV Cβ. The substrate IV B receives its input from the substrate IV Cα and its output is transmitted to other parts of the cortex.


Fig. 4.37 Propagation of visual information from the retinas to the Lateral Geniculate Nucleus (LGN) through the optic nerves, chiasm, and optic tracts. The information from the right field of view is processed by the left LGN and vice versa. The information of the left field of view seen by the right eye (temporal hemiretina) does not cross at the chiasm and is processed by the right LGN; the information of the same hemifield seen by the left eye (nasal hemiretina) crosses at the chiasm and also reaches the right LGN. The spatial arrangement of the field of view is inverted and mirrored on the retina, but the information propagates toward the LGN maintaining the topographic arrangement of the retina. The relative disposition of the hemiretinas is mapped on each lamina (in the example, the points A, B, C, D, E, F)


Fig. 4.38 Cross-section of the primary visual cortex (V1) of the monkey. The six layers and relative substrates with different cell density and connections with other components of the brain are shown

Layers V and VI, with a high density of cells, including pyramidal cells, transmit their output back to the superior colliculus5 and to the LGN, respectively. As shown in Fig. 4.37, the area of the left hemisphere of V1 receives only the visual information related to the right field of view, and vice versa. Furthermore, the information that reaches the cortex from the retina is organized in such a way as to preserve the hemiretina of origin, the cell type (P or M), and the spatial position of the ganglion cells inside the retina (see Fig. 4.42). In fact, the axons of the M and P cells transmit the information of the retinas to the substrates IV Cα and IV Cβ, respectively. In addition, neighboring cells in these layers receive information from local areas of the retina, thus maintaining the topographical structure of origin.

4.6.4.2 Neurons of the Primary Visual Cortex

Hubel and Wiesel (winners of the Nobel Prize in Physiology or Medicine in 1981) discovered several types of cells present in the cortex V1, called simple, complex, and hypercomplex cells. These cortical cells are sensitive to stimuli at different spatial orientations, with a resolution of approximately 10°. The receptive fields of cortical cells, unlike the circular center-ON and center-OFF fields of LGN and ganglion cells, are rather elongated (rectangular) and more extensive, and the cells are classified into three categories:

5 Organ that controls saccadic movements and coordinates visual and auditory information, directing the movements of the head and eyes in the direction where the stimuli are generated. It receives direct information from the retina and from different areas of the cortex.


Fig. 4.39 Receptive fields of cortical cells associated with the visual system and their response to different orientations of the light beam. a Receptive field of the simple cell, of ellipsoidal shape compared with the circular one of ganglion and LGN cells. The diagram shows the maximum stimulus only when the light bar is totally aligned with the ON area of the receptive field, while the cell remains inhibited when the light falls on the OFF zone. b Responses of the complex cell, with a rectangular receptive field, as the inclination of a moving light bar changes. The arrows indicate the direction of motion of the stimulus. From the diagram, we note the maximum stimulation when the light is aligned with the axis of the receptive field and moves to the right, while the stimulus is almost zero with motion in the opposite direction. c Responses of the hypercomplex cell when stimulated by a light bar that increases in length beyond the size of the receptive field. The behavior of these cells (called end-stopped cells) is such that the stimulus increases, reaching the maximum when the light bar completely covers the receptive field, but their activity decreases if they are stimulated with a larger light bar

1. Simple cells have rather narrow and elongated excitatory and inhibitory areas, with a specific orientation axis. These cells function as detectors of linear structures; in fact, they are strongly stimulated when a rectangular light beam is located in a given area of the field of view and oriented in a particular direction (see Fig. 4.39a). The receptive fields of simple cells seem to be produced by the convergence of the receptive fields of several adjacent cells of the substrate IV C. The latter, known as stellate cells, are small neurons with circular receptive fields that receive signals from the cells of the geniculate body (see Fig. 4.40a), which, like retinal ganglion cells, are of center-ON and center-OFF type.

2. Complex cells have extended receptive fields but no clear zone of excitation or inhibition. They respond well to the motion of an edge with a specific orientation and direction of motion (good motion detectors, see Fig. 4.39b). Their receptive fields seem to be produced by the convergence of the receptive fields of several simple cells (see Fig. 4.40b).


Fig. 4.40 Receptive fields of simple and complex cortical cells, generated by multiple cells with circular receptive fields. a Simple cell generated by the convergence of 4 stellate cells receiving the signal from adjacent LGN neurons with circular receptive fields. The simple cell with an elliptic receptive field responds better to the stimuli of a localized light bar oriented in the visual field. b A complex cell generated by the convergence of several simple cells that responds better to the stimuli of a localized and oriented bar (also in motion) in the visual field

The peculiarity of motion detection is due to two phenomena. The first occurs when the axons of several adjacent simple cells with the same orientation, but not identical receptive fields, converge on a complex cell, which determines the motion from the difference between these receptive fields. The second occurs when the complex cell can determine motion through the different latency times in the responses of adjacent simple cells. Complex cells are very selective for a given direction, responding only when the stimulus moves in one direction and not in the other (see Fig. 4.39b). Compared to simple cells, complex cells are not conditioned by the position of the light bar (in stationary conditions) within the receptive field. The amount of the stimulus also depends on the length of the rectangular light beam that falls within the receptive field. They are located in layers II and III of the cortex and in the boundary areas between layers V and VI.

3. Hypercomplex cells represent a further extension of the processing of visual information and of our knowledge of the biological visual system. Hypercomplex cells (known as end-stopped cells) respond only if a light stimulus has a given ratio between the illuminated and the dark surface, or comes from a certain direction, or includes moving forms. Some of these cells respond well only to rectangular beams of light of a certain length (completely covering the receptive field), so that if the stimulus extends beyond this length, the response of the cells is significantly reduced (see Fig. 4.39c). Hubel and Wiesel characterize these receptive fields as containing activating and antagonistic regions (similar to excitatory/inhibitory regions). For example, the left half of a receptive field can be the activating region, while the antagonistic region is on the right. As a result, the hypercomplex cell will respond, with spatial summation, to stimuli on the left side (within the activating region) to the extent that they do not extend further into the right side (the antagonistic region).


This receptive field would be described as stopped at one end (i.e., the right). Similarly, hypercomplex receptive fields can be stopped at both ends. In this case, a stimulus that extends too far in both directions (for example, too far to the left or too far to the right) will begin to stimulate the antagonistic region and reduce the response of the cell. Hypercomplex cells arise when the axons of some complex cells, with adjacent receptive fields and different orientations, converge on a single neuron. These cells are located in the secondary visual area (also known as V5 and MT). Following the experiments of Hubel and Wiesel, it was discovered that some simple and complex cells also exhibit the same property as the hypercomplex ones, that is, they have end-stopping properties when the luminous stimulus exceeds a certain length, overrunning the margins of the receptive field.

From the properties of the neural cells of the primary visual cortex, a computational model emerges, with principles of self-learning, that explains the ability to sense structures (for example, lines, points, bars) present in the visual field and their motion. Furthermore, we observe a hierarchical model of visual processing that starts from the lowest level, the level of the retinas, which contains the scene information (in the field of view); the LGN level, which captures the position of objects; the level of simple cells, which detect the orientation of elementary structures (lines); the level of complex cells, which detect their movement; and the level of hypercomplex cells, which perceive the edges of objects and their orientation. The functionality of simple cells can be modeled using Gabor filters to describe their sensitivity to the orientation of a linear light stimulus.

Figure 4.41 summarizes the connection scheme between the retinal photoreceptors and the neural cells of the visual cortex. In particular, it is observed how groups of cones and rods are connected with a single bipolar cell, which in turn is connected with one of the ganglion cells from which the fibers afferent to the optic nerve originate; the exit point of the optic nerve (papilla) is devoid of photoreceptors. This architecture suggests that stimuli from retinal areas of a certain extension (associated, for example, with an elementary structure of the image) are conveyed into a single afferent fiber of the optic nerve. In this architecture a hierarchical organization of the cells emerges, starting from the bipolar cells that feed the ganglion cells, up to the hypercomplex cells. The fibers of the optic nerve coming from the medial half of the retina (nasal field) cross, at the level of the optic chiasm, those coming from the temporal field and continue laterally. From this it follows (see also Fig. 4.37) that, after the crossing in the chiasm, the right optic tract contains the signals coming from the left half of the visual field and the left optic tract the signals of the right half. The fibers of the optic tracts reach the lateral geniculate bodies, which form part of the thalamic nuclei: here there is the synaptic junction with the neurons that send their fibers to the cerebral cortex of the occipital lobes, where the primary visual cortex is located. The latter occupies the terminal portion of the occipital lobes and extends over their medial surface along the calcarine fissure (or calcarine sulcus).
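As an illustration of the Gabor model just mentioned for simple cells, the following is a minimal sketch; the kernel size, wavelength, and Gaussian width are illustrative assumptions, not values taken from the text.

```python
# Minimal sketch: an oriented Gabor kernel as a model of a simple cell's
# receptive field, plus a crude half-wave rectified "firing rate".
import numpy as np

def gabor_kernel(size=31, wavelength=8.0, theta=0.0, sigma=4.0, phase=0.0):
    """Return a size x size Gabor kernel tuned to orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinates so that the sinusoid varies along the x' axis.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

def simple_cell_response(patch, kernel):
    """Half-wave rectified dot product between an image patch and the kernel:
    maximal for a bright bar aligned with the kernel's preferred orientation."""
    return max(0.0, float(np.sum(patch * kernel)))
```

A bank of such kernels at different orientations (for example, spaced by roughly the 10° resolution quoted above) gives a rough computational counterpart of the orientation selectivity of simple cells.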



Fig. 4.41 The visual pathway, with the flow of information from the photoreceptors and retinal cells to the cells of the visual cortex of the brain. In particular, the separate pathways of the visual information coming from the nasal and temporal hemiretinas of each eye are highlighted

4.6.4.3 Columnar Organization of the Primary Visual Cortex

In the second half of the twentieth century, several experiments analyzed the organization of the visual cortex and evaluated the functionality of its various neural components. It was discovered that the cortex is also divided radially into several columns, in which are aggregated the neuronal cells that respond with the same characteristic to stimuli arising from a given point of the field of view. This columnar aggregation actually forms functional units perpendicular to the surface of the cortex. It turns out that, by inserting a microelectrode perpendicularly to the various layers of the cortex V1, all the neuronal cells (simple and complex) that it encounters respond to stimuli with the same orientation, for example. Conversely, if the electrode is inserted parallel to the surface of the cortex, crossing several columns in the same layer, different preferred orientations are observed from column to column. This neural organization highlights how the responses of the cells that capture the orientation information are correlated with those of the visual field, maintaining, in fact, the topographic representation of the field of view. Based on these results, Hubel and Wiesel demonstrated the periodic organization of the visual cortex, consisting of a set of vertical columns (each containing cells with receptive fields selective for a given orientation) called hypercolumns. A hypercolumn can be thought of as a subdivision of the visual cortex into vertical plates (see Fig. 4.42) which are repeated periodically (representing all perceived orientations).


Fig. 4.42 Columnar organization, perpendicular to the layers, of the cells in the cortex V1. The ocular dominance columns of the two eyes are indicated with I (ipsilateral) and with C (contralateral). The orientation columns are indicated with oriented bars. Blob cells are located between the columns of the layers II, III and V, VI

Each column crosses the 6 layers of the cortex and represents an orientation in the visual field with an angular resolution of about 10°. The cells crossed by each column respond to stimuli with the same orientation (orientation column) or to the input of the same eye (ocular dominance column). An adjacent column includes cells that respond to a slightly different orientation than the neighboring one, and possibly to the input of the same eye or of the other. The neurons in layer IV are an exception, as they can respond to any orientation but to only one eye. From Fig. 4.42 it can be seen that the signals of the M and P cells of the two eyes, coming from the LGN, are kept separate in layer IV of the cortex and are projected, respectively, into the substrates IV Cα and IV Cβ, where monocular cells with center-ON and center-OFF circular receptive fields are found.


Therefore, the signals coming from the LGN are associated with one of the two eyes, never with both, while each cell of the cortex can be associated with the input from one eye or the other. It follows that there are ocular dominance columns, arranged alternately and associated with the two eyes (ipsi- or contralateral), which extend horizontally in the cortex and consist of simple and complex cells. The cells of the IV Cα substrate propagate the signals to the neurons (simple cells) of the substrate IV B. The latter respond to stimuli from both eyes (binocular cells), unlike the cells of the IV C substrate, whose receptive fields are monocular. Therefore, the neurons of the IV B layer begin the integration process useful for binocular vision. These neurons are selective for movement and also for direction, responding only when stimulated by a beam of light that moves in a given direction.

A further complexity of the functional architecture of V1 emerged with the discovery (in 1987), by means of a contrast medium, of another type of column alongside the ocular dominance columns, regularly spaced and localized in layers II-III and V-VI of the cortex. These columns are made up of arrays of neurons that receive input from the parvocellular and koniocellular pathways. They are called blobs, appearing (with the contrast medium) as leopard spots when viewed in tangential sections of the cortex. The characteristic of the neurons included in the blobs is that of being sensitive to color (i.e., to the different wavelengths of light, thanks to the information of the K and P channels) and to brightness (thanks to the information coming from the M channel). Between the blobs there are regions with neurons that receive signals from the magnocellular pathways. These regions, called interblobs, contain orientation columns and ocular dominance columns whose neurons are motion-sensitive and nonselective for color. Therefore, the blobs are in fact modules in which the signals of the three channels P, M, and K converge, and where it is assumed that these signals (i.e., the spectral and brightness information), on which the perception of color and of brightness variation depends, are combined.

This organization of the cortex V1 in hypercolumns (also known as cortical modules), each of which receives input from the two eyes (orientation columns and ocular dominance columns), is able to analyze a portion of the visual field. Therefore, each module includes neurons sensitive to color, movement, and linear structures (lines or edges) for a given orientation and for an associated area of the visual field, and integrates the information of the two eyes for depth perception. The orientation resolution in the various parallel cortical layers is 10° and the whole module can cover an angle of 180°. It is estimated that a cortical module covering a region of only 2 × 2 mm of the visual cortex is able to perform a complete analysis of a visual stimulus. The complexity of the brain is such that the functionalities of the various modules and their total number have not yet been clearly defined.

4.6.4.4 Area V1 Interaction with Other Areas of the Visual Cortex

Before going into how the information coming from the retinas is further processed and distributed in the visual cortex, we summarize the complete pathways of the visual system.


The visual pathways begin at each retina (see Fig. 4.37), leave the eye by means of the optic nerve, pass through the optic chiasm (where there is a partial crossing of the nerve fibers coming from the two hemiretinas of each eye), and then continue as the optic tract (seen as a continuation of the optic nerve). The optic tract goes toward the lateral geniculate body of the thalamus. From here the fibers, which make up the optic radiations, reach the visual cortex in the occipital lobes. The primary visual cortex transmits most of the first processed information of the visual field to the adjacent secondary visual cortex V2, also known as area 18. Although most neurons in the V2 cortex have properties similar to those of neurons in the primary visual cortex, many others have the characteristic of being much more complex. From areas V1 and V2 the processed visual information continues toward the so-called associative areas, which process information at a more global level. These areas progressively combine (associate) the first-level visual information with information deriving from other senses (hearing, touch, . . .), thus creating a multisensory representation of the observed world. Several studies have highlighted dozens of cortical areas that contribute to visual perception. Areas V1 and V2 are surrounded by several of these cortical and associative visual areas, called V3, V4, V5 (or MT), PO, TEO, etc. (see Fig. 4.43). From the visual area V1 two cortical pathways of propagation and processing of visual information branch out [17]: the ventral pathway, which extends to the temporal lobe, and the dorsal pathway, which projects to the parietal lobe.


Fig. 4.43 Neuronal pathways involved in visuospatial processing. Distribution of information from the retina to other areas of the visual cortex that interface with the primary visual cortex V 1. The dorsal pathway, which includes the parietal cortex and its projections to the frontal cortex, is involved in the processing of spatial information. The ventral pathway, which includes the inferior and lateral temporal cortex and their projections to the medial temporal cortex, is involved in the processing of recognition and semantic information


The main function of the ventral visual pathway (the channel of what is observed, i.e., the object recognition pathway) seems to be conscious perception, that is, to make us recognize and identify objects by processing their intrinsic visual properties, such as shape and color, storing such information in long-term memory. The basic function of the dorsal visual pathway (the channel of where an object is, i.e., the spatial vision pathway) seems to be associated with visual-motor control over objects, by processing their extrinsic properties, which are essential for their localization (and mobility), such as size, position, and orientation in space, and for saccadic movements. Figure 4.43 shows the connectivity between the main areas of the cortex. The signals start from the ganglion cells of the retina and, through the LGN and V1, branch out toward the ventral pathway (from V1 to V4, reaching the inferior temporal cortex IT) and the dorsal pathway (from V1 to V5, reaching the posterior parietal cortex), thus realizing a hierarchical connection structure. In particular, the parietomedial temporal area integrates information from both pathways and is involved in the encoding of landmarks in spatial navigation and in the integration of objects into the structural environment. The flow of information in the ventral visual channel for the perception of objects can be summarized as follows:

Area V1-Primary Visual Cortex has a retinotopic organization, meaning that it contains a complete map of the visual field covered by the two eyes. In this area, as we have seen, oriented edges arising from local brightness variations in the image are detected along different orientation angles (orientation columns). Color information is also kept separate, and spatial frequencies and depth information are detected.

Area V2-Secondary Visual Cortex is where the detected edges are combined to develop a vocabulary of intersections and junctions, together with many other elementary visual features (for example, texture, depth, . . .), fundamental for the perception of more complex shapes, along with the ability to distinguish whether a stimulus belongs to the background or to part of the object. The neural cells of V2, with properties similar to those of V1, encode these features over a wide range of positions, starting a process that ends with the neurons of the inferior temporal area (IT) in order to recognize an object regardless of where it appears in the visual field. The V2 area is organized in three modules, one of which, the thin stripes, receives the signals from the blobs. The other two modules receive the signals from the interblobs in the interstripes and thick stripes. From V2, and in particular from the thick stripes (depositories of movement and depth), it is believed that signals are transmitted to the V5 area. V1 and V2 basically produce the initial information related to color, motion analysis, and shape, which is then transmitted in parallel to the other areas of the extrastriate visual cortex for further specialized processing.

Area V3-Third Visual Cortex receives many connections from area V2 and connects with the middle temporal (MT) areas. V3 neurons have properties identical to those of V2 (selectivity for orientation), but many others have more complex properties yet to be fully understood.


Some of the latter are sensitive to color and movement, characteristics more commonly analyzed in other stages of the visual process.

Area V4 receives the information flow after processing in V1, V2, and V3, and continues the processing of color information (received from the blobs and interblobs of V1) and of shape. In this area there are neurons with properties similar to those of other areas but with more extensive receptive fields than those of V1. This area is still to be analyzed in depth. It seems to be essential for the perception of extended and more complex contours.

Area IT-Inferior Temporal Cortex receives many connections from area V4 and includes complex cells that show little sensitivity to the color and size of the perceived object but are very sensitive to its shape. Studies have led to consider this area important for face recognition and for visual memory capacity.

The cortical areas of the dorsal pathway that terminate in the parietal lobe elaborate the spatial and temporal aspects of visual perception. In addition to spatially locating the visual stimulus, these areas are also linked to aspects of movement, including eye movement. In essence, the dorsal visual pathway integrates the spatial information between the visual system and the environment for a correct interaction. The dorsal pathway includes several cortical areas, in particular the Middle Temporal (MT) area, also called area V5, the Medial Superior Temporal (MST) area, and the lateral and ventral intraparietal areas (LIP and VIP, respectively). The MT area is believed to contribute significantly to the perception of movement. This area receives the signals from V2, V3, and the substrate IV B of V1 (see Fig. 4.43). We know that the latter is part of the magnocellular pathway involved in the analysis of movement. Neurons of MT have properties similar to those of V1, but have larger receptive fields (covering up to tens of degrees of visual angle). They have the peculiarity of being activated only if the stimulus that falls on their receptive field moves in a preferred direction. The MST area is believed to contribute to the analysis of movement as well; it is sensitive to radial motion (that is, approaching or moving away from a point) and to circular motion (clockwise or counterclockwise). The neurons of MST are also selective for movements in complex configurations. The LIP area is considered to be the interface between the visual system and the oculomotor system. The neurons of the LIP and VIP areas (which receive the signals from V5 and MST) are also sensitive to stimuli generated by a limited area of the field of view and are active for stimuli resulting from an ocular movement (also known as a saccade) toward a given point in the field of view. The brain can use this wealth of information associated with movement, acquired through the dorsal pathway, for various purposes. It can acquire information about objects moving in the field of view, understand the nature of their motion compared to its own by moving the eyes, and then act accordingly.

The activities of the visual cortex take place through various hierarchical levels, with serial propagation of the signals and their processing also in parallel through the different communication channels, thus forming a highly complex network of circuits.


Fig. 4.44 Overall representation of the connections and the main functions performed by the various areas of the visual cortex. The retinal signals propagate, segregated, through the magnocellular and parvocellular pathways and, from V1, continue along the dorsal and ventral pathways. The ventral pathway specializes in the perception of form and color, while the dorsal pathway is selective for the perception of movement, position, and depth

This complexity is attributable in part to the many feedback loops that each of these cortical areas forms with its connections to receive and return information, considering all the ramifications that, for the visual system, originate from the receptors (with the ganglion neurons) and are then transmitted to the visual cortical areas through the optic nerve, chiasm and optic tract, and the lateral geniculate nucleus of the thalamus. Figure 4.44 summarizes schematically, at the current state of knowledge, the main connections and the activities performed by the visual areas of the cortex (of the macaque), as the signals from the retinas propagate (segregated by eye) in such areas, through the parvocellular and magnocellular channels and the dorsal and ventral pathways. From the analysis of the responses of the different neuronal cells, it is possible to summarize the main functions of the visual system, realized through the cooperation of the various areas of the visual cortex.

Color perception. The selective response to color is given by the P ganglion cells, whose signal, through the parvocellular channel of the LGN, reaches the cells of the substrate IV Cβ of V1. From here it propagates to the other layers II and III of V1, into the vertically organized cells that form the blobs. From there the signal propagates to the V4 area, both directly and through the thin stripes of V2. V4 includes cells with larger receptive fields and with selective capabilities to discriminate color even under lighting changes.

Perception of form. As for color, the P ganglion cells of the retina, through the parvocellular channel of the LGN, transmit the signal to the cells of the substrate IV Cβ of V1, but it then propagates to the interblob cells of the other layers II and III of V1.


From here the signal propagates to the V4 area, both directly and via the interstripes (also known as pale stripes) of V2 (see Fig. 4.44). V4 includes cells with larger receptive fields, also with selective capabilities to discriminate orientation (as well as color).

Perception of movement. The signal from the M ganglion cells of the retina, through the magnocellular channel of the LGN, reaches the cells of the substrate IV Cα of V1. From here it propagates to the IV B layer of V1, which, as we highlighted earlier, includes complex cells that are very selective for orientation, also in relation to movement. From the layer IV B the signal propagates directly to the area V5 (MT) and through the thick stripes of V2.

Depth perception. The signals from the LGN cells that enter the IV C substrates of the V1 cortex keep the information of the two eyes segregated. Subsequently, these signals propagate to the other layers II and III of V1, and there appear, for the first time, cells with afferents coming from cells (but not M and not P cells) of both eyes, that is, binocular cells. Hubel and Wiesel classified the cells of V1 in relation to their level of excitation deriving from one eye or the other. Those deriving from the exclusive stimulation of a single eye are called ocular dominance cells, and are therefore monocular cells. The binocular cells are instead those excited by cells of the two eyes whose receptive fields simultaneously see the same area of the visual field. The contribution of the cells of a single eye can be dominant with respect to the other, or both can contribute with the same level of excitation; in the latter case we have perfectly binocular cells. With binocular cells it is possible to evaluate depth by estimating the binocular disparity (see Sect. 4.6) at the basis of stereopsis. Although the neurophysiological basis of stereopsis is still not fully known, the functionality of binocular neurons is assumed to be guaranteed by monocular cells of the two eyes stimulated through corresponding receptive fields (possibly with different viewing angles) that are as compatible as possible in terms of orientation and position with respect to the fixation point. With reference to Fig. 4.29, the area of the fixation point P generates identical receptive fields in the two eyes, stimulating a binocular cell (zero disparity) with the same intensity, while the stimulation of the two eyes will be differentiated (receptive fields slightly shifted with respect to the fovea) for stimuli deriving from the zone farther away (the point L) and the one nearer (the point V) with respect to the observer. In essence, the action potential of the corresponding monocular cells for the more distant points is higher than that for points closer to the fixation point, and this behavior becomes a property of binocular disparity. The current state of knowledge is based on the functional analysis of the neurons located in the various layers of the visual areas, their interneural connectivity, and the effects caused by lesions in one or more components of the visual system.6

6 Biological evidence has shown that the stimulation of nerve cells of the primary visual cortex with weak electrical impulses causes the subject to see elementary visual events, such as a colored spot or a flash of light, in the expected areas of the visual field. Given the one-to-one spatial correspondence between the retina and the primary visual area, a lesion of areas of the latter leads to blind areas (blind spots) in the visual field, even if some visual patterns are left unchanged. For example, the contours of a perceived object are spatially completed even if they overlap with the blind area. In humans, two associative or visual-psychic areas are located around the primary visual cortex, the parastriate area and the peristriate area. The electrical stimulation of the cells of these associative areas is found to generate the sensation of complex visual hallucinations corresponding to images of known objects or even sequences of significant actions. The lesion or surgical removal of parts of these visual-psychic areas does not cause blindness but prevents the retention of old visual experiences; moreover, it generates disturbances of perception in general, that is, the impossibility of combining individual impressions into complete structures and the inability to recognize complex objects or their pictorial representation. However, new visual learning is possible, at least until the temporal lobe is removed (ablation). Subjects with lesions in the visual-psychic areas can describe single parts of an object and correctly reproduce the contour of the object, but are unable to recognize the object as a whole. Other subjects cannot see more than one object at a time in the visual field. The connection between the visual-psychic areas of the two hemispheres is important for comparing the retinal images received by the primary visual cortex, to allow the 3D reconstruction of objects.


4.6.5 Depth Map from Binocular Vision

After having analyzed the complexity of the biological visual system of primates, let us now see how it is possible to imitate some of its functional capabilities by creating a binocular vision system for calculating the depth map of a 3D scene, locating an object in the scene, and calculating its attitude. These functions are very useful for navigating an autonomous vehicle and for various other applications (automation of robot cells, remote monitoring, etc.). Although we do not yet have a detailed knowledge of how the human visual system operates for the perception of the world, as highlighted in the previous paragraphs, it is hypothesized that different modules cooperate for the perception of color, texture, and movement, and to estimate depth. Modules of the primary visual cortex have the task of merging the images from the two eyes in order to perceive depth (through stereopsis) and to reconstruct in 3D the observed visible surface.

4.6.5.1 Correspondence Problem

One of these modules may be one based on binocular vision for depth perception, as demonstrated by Julesz. In fact, with Julesz's stereograms made of random dots, depth perception is obtained through binocular disparity alone. This implies the need to solve the correspondence problem, that is, the identification of identical elementary structures (or similar features) in the left and right retinal images that correspond to the same physical part of the observed 3D object. Considering that the pairs of images (as in biological vision) are only slightly different from each other (observation from slightly different points of view), it is plausible to think that the number of identical elementary structures present in any local region of the retina is small; it follows that similar features found in corresponding regions of the two retinas can be assumed to be homologous, that is, corresponding to the same physical part of the observed 3D object.



Fig. 4.45 Disparity map resulting from the fusion of Julesz stereograms. a and b are the left and right stereograms with different levels of disparity; c and d show, respectively, the pixels with different levels of disparity (representing four depth levels) and the red-blue anaglyph image generated from the random-dot stereo pair, from which the corresponding depth levels can be observed with red-blue glasses

Julesz demonstrated, using random-dot stereograms, that this matching process applied to random dots succeeds in finding a large number of matches (homologous points in the two images) even with very noisy random-dot images. Figure 4.45 shows another example of a random-dot stereogram with several central squares of different disparities, so that a depth map with different heights is perceived. On the other hand, when more complex images are used, the matching process produces false targets, i.e., the search for homologous points in the two images fails [18].
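To make the matching process concrete, here is a minimal sketch (not the procedure used to produce the figure) that estimates a disparity map from a rectified pair, such as a random-dot stereogram, by comparing small windows along the same image row; the window size, disparity range, and the SSD similarity measure are illustrative assumptions.

```python
# Minimal sketch: window-based matching with a sum-of-squared-differences
# (SSD) similarity, searching along the same row of a rectified stereo pair.
import numpy as np

def ssd_disparity(left, right, max_disp=16, win=5):
    """For each left-image pixel, choose the horizontal shift d that minimizes
    the SSD between a window in the left image and the shifted window in the
    right image; returns an integer disparity map."""
    h, w = left.shape
    half = win // 2
    L, R = left.astype(np.float32), right.astype(np.float32)
    disp = np.zeros((h, w), dtype=np.int32)
    for r in range(half, h - half):
        for c in range(half + max_disp, w - half):
            patch = L[r - half:r + half + 1, c - half:c + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(max_disp + 1):
                cand = R[r - half:r + half + 1, c - d - half:c - d + half + 1]
                cost = float(np.sum((patch - cand) ** 2))
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[r, c] = best_d
    return disp
```

On texture-rich pairs such as random-dot stereograms this simple scheme recovers the shifted squares well; on more complex images it produces the false targets discussed next.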


Fig. 4.46 Correspondence problem: ambiguity in finding homologous points in two images so that each point of the pair is the unique projection of the same 3D point


In these cases the observer can find himself in the situation represented in Fig. 4.46, where both eyes see three points but the correspondence between the two retinal projections is ambiguous. In essence, we have the correspondence problem: how do we establish the true correspondence between the three points seen by the left retina and the three seen by the right retina, which are possible projections of the nine points present in the field of view? Nine candidate matches are plausible, and the observer could see different depth planes corresponding to the perceived false targets. Only three matches are correct (colored squares), while the remaining six are generated by false targets (false counterparts), indicated with black squares. To solve the problem of ambiguous correspondences (a basically ill-posed problem), Julesz suggested considering global constraints in the correspondence process: for example, taking as candidates for correspondence structures more complex than simple points, such as segments of oriented contours or particular textures; considering some physical constraints of 3D objects; or imposing constraints on how homologous structures are searched for in the two images (for example, searching for structures only along horizontal lines). This stereo vision process was called by Julesz global stereo vision, and it is probably based on a more complex neural process that selects, in the images, local structures composed of elements with the same disparity. For the perception of more extended depth intervals, the human visual system uses the movement of the eyes. The global stereo vision mechanism introduced by Julesz is not inspired by neurophysiology but uses the physical phenomenon associated with the magnetic dipole, consisting of two point-like magnetic masses of equal value and opposite polarity placed at a small distance from each other. This model also includes the hysteresis phenomenon. In fact, once Julesz's stereograms have been fused by the observer, the disparity can be increased up to twenty times the limit of Panum's fusion area (the range within which two stereo images can be fused, normally 6-18 minutes of arc) without losing the sensation of stereoscopic vision.


In analogy with the magnetic dipole mechanism, fusion is based on the attraction generated between opposite poles: once they come into contact, it becomes difficult to separate them. The hysteresis phenomenon has influenced various models of stereo vision, including the cooperativity among local measures to reach global stereo vision.

4.6.6 Computational Model for Binocular Vision

The estimate of the distance of an object from the observer, i.e., the perceived depth, is determined in two phases: first, the disparity value is calculated (having solved the correspondence of the homologous points) and, subsequently, this measurement is used, together with the geometry of the stereo system, to calculate the depth. Following the Marr paradigm, these phases must include three levels for estimating disparity: the level of computational theory, the level of algorithms, and the level of algorithm implementation. Marr, Poggio, and Grimson [19] have developed all three levels, inspired by human stereo vision. Several researchers subsequently applied some ideas of the computational stereo vision models proposed by Marr and Poggio to develop artificial stereo vision systems. In the previous paragraph we examined the elements of uncertainty in the estimate of disparity, known as the correspondence problem. Any computational model chosen will have to minimize this problem, i.e., correctly search for homologous points in the stereo images through a similarity measure that estimates how similar such homologous points (or structures) are. In the computational model of Marr and Poggio [20], different constraints (based on physical considerations) are considered to reduce the correspondence problem as much as possible. These constraints are the following:

Compatibility. The homologous points of the stereo images must have a very similar intrinsic physical structure if they represent the 2D projection of the same point (local area) of the visible surface of the 3D object. For example, in the case of random-dot stereograms, homologous candidate points are either both black or both white.

Uniqueness. A given point on the visible surface of the 3D object has a unique position in space at any time (static objects). It follows that a point (or structure) in one image has only one homologous point in the other image, that is, it has only one candidate match: this is the uniqueness constraint.

Continuity. The disparity varies smoothly in any area of the stereo image. This constraint is motivated by the physical coherence of the visible surface, which is assumed to vary without abrupt discontinuities. It is obviously violated in the areas of discontinuity of the object surface and, in particular, at the contours of the object.

Epipolarity. Homologous points must lie on the same line (the epipolar line) in the two stereograms.
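For the second phase, converting disparity into depth using the stereo geometry, the following is a minimal sketch that assumes a rectified, parallel-axis stereo rig (an assumption about the setup, not a detail given in the text), for which depth follows the usual triangulation relation Z = f B/d, with f the focal length in pixels, B the baseline, and d the disparity in pixels.

```python
# Minimal sketch, assuming a rectified, parallel-axis stereo rig:
# depth Z = f * B / d, with f in pixels, B in meters, d in pixels.
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    depth = np.full(disparity.shape, np.inf, dtype=np.float32)
    valid = disparity > 0          # zero disparity corresponds to points at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Example: with f = 700 px and B = 0.065 m (roughly the interocular distance),
# a disparity of 14 px corresponds to a depth of 700 * 0.065 / 14 = 3.25 m.
```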

To solve the problem of correspondence in stereo vision by using these constraints, Marr and Poggio developed (following the Marr paradigm) two stereo vision algorithms: one based on cooperativity and the other based on the fusion of structures from coarse to fine (called the coarse-to-fine control strategy). These algorithms have been tested with random-dot stereograms.

The first algorithm uses a neural network to implement the three previous stereo vision constraints. When applied to random-dot stereograms, these correspondence constraints require compatibility (black points matched with black points, white points with white points), uniqueness (a single correspondence for each point), and continuity (the disparity remains constant or varies slightly in any area of a stereogram). The functional scheme of the neural network is a competitive one, in which there is one unit for each possible disparity between homologous points of the stereograms. Each neural unit represents a point of the visible surface, or a small surface element, at a certain depth. Each excited neural unit can, in turn, excite or inhibit the neural activity of other units, and has its own activity increased or decreased, in turn, by the excitation or inhibition it receives from the other units. Marr and Poggio have shown how the correspondence constraints are satisfied in terms of excitation and inhibition in this neural model. The compatibility constraint implies that a neural unit will initially be active if it is excited by similar structures from both stereograms, for example, homologous points that are both black or both white. The uniqueness constraint is incorporated through the inhibition that takes place between units lying along the same line of sight, i.e., units that represent different disparity values for the same structure (homologous structures) inhibit each other. The continuity constraint is incorporated by having excitation propagate between units that represent different structures (nonhomologous structures) at the same disparity. It follows that the network operates so that unique matches that preserve structures at the same disparity are favored.

Figure 4.47 shows the results of this cooperative algorithm. The stereograms given as input to the neural network are shown, together with the initial state, indicated with 0, which includes all possible matches within the predefined disparity interval. The algorithm performs several iterations, producing increasingly precise depth maps that highlight, with different gray levels, the structures with different disparity values. The activity of the neural network ends when it reaches a stable state, i.e., when the neural activity of each unit is no longer modified in the last iterations. This algorithm provides a clear example for understanding the nature of human vision, even if it does not fully explain the mechanisms of human stereo vision.

Fig. 4.47 Results of the Marr-Poggio cooperative stereo algorithm. The initial state of the network, which includes all possible matches within a predefined disparity range, is indicated with the map 0. With the evolution of the iterations, the geometric structure present in the random-dot stereograms emerges, and the different disparity values are represented with gray levels
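To make the cooperative scheme described above concrete, the following is a minimal illustrative sketch, not the authors' original implementation, of a Marr-Poggio-style cooperative update applied to a single pair of rows of a random-dot stereogram; the disparity range, the excitation/inhibition weights, the neighborhood size, and the threshold are assumptions chosen only for illustration.

import numpy as np

def cooperative_stereo_1d(left, right, d_max=3, n_iter=10,
                          excit=2.0, inhib=1.0, threshold=3.0):
    # left, right: 1D binary arrays (one row of each random-dot stereogram).
    # Returns a boolean matrix C[x, d] that is True where a match at
    # disparity (d - d_max) is retained for position x.
    n = len(left)
    n_disp = 2 * d_max + 1
    C = np.zeros((n, n_disp), dtype=float)
    # Initial state: compatibility constraint (same value in both rows).
    for di in range(n_disp):
        d = di - d_max
        for x in range(n):
            xr = x + d
            if 0 <= xr < n and left[x] == right[xr]:
                C[x, di] = 1.0
    for _ in range(n_iter):
        new_C = np.zeros_like(C)
        for di in range(n_disp):
            for x in range(n):
                # Continuity: excitation from neighbors at the same disparity.
                support = C[max(0, x - 2):x + 3, di].sum() - C[x, di]
                # Uniqueness: inhibition from other disparities at the same position.
                rivals = C[x, :].sum() - C[x, di]
                new_C[x, di] = C[x, di] + excit * support - inhib * rivals
        # Threshold the accumulated support to obtain the next (binary) state.
        C = (new_C >= threshold).astype(float)
    return C.astype(bool)

In the same spirit as the original network, matches supported by neighbors at the same disparity survive the iterations, while mutually exclusive matches along the same line of sight suppress each other.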
The second algorithm of Marr and Poggio aims to attenuate the number of false targets, that is, points observed as homologous but belonging to different physical points of the object. The false targets vary in relation to the number of points present in the images that are considered candidates for matching, and in relation to the disparity interval within which it is plausible to verify the correspondence of the points. Another element that induces false targets is the different contrast that can exist between the stereo images. From these considerations emerges the need to reduce as much as possible the number of candidate correspondences and to become invariant with respect to the possible contrast differences between the stereograms. The functional scheme of this second algorithm is the following:

1. Each stereo image is analyzed with channels at different spatial resolutions, and the comparison is made between the elementary structures present in the images associated with the same channel, for disparity values that depend on the resolution of the channel considered.

2. The disparity estimates calculated with the coarse-resolution channels can be used to guide the vergence movements of the eyes, so as to align the elementary structures and then compare the disparities of the channels with finer resolution in order to find the correct match.

3. Once the correspondence associated with a certain resolution is determined, the disparity values are kept in a disparity map (2.5D sketch) acting as a memory buffer (the function of this memory suggested to Marr and Poggio the attribution of the hysteresis phenomenon).

The stereo vision process with this algorithm begins by analyzing the images with coarse-resolution channels, which generate elementary structures well separated from each other; the matching process is then guided toward the corresponding channels with finer resolution, thus improving the robustness in determining the homologous points. The novelty of this second algorithm consists in selecting, as elementary structures for comparison in the two images, contour points; in particular, Marr and Poggio used the zero crossings, characterized by the sign of the contrast variation and by their local orientation. Grimson [19] implemented this algorithm using random-dot stereograms at 50% density. The stereograms are convolved with the LoG filter (Laplacian of Gaussian), with different values of σ, to produce multi-channel stereograms with different spatial resolutions. Remember the relation W = 2 × 3√2 σ, introduced in Sect. 1.13 Vol. II, which links the dimension W of the convolution mask to σ, which controls the smoothing effect of the LoG filter. In Fig. 4.48 the three convolutions of the stereograms, obtained with square masks of sizes 35, 17, and 9 pixels, respectively, are shown. The zero crossings obtained from the convolutions are also shown, and it can be observed how the structures become more and more detailed as the convolution filter gets smaller. Zero crossing points are considered homologous in the two images if they have the same sign and their local orientations remain within an angular difference not exceeding 30°. The matching of the zero crossings starts with the coarse channels, and the resulting disparity map is very coarse. Starting from this rough disparity map, the matching process analyzes the images convolved with the medium-sized filter, and the resulting disparity map is more detailed and precise. The process continues using this intermediate disparity map, which guides the matching of the zero crossings in the last (finest) channel, obtaining a final disparity map with the finest resolution and the highest density.

Fig. 4.48 Zero crossings obtained through multiscale LoG filtering applied to the pair of random-dot stereo images of Fig. 4.45 (first column: left image, right image, zero crossings). The other columns show the results of the filtering performed at different scales, applying square LoG convolution masks of sizes 35, 17, and 9 pixels, respectively

The compatibility constraint is satisfied by considering as candidates for comparison only the zero crossing structures that have the same sign and local orientation. The larger filter produces few zero crossing candidates, because of the stronger smoothing of the Gaussian component of the filter, and only the structures with strong intensity variations are maintained (coarse channel). In these conditions, the comparison concerns the zero crossing structures that lie, in the stereo images, within a predefined disparity interval (the Panum fusion interval, which depends on the binocular system used) and within the width W of the filter used which, as we know, depends on the parameter σ of the LoG filter. The constraints of the comparison, together with the quantitative relationship that links these parameters (filter width and predefined disparity range), make it possible to optimize the number of correct matches between homologous zero crossing structures, reducing both false negatives (homologous zero crossings that exist but are not detected) and false positives (a zero crossing wrongly accepted as homologous, for example because the candidate found in the other image is generated by noise, or because the true homologous point is occluded and another, nonhomologous zero crossing is chosen instead; see Fig. 4.49). Once the candidate homologous points are found in the pair of images with the low spatial resolution channels, these constitute the constraints for searching the zero crossing structures in the images filtered at higher resolution, and the search for the homologous zero crossings must take place within a disparity range that is twice the width W of the current filter.

Fig. 4.49 Scheme of the matching process to find homologous zero crossing points in stereo images. a A zero crossing L in the left image has a high probability of finding its homologue R, with disparity d, in the right image if d < W/2. b Another possible configuration is to find a counterpart in the whole range W, or a false counterpart F with 50% probability, while the true homologue R still remains to be found. c To disambiguate the false homologues, the comparison between the zero crossings is made first from the left image to the right one and then vice versa, obtaining that L2 can have R2 as homologue, while R1 has L1 as homologue

Consider a zero crossing L at a given position in the left image (see Fig. 4.49a) and another, R, in the right image that is homologous to L (i.e., it has the same sign and orientation), with a disparity value d. A possible false match near R is indicated with F. From the statistical analysis it is shown that R is the counterpart of L within the range ±W/2 with a probability of 95%, if the maximum disparity is d = W/2. In other words, given a zero crossing at some position in the filtered image, it has been shown that the probability of the existence of another zero crossing in the range ±W/2 is 5%. If the correct disparity is not in the range ±W/2, the probability of a match is 40%. For d > W/2 and d ≤ W we have the same probability of 95% that R is the only homologous candidate for L, with disparity from 0 to W, if the value of d is positive (see Fig. 4.49b). It is also statistically determined that there is a 50% probability of a false correspondence over the 2W disparity interval between d = −W and d = W. This means that 50% of the times there is ambiguity in determining the correct match, both in the disparity interval (0, W) (convergent disparity) and in the interval (−W, 0) (divergent disparity), where only one of the two cases is correct. If d is around zero, the probability of a correct match is 90%. Therefore, from the figure, F is a false match candidate with a probability of 50%, but the possible match with R also remains. To determine the correct match, Grimson proposed a matching procedure that first compares the zero crossings from the left to the right image, and then from the right to the left image (see Fig. 4.49c). In this case, starting the comparison from the left to the right image, L1 can ambiguously correspond to R1 or to R2, but L2 has only R2 as its counterpart. From the right-hand side, the correspondence is unique for R1, which matches only L1, but is ambiguous for R2. Combining the two situations, the two unique matches provide the correct solution (constraint of uniqueness). It is shown that if more than 70% of the zero crossings match in the range (−W, +W), then the disparity interval is correct (i.e., the continuity constraint is satisfied).

Fig. 4.50 Results of the second Marr-Poggio algorithm (map of disparity) applied to the random-dot stereograms of Fig. 4.48 with four levels of depth

Figure 4.50 shows the results of this algorithm applied to the random-dot stereograms of Fig. 4.48 with four levels of depth. As previously indicated, the disparity values found in the intermediate steps of this coarse-to-fine matching process are saved in a temporary memory buffer, also called the 2.5D sketch map. For Marr and Poggio, the function of this temporary memory of the correspondence process is the equivalent of the hysteresis phenomenon initially proposed by Julesz to explain the biological fusion process. This algorithm does not fully account for all the psycho-biological evidence of human vision. This computational model of stereo vision has been revised by other researchers, and others have proposed different computational modules in which the matching process is integrated with the extraction process of the elementary structures that are candidates for comparison (primal sketch). These latter computational models contrast with Marr's idea of stereo vision, which keeps the early vision modules for the extraction of elementary structures separate.
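As a complement to the description of Grimson's implementation, the following is a minimal sketch, under stated assumptions, of how LoG zero crossings can be extracted from an image at several scales using SciPy; the σ values are derived from the mask widths quoted above through the relation W ≈ 2 × 3√2 σ, and the simple sign-change test is an illustrative choice, not the original code.

import numpy as np
from scipy.ndimage import gaussian_laplace

def log_zero_crossings(img, sigma):
    # Boolean map of the zero crossings of the LoG response at scale sigma.
    log = gaussian_laplace(img.astype(float), sigma)
    zc = np.zeros(log.shape, dtype=bool)
    # Mark a zero crossing where the LoG response changes sign between
    # horizontally or vertically adjacent pixels.
    zc[:, :-1] |= (log[:, :-1] * log[:, 1:]) < 0
    zc[:-1, :] |= (log[:-1, :] * log[1:, :]) < 0
    return zc

# Coarse-to-fine channels: sigmas corresponding to masks of 35, 17, 9 pixels.
sigmas = [w / (2 * 3 * np.sqrt(2)) for w in (35, 17, 9)]
# channels = [log_zero_crossings(left_image, s) for s in sigmas]  # per image

Matching would then proceed from the coarsest channel to the finest, restricting at each step the admissible disparity range as described above.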

4.6.7 Simple Artificial Binocular System

Figure 4.51 shows a simplified diagram of a monocular vision system: the 3D scene is projected onto the 2D image plane, essentially reducing the original information of the scene by one dimension. This loss of information is caused by the perspective nature of the projection, which makes the apparent size of a geometric structure of an object ambiguous in the image plane: it appears of the same size regardless of whether it is nearer to or farther from the capture system. This ambiguity depends on the inability of the monocular system to recover the information lost with the perspective projection process. To solve this problem, in analogy with human binocular vision, the artificial binocular vision scheme shown in Fig. 4.52 is proposed, consisting of two cameras located in slightly different positions along the X axis. The acquired images are called stereoscopic image pairs or stereograms. A stereoscopic vision system produces a depth map, that is, the distance between the cameras and the visible points of the scene projected onto the stereo image planes.

Fig. 4.51 Simplified scheme of a monocular vision system. Each pixel in the image plane captures the light energy (irradiance) reflected by a surface element of the object, in relation to the orientation of this surface element and the characteristics of the optical system

Fig. 4.52 Simplified diagram of a binocular vision system with parallel and coplanar optical axes. a 3D representation with the Z axis parallel to the optical axes; b projection onto the X−Z plane of the binocular system


The gray level of each pixel of the stereo images is related to the light energy reflected by the visible surface projected onto the image plane, as shown in Fig. 4.51. With binocular vision, part of the 3D information of the visible scene is recovered through the gray-level information of the pixels and through the triangulation process, which uses the disparity value for depth estimation. Before proceeding to the formal calculation of the depth, we analyze some geometric notations of stereometry. Figure 4.52 shows the simplest geometric model of a binocular system, consisting of two cameras with parallel and coplanar optical axes separated by a distance b, called the baseline, in the direction of the X axis. In this geometry, the two image planes are also coplanar, at the focal distance f with respect to the optical center of the left lens, which is the origin of the stereo system. An element P of the visible surface is projected by the two lenses onto the respective retinas in PL and PR. The plane passing through the optical centers CL and CR of the lenses and the visible surface element P is called the epipolar plane. The intersection of the epipolar plane with the plane of the retinas defines the epipolar line. The Z axis coincides with the optical axis of the left camera. The stereo images are also vertically aligned, and this implies that each element P of the visible surface is projected onto the two retinas maintaining the same vertical coordinate Y. The constraint of the epipolar line implies that the stereo system does not present any vertical disparity. Two points found in the two retinas along the same vertical coordinate are called homologous points if they derive from the perspective projection of the same element P of the visible surface. The disparity measure is obtained by superimposing the two retinas and calculating the horizontal distance between the two homologous points.

4.6.7.1 Depth Calculation

With reference to Fig. 4.52a, let P be the visible 3D surface element of coordinates (X, Y, Z), considering the origin of the stereo system coinciding with the optical center of the left camera. From the comparison of the similar triangles MPC_L and C_L P_L C'_L in the XZ plane, we can observe that the ray passing through P and crossing the center of the lens C_L intersects the plane of the retinas Z = −f at the point P_L of the left retina, whose horizontal coordinate X_L is obtained from the following relation:

\frac{X}{Z} = \frac{-X_L}{f}     (4.1)

from which

X_L = -X \frac{f}{Z}     (4.2)

Similarly, considering the vertical plane YZ, from the comparison of similar triangles, the vertical coordinate Y_L of the same point P projected on the left retina in P_L is given by

Y_L = -Y \frac{f}{Z}     (4.3)


Similar equations are obtained by considering the comparison of the similar triangles PNC_R and C_R P_R C'_R in the XZ plane, where the ray passing through P and crossing the center of the lens C_R intersects the plane of the right retina at the point P_R, whose horizontal and vertical coordinates are obtained from the following relation:

\frac{X - b}{Z} = \frac{b - X_R}{f}     (4.4)

from which

X_R = b - \frac{X - b}{Z} f     (4.5)

where we recall that b is the baseline (separation distance of the optical axes). In the vertical plane YZ, in a similar way, the vertical coordinate Y_R is calculated as

Y_R = -Y \frac{f}{Z}     (4.6)

From the geometry of the binocular system (see Fig. 4.52b) we observe that the depth of P coincides with the value Z, that is, the distance of P from the plane passing through the two camera optical centers and parallel to the image plane.⁷ To simplify the derivation of the equation calculating the value of Z for each point P of the scene, it is convenient to consider the coordinates of the projection of P in the local reference systems of the respective left and right image planes. These new reference systems have their origin in the center of each retina, with respect to which the coordinates of the projections of P are obtained by a simple translation from the global coordinate X to the local ones xL and xR, respectively, for the left and the right retina. It follows that the local coordinates xL and xR, having their origin in the center of the respective retinas, can also assume negative values together with the global coordinate X, while Z is always positive. In Fig. 4.52b the global Y axis and the local coordinate axes yR and yL have not been drawn because they are perpendicular to and exiting from the page. Considering the new local coordinates for each retina, and the relations derived from the same similar right triangles (MPC_L and C_L P_L C'_L for the left retina, and NPC_R and C_R P_R C'_R for the right retina), the following relationships are found:

\frac{-x_L}{f} = \frac{X}{Z} \qquad \frac{-x_R}{f} = \frac{X - b}{Z}     (4.7)

from which it is possible to derive (and similarly for the coordinates yL and yR) the new equations for the calculation of the horizontal coordinates xL and xR:

x_L = -\frac{X}{Z} f \qquad x_R = -\frac{X - b}{Z} f     (4.8)

7 In Fig. 4.52b the depth of the point P is indicated with Z_P, but in the text we will continue to indicate with Z the depth of a generic point of the object.


By eliminating X from (4.8) and solving with respect to Z, we obtain the following relation:

Z = \frac{b \cdot f}{x_R - x_L}     (4.9)

which is the triangulation equation for calculating the perpendicular distance (depth) for a binocular system with the geometry defined in Fig. 4.52a, that is, with the constraint of parallel optical axes and with the projections PL and PR lying on the epipolar line. We can see that in (4.9) the distance Z is related only to the disparity value (xR − xL) induced by the observation of the point P of the scene, and is independent of the reference system of the local coordinates, i.e., of the absolute values of xR and xL. Recall that the parameter b is the baseline, that is, the separation distance between the optical axes of the two cameras, and f is the focal length, identical for the optics of the two cameras. Furthermore, b and f have positive values. In (4.9) the value of Z must be positive, and consequently the denominator requires xR ≥ xL. The geometry of the human binocular vision system is such that the numerator b·f of (4.9) assumes values in the range (390−1105 mm²), considering the interval (6−17 mm) for the focal length of the crystalline lens (corresponding, respectively, to the vision of closer objects, at about 25 cm, with contracted ciliary muscle, and to the vision of more distant objects with relaxed ciliary muscle) and the baseline b = 65 mm. Associated with the corresponding range Z = 0.25−100 m, the disparity interval xR − xL would result in (2−0.0039 mm). The denominator of (4.9) tends to assume very small values when calculating large values of depth Z (for (xR − xL) → 0 ⇒ Z → ∞). This can determine a non-negligible uncertainty in the estimate of Z. For a binocular vision system, the uncertainty of the estimate of Z can be limited by using cameras with good spatial resolution (not less than 512 × 512 pixels) and by minimizing the error in the estimation of the position of the elementary structures detected in the stereo images as candidate homologous structures. These two aspects can easily be addressed considering the availability of HD cameras (resolution 1920 × 1080 pixels) equipped with chips with photoreceptors of 4 µm. For example, for a pair of these cameras, configured with a baseline of b = 120 mm and optics with focal lengths of 15 mm, to detect an object at a distance of 10 m, by (4.9) the corresponding disparity would be 0.18 mm, which in terms of pixels corresponds to several tens (an adequate resolution to evaluate the position of homologous structures in the two HD stereo images).

Let us return to Fig. 4.52 and use the same similar right-angled triangles in the 3D context, of which P_L C_L and C_L P are the hypotenuses. We can get the following expression:

\frac{D}{Z} = \frac{\overline{P_L C_L}}{f} \;\Longrightarrow\; \frac{D}{Z} = \frac{\sqrt{f^2 + x_L^2 + y_L^2}}{f}     (4.10)


in which Z can be replaced using Eq. (4.9) (perpendicular distance of P), obtaining

D = \frac{b \sqrt{f^2 + x_L^2 + y_L^2}}{x_R - x_L}     (4.11)

which is the equation of the Euclidean distance D of the point P in the three-dimensional reference system, whose origin always coincides with the optical center CL of the left camera. When calibrating the binocular vision system, if it is necessary to verify the spatial resolution of the system, it may be convenient to use Eq. (4.9) or (4.11) to predict, given the known positions of points P in space and the position xL in the left retina, what the value of the disparity should be, i.e., to estimate xR, the position of the point P when projected onto the right retina. In some applications it is important to evaluate well the constant b·f (of Eq. 4.9), linked to the intrinsic parameter of the focal length f of the lens and to the extrinsic parameter b, which depends on the geometry of the system.
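For a quick numerical check of the triangulation relation (4.9), this minimal snippet reproduces the HD-camera example quoted above (the values are those given in the text; the 4 µm pixel pitch is used only to express the disparity in pixels).

# Disparity predicted by Eq. (4.9) for the HD-camera example.
b_mm = 120.0        # baseline
f_mm = 15.0         # focal length
Z_mm = 10_000.0     # object distance (10 m)

disparity_mm = b_mm * f_mm / Z_mm             # x_R - x_L = b*f/Z  -> 0.18 mm
pixel_pitch_mm = 0.004                        # 4 micrometre photoreceptors
disparity_px = disparity_mm / pixel_pitch_mm  # -> 45 pixels ("several tens")
print(disparity_mm, disparity_px)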

4.6.7.2 Intrinsic and Extrinsic Parameters of a Binocular System

The intrinsic parameters of a digital acquisition system characterize the optics of the system (for example, the focal length and the optical center), the geometry of the optical system (for example, the radial geometric distortions introduced), and the geometric resolution of the image area, which depends on the digitization process and on the transformation from plane-image coordinates to pixel-image coordinates (described in Chap. 5 Vol. I Digitization and Image Display). The extrinsic parameters of an acquisition system are instead the parameters that define the structure (position and orientation) of the cameras (or, in general, of the optical systems) with respect to an external 3D reference system. In essence, the extrinsic parameters describe the geometric transformation (for example, translation, rotation, or roto-translation) that relates the coordinates of known points in the world and the coordinates of the same points with respect to the acquisition system (unknown reference). For the considered binocular system, the baseline b constitutes an extrinsic parameter. The activity of estimating the intrinsic and extrinsic parameters, known as calibration of the binocular vision system, consists of first identifying some known points P (in the 3D world) in the two retinas and then evaluating the disparity value xR − xL and the distance Z of these known points using other measuring systems. Solving (4.9) with respect to b or f, it is then possible to verify the correctness of these parameters. From the analysis of the calibration results of the binocular system considered, it can be observed that increasing the baseline value b can improve the accuracy in the estimate of Z, with a consequent increase of the disparity values. A good compromise must be chosen between the value of b and the width of the visible area of the scene seen by both cameras (see Fig. 4.53a). In essence, increasing the baseline decreases the number of observable points in the scene. Another aspect that must be considered is the diversity of the two acquired stereo images, due to the distortion introduced and to the perspective projection. This difference between the stereo images increases as b increases, to the disadvantage of the stereo fusion process, which aims to find, in the stereo images, the homologous points deriving from the same point P of the scene.

Fig. 4.53 Field of view in binocular vision. a In systems with parallel optical axes, the field of view decreases as the baseline increases, but a consequent increase in accuracy is obtained in determining the depth. b In systems with converging optical axes, the field of view decreases as the vergence angle and the baseline increase, but the level of depth uncertainty decreases

Proper calibration is strategic when the vision system has to interact with the world to reconstruct the 3D model of the scene and when it has to localize itself with respect to it (for example, an autonomous vehicle or a robotic arm must self-locate with adequate accuracy). Some calibration methods are well described in [21–24]. According to Fig. 4.52, the equations for reconstructing the 3D coordinates of each point P(X, Y, Z) visible from the binocular system (with parallel and coplanar optical axes) are summarized as follows:

Z = \frac{b \cdot f}{x_R - x_L} \qquad X = x_L \frac{Z}{f} = \frac{x_L \cdot b}{x_R - x_L} \qquad Y = y_L \frac{Z}{f} = \frac{y_L \cdot b}{x_R - x_L}     (4.12)
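A minimal code sketch of Eq. (4.12) is given below; it assumes rectified images, image coordinates already expressed in metric units with respect to the centers of the two retinas, and a known baseline and focal length (illustrative assumptions).

import numpy as np

def reconstruct_point(x_l, y_l, x_r, b, f):
    # Eq. (4.12): 3D coordinates of P from a pair of homologous points.
    # x_l, y_l: coordinates of P_L in the left image (metric units);
    # x_r: horizontal coordinate of P_R; b, f: baseline and focal length.
    d = x_r - x_l                        # horizontal disparity
    if np.isclose(d, 0.0):
        raise ValueError("zero disparity: point at (near) infinite depth")
    Z = b * f / d
    X = x_l * Z / f                      # equivalently x_l * b / d
    Y = y_l * Z / f
    return X, Y, Z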

4.6.8 General Binocular System

To mitigate the limit situations indicated above with the stereo geometry proposed in Fig. 4.52 (to be used when Z ≫ b), it is possible to use a different configuration of the cameras, arranging them with converging optical axes, that is, inclined toward a fixation point P that is at a finite distance from the stereo system, as shown in Fig. 4.54. With this geometry, the points of the scene projected onto the two retinas lie along the lines of intersection (the epipolar lines) between the image planes and the epipolar plane, which includes the point P of the scene and the two optical centers CL and CR of the two cameras, as shown in Fig. 4.54a. It is evident that with this geometry the epipolar lines are no longer horizontal, as they were with the previous stereo geometry in which the optical axes of the cameras were parallel and coplanar.


Fig. 4.54 Binocular system with converging optical axes. a Epipolar geometry: the baseline intersects each image plane at the epipoles eL and eR. Any plane containing the baseline is called an epipolar plane and intersects the image planes at the epipolar lines lL and lR. In the figure, the epipolar plane considered is the one passing through the fixation point P. As the 3D position of P changes, the epipolar plane rotates around the baseline and all the epipolar lines pass through the epipoles. b The epipolarity constraint imposes the coplanarity, in the epipolar plane, of the point P of 3D space, of the projections PL and PR of P in the respective image planes, and of the two optical centers CL and CR. It follows that a point PL of the left image is projected backward into 3D space, through the center CL, along the ray CL PL. The image of this ray projected into the right image corresponds to the epipolar line lR, on which to search for PR, the homologue of PL

Furthermore, the intersections of an epipolar plane with the two image planes always produce a corresponding pair of epipolar lines. The potential homologous points, i.e., the projections of P onto the two retinas in PL and PR, lie on the corresponding epipolar lines lL and lR because of the epipolarity constraint. The baseline b is always the line joining the optical centers, and the epipoles eL and eR of the optical systems are the intersection points of the baseline with the respective image planes. The right epipole eR is the virtual image of the left optical center CL observed in the right image and, vice versa, the left epipole eL is the virtual image of the optical center CR. Once the intrinsic and extrinsic parameters of the binocular system are known (calibrated system) and the epipolarity constraints are available, the correspondence problem is simplified by restricting the search for the homologue of PL (supposedly known) to the associated epipolar line lR, coplanar with the epipolar plane determined by PL, CL, and the baseline (see Fig. 4.54b). Therefore, the search is restricted to the epipolar line lR and not to the entire right image. For a binocular system with converging optical axes, the binocular triangulation method (also called binocular parallax) can still be used for the calculation of the coordinates of P, but the previous Eq. (4.12) is no longer valid, having essentially assumed a binocular system with the fixation point at infinity (parallel optical axes). In fact, in this geometry, instead of calculating the linear disparity, it is necessary to calculate the angular disparities θL and θR, which depend on the angle of convergence ω of the optical axes of the system (see Fig. 4.55). In analogy with human vision, the optical axes of the two cameras intersect at a point F of the scene (fixation point) at the perpendicular distance Z from the baseline.

Fig. 4.55 Calculation of the angular disparity

We know that with stereopsis (see Sect. 4.6.3) we get the perception of relative depth if, simultaneously, another point P is seen nearer or farther with respect to F (see Fig. 4.55a). In particular, we know that all points located around the horopter stimulate stereopsis, caused by the retinal disparity (the local difference between the retinal images due to the different observation point of each eye). The disparity at the point F is zero, while there is an angular disparity for all points outside the horopter curve, each presenting a different vergence angle β. Analyzing the geometry of the binocular system of Fig. 4.55a, it is possible to define the binocular disparity in terms of an angular disparity δ as follows:

\delta = \alpha - \beta = \theta_R - \theta_L     (4.13)

where α and β are the vergence angles subtended, with respect to the optical axes, by the fixation point F and by the point P outside the horopter curve, and θL and θR are the angles included between the retinal projections, in the left and right camera, of the fixation point F and of the target point P. The functional relationship that binds the angular disparity δ (expressed in radians) and the depth ΔZ is obtained by applying elementary geometry (see Fig. 4.55b). Considering the right-angled triangles with base b/2, where b is the baseline, from trigonometry it results that b/2 = Z tan(α/2). For small angles we can approximate tan(α/2) ≈ α/2. Therefore, the angular disparity between PL and PR is obtained by applying (4.13) as follows:

\delta = \alpha - \beta = \frac{b}{Z} - \frac{b}{Z + \Delta Z} = \frac{b \cdot \Delta Z}{Z^2 + Z \cdot \Delta Z}     (4.14)

For small offsets ΔZ from the fixation point (less than 1 meter) and, in any case, with depth values ΔZ very small compared to Z, the second term in the denominator of (4.14) becomes negligible and the expression of the angular disparity simplifies to

\delta = \frac{b \, \Delta Z}{Z^2}     (4.15)

The same result for the angular disparity is obtained if we consider the difference between the angles θL and θR, in this case taking into account the sign of the angles.
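As a small numerical illustration of Eqs. (4.14) and (4.15), using as assumed inputs the human-vision values quoted later in this section:

def angular_disparity(b, Z, dZ, exact=True):
    # Eq. (4.14) if exact, otherwise the small-offset approximation (4.15).
    if exact:
        return b * dZ / (Z**2 + Z * dZ)
    return b * dZ / Z**2

b, Z, dZ = 0.065, 1.2, 0.1                         # metres
print(angular_disparity(b, Z, dZ, exact=False))    # ~0.0045 rad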


The estimate of the depth ΔZ and of the absolute distance Z of an object with respect to a known reference object (fixation point), on which the binocular system converges with the angles ω and ψ, can be calculated considering different angular reference coordinates, as shown in Fig. 4.55c. Knowing the angular configuration ω and ψ with respect to the reference object F, the absolute distance Z of F with respect to the baseline b is given by⁸:

Z \approx b \, \frac{\sin\omega \, \sin\psi}{\sin(\omega + \psi)}     (4.16)

while the depth ΔZ is given by

\Delta Z \approx b \, \frac{\Delta\omega + \Delta\psi}{2}     (4.17)

where Δω and Δψ are the angular offsets of the left and right image planes needed to align the binocular system with P starting from the initial reference configuration. In human vision, considering the baseline b = 0.065 m and fixating an object at the distance Z = 1.2 m, for an object ΔZ = 0.1 m away from the fixated one, applying (4.15) we would have an angular disparity of δ = 0.0045 rad. A person with normal vision is able to pass a thread through the eye of a needle fixated at Z = 0.35 m, working around the eyelet with a resolution of ΔZ = 0.1 mm. The visual capacity of human stereopsis is such as to perceive depths, around the point of fixation, of fractions of millimeters, requiring, according to (4.15), a resolution of the angular disparity of δ = 0.000053 rad = 10.9 s of arc. Fixating an object at a distance of Z = 400 m, the depth around this object is no longer perceptible (the background is perceived as flattened), as the resolution of the required angular disparity would be very small, less than 1 s of arc. In systems with converging optical axes, the field of view decreases as the vergence angle and the baseline increase, but the level of depth uncertainty decreases (see Fig. 4.53b). Active and passive vision systems have been experimented with to estimate the angular disparity together with other parameters of the system (position and orientation of the cameras). These parameters are evaluated and dynamically checked for the calculation of the depth of various points in the scene. The estimation of the position and orientation of the cameras requires their calibration (see Chap. 7, Camera Calibration and 3D Reconstruction).

8 From Fig. 4.55c it is observed that Z = AF · sin ω, where AF is calculated remembering the theorem of sines, for which b/sin(π − ω − ψ) = AF/sin ψ, and that the sum of the inner angles of the acute triangle AFB is π. Therefore, solving with respect to AF and substituting, we obtain Z = b sin ψ sin ω / sin[π − (ω + ψ)] = b sin ψ sin ω / sin(ω + ψ).


If the positions and orientations of the cameras are known (calculated, for example, with active systems), the reconstruction of the 3D points of the scene is realized with the roto-translation of the points PL = (xL, yL, zL) (projection of P(X, Y, Z) onto the left retina), which are projected into PR = (xR, yR, zR) on the right retina with the transformation PR = R·PL + T, where R and T are, respectively, the 3 × 3 rotation matrix and the translation vector that allow switching from the left to the right camera. The calibration procedure has previously computed R and T, knowing the coordinates PL and PR, in the reference systems of the cameras, that correspond to the same points of the 3D scene (at least 5 points). In Chap. 7 we will return in detail to the various methods of calibration of monocular and stereo vision systems.
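The roto-translation just mentioned can be sketched as follows (R and T are assumed to be already available from calibration; the numerical values below are purely illustrative and not taken from the text):

import numpy as np

def left_to_right(P_L, R, T):
    # Map a 3D point from the left-camera frame to the right-camera frame:
    # P_R = R @ P_L + T.
    return R @ np.asarray(P_L, dtype=float) + np.asarray(T, dtype=float)

# Illustrative values: a small rotation about the Y axis and a translation
# along X (both assumed).
theta = np.deg2rad(2.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([-0.12, 0.0, 0.0])      # metres
print(left_to_right([0.05, 0.02, 1.0], R, T))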

4.7 Stereo Vision Algorithms

In the preceding paragraphs we have described the algorithms proposed by Julesz and by Marr and Poggio, inspired by biological binocular vision. In this paragraph we will take up the basic concepts of stereo vision in order to create an artificial vision system capable of adequately reconstructing the 3D surface of the observed scene. The stereo vision problem essentially reduces to the identification of the elementary structures in the pair of stereo images and to the identification of the pairs of homologous elementary structures, that is, points of the stereo images that are the projection of the same point P of the 3D scene. The identification of homologous points in the pair of stereo images is also known in the literature as the problem of correspondence: for each point in the left image, find its corresponding point in the right image. The identification of homologous points in the pair of stereo images depends on two strategies:

1. which elementary structures of the stereo images must be chosen as candidate homologous structures;

2. which measure of similarity (or dissimilarity) must be chosen to quantify the level of dependence (or independence) between the structures identified by strategy 1.

For strategy 1, the use of contour points or elementary areas has already been proposed; recently, the Points Of Interest (POI) described in Chap. 6 Vol. II (for example, SIFT and SUSAN) are also used. For strategy 2, two classes of algorithms are obtained for the measurement of similarity (or dissimilarity), for point-like elementary structures or for extended areas. We immediately highlight the importance of similarity (or dissimilarity) measures that discriminate well between structures that differ little from one another (homogeneous distribution of the gray levels of the pixels), in order to keep the number of false matches to a minimum. The calculation of the depth is made only for the elementary structures found in the images, and in particular by choosing only the homologous structures (point-like or areas). For all the other structures (features) for which the depth cannot be calculated with stereo vision, interpolation techniques are used to obtain a more complete reconstruction of the visible 3D surface.


The search for homologous structures (strategy 2) is simplified when the geometry of the binocular system is conditioned by the epipolarity constraint. With this constraint, the homologous structures are located along the corresponding epipolar lines, and the search area in the left and right images is limited. The extent of the search area depends on the uncertainty of the intrinsic and extrinsic parameters of the binocular system (for example, uncertainty about the position and orientation of the cameras), making it necessary to search for the homologous structure in a small neighborhood around the position estimated, from the known geometry of the system, in the right image, thus slightly relaxing the epipolarity constraint (search in a horizontal and/or vertical neighborhood). In the simple binocular system with parallel and coplanar optical axes, or in the case of rectified stereo images, the search for homologous structures takes place by considering corresponding lines with the same vertical coordinate.

4.7.1 Point-Like Elementary Structures

The extraction of the elementary structures present in the pair of stereo images can be done by applying to these images some of the filtering operators described in Chap. 1 Vol. II Local Operations: Edging. In particular, point-like structures, contours, edge elements, and corners can be extracted. Julesz and Marr used random-dot images (black or white point-like synthetic structures) and the corresponding zero crossing structures extracted with the LoG filtering operator. With the constraint of epipolar geometry, a stereo vision algorithm includes the following essential steps:

1. Acquisition of the pair of stereo images.

2. Application to the two images of a Gaussian filter (with an appropriate parameter σ) to attenuate the noise.

3. Application to the two images of an operator extracting point-like structures (edges, contours, points, etc.).

4. Application of strategy 2 (see previous paragraph) for the search for homologous structures, analyzing those extracted in step 3. With the constraint of epipolar geometry, the search for homologous points is done by analyzing the two corresponding epipolar lines in the two images. To minimize the uncertainty in the evaluation of the similarity (or dissimilarity) of the structures, these can be described through different features such as, for example, the orientation θ (horizontal edges are excluded), the value of the contrast M, the length l of the structure, and the coordinates of the center of the structure, (xL, yL) in the left image and (xR, yR) in the right one. For any pair of elementary structures, whose n features are represented by the components of the respective vectors si = (s1i, s2i, ..., sni) and sj = (s1j, s2j, ..., snj), one of the following similarity measures can be used:


S_D = \langle s_i, s_j \rangle = s_i^T s_j = \|s_i\| \, \|s_j\| \cos\varphi = \sum_k s_{ki} s_{kj} = s_{1i} s_{1j} + s_{2i} s_{2j} + \cdots + s_{ni} s_{nj}     (4.18)

S_E = \|s_i - s_j\| = \sqrt{\sum_k w_k \, (s_{ki} - s_{kj})^2}     (4.19)

where SD represents the inner product between the vectors si and sj, which describe two generic elementary structures characterized by n parameters, φ is the angle between the two vectors, and ‖·‖ indicates the length (norm) of a vector. SE instead represents the Euclidean distance weighted by the weights wk of each characteristic sk of the elementary structures. The two similarity measures SD and SE can be normalized so as not to depend too much on the variability of the characteristics of the elementary structures. In this case, (4.18) can be normalized by dividing each term of the summation by the product of the vector moduli, given by

\|s_i\| \, \|s_j\| = \sqrt{\sum_k s_{ki}^2} \, \sqrt{\sum_k s_{kj}^2}

The weighted Euclidean distance measure can be normalized by dividing each addend of (4.19) by the term R_k^2, which represents the squared maximum range of variability of the k-th component. The Euclidean distance is used as a (dis)similarity estimate in the sense that the more different the components that describe a pair of candidate homologous structures are, the greater their difference, that is, the value of SE. The weights wk, relative to each characteristic sk, are calculated by analyzing a certain number of pairs of elementary structures for which we can guarantee that they are homologous. An alternative normalization can be chosen on a statistical basis, by subtracting from each characteristic sk its average and dividing by its standard deviation. Obviously, this is possible, for both measures SD and SE, only if the probability distribution of the characteristics is known, which can be estimated using known pairs {(si, sj), ...} of elementary structures. In conclusion, in this step the similarity measure of the pairs (si, sj) is estimated to check whether they are homologous structures and then, for each pair, the disparity measure dij = si(xR) − sj(xL) is calculated (a minimal numeric sketch of these similarity measures is given after this list).

5. The previous steps can be repeated to obtain different disparity estimates, by identifying and matching the structures at different scales, analogously to the coarse-to-fine approach proposed by Marr and Poggio described earlier in this chapter.

6. Calculation, with Eq. (4.12), of the 3D spatial coordinates (depth Z and coordinates X and Y) for each point of the visible surface represented by a pair of homologous structures.


7. Reconstruction of the visible surface, at the points where it was not possible to measure the depth, through an interpolation process, using the measurements of the stereo system estimated in step 6.
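As anticipated in step 4, here is a minimal numeric sketch of the two feature-similarity measures (4.18) and (4.19); the feature vectors and the weights are made-up values used only to show the computation.

import numpy as np

def normalized_inner_product(s_i, s_j):
    # Eq. (4.18) normalized by the product of the vector norms.
    s_i, s_j = np.asarray(s_i, float), np.asarray(s_j, float)
    return float(s_i @ s_j / (np.linalg.norm(s_i) * np.linalg.norm(s_j)))

def weighted_euclidean(s_i, s_j, w):
    # Eq. (4.19): Euclidean distance with per-feature weights w_k.
    s_i, s_j = np.asarray(s_i, float), np.asarray(s_j, float)
    return float(np.sqrt(np.sum(np.asarray(w, float) * (s_i - s_j) ** 2)))

# Example feature vectors (orientation, contrast, length): made-up values.
s_left  = [0.52, 34.0, 12.0]
s_right = [0.49, 31.0, 13.0]
weights = [1.0, 0.01, 0.1]           # down-weight features with large ranges
print(normalized_inner_product(s_left, s_right))
print(weighted_euclidean(s_left, s_right, weights))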

4.7.2 Local Elementary Structures and Correspondence Calculation Methods

The choice of local elementary structures is based on small image windows (3 × 3, 5 × 5, ...), characterized as features; through a similarity measure based on correlation, we can check whether there exists, in the other stereo image, a window of the same size that is a candidate to be considered the homologous structure (feature). In this case, comparing image windows, the potential homologous structures are the selected windows that give the maximum correlation, i.e., the highest value of similarity. Figure 4.56 shows a stereo pair of images with some windows found with a high correlation value, as they represent the projection of the same portion of the 3D surface of the observed scene. In the first horizontal line of the figure, we assume we can find the homologous windows centered on the same row of the stereo images because of the epipolarity constraint. The identification by correlation of the local elementary structures eliminates the drawback of the method based on point-like structures, which generates uncertain depth measurements in the areas of strong discontinuity of the gray levels, due to occlusion problems. Moreover, in areas where the visible surface is curved, the point-like structures of the stereo pair of images are difficult to identify and, when they are identified, they do not always coincide with the same physical structure. Point-like structures (edges, contours, corners, etc.) are instead well localized in the images precisely in the occlusion zones. Without the epipolarity constraint, we can look for local structures in the other image in a search area R whose size depends on the size of the local structures and on the geometry of the stereo system. Figure 4.56 shows this situation in another row of the stereo images. Suppose we have identified a local feature represented by the square window WL(k, l), k, l = −M, ..., +M, of size (2M + 1), centered in (xL, yL) in the left image IL. We search horizontally in the right image IR for a window WR of the same size as WL, located in the search area Rm,n(i, j), i, j = −N, ..., +N, of size (2N + 1), initially at position m = xL + dminx and n = yL + dminy, where dminx and dminy are, respectively, the minimum expected horizontal and vertical disparities (dminy is zero under the hypothesis of epipolar geometry), known a priori from the geometry of the binocular system. With the epipolarity constraint, the search area R would coincide with the window WR, with N = M (identical dimensions), and the correspondence in the right image would result in xR = xL + dx (the disparity is added because the right image is assumed to be shifted to the right of the left one, as shown in Fig. 4.56; otherwise the disparity dx would be subtracted).


Fig. 4.56 Search in stereo images for homologous points through the correlation function between potentially homologous windows, with and without the epipolarity constraint

4.7.2.1 Correlation-Based Matching Function

The goal is to move the window WR within R so that, according to a matching measure⁹ C, the maximum value is reached when the two windows are similar, i.e., when they represent homologous local structures. This matching measure can be the cross-correlation function or other similarity/dissimilarity functions (e.g., the sum of differences). Without the epipolarity constraint, the similar local structures to be searched for in the stereo images are not aligned horizontally. Therefore, we must find the window WR, the homologue of WL of the left image, in the search area R centered in (m, n) in the right image. The dimensions of R must be adequate, as indicated in the previous paragraph (see Fig. 4.56). Let (i, j) be the position in the search area R at which the window WR for which the correlation C is maximum is centered.

9 In the context of image processing, it is often necessary to compare the level of equality (similarity, matching) of similar objects described with multidimensional vectors or even as multidimensional images. Often an object is described by a known model, and one has the problem of comparing the data of the model with those of the same object observed even in conditions different from those of the model. Thus the need arises to define functionals or techniques whose objective is to verify the level of similarity or drift (dissimilarity) between the data of the model and those observed. A technique known as template matching is used when comparing model data (template image) with the observed data of the same physical entity (small parts of the captured image). Alternatively, we can use similarity functions (which assess the level of similarity, affinity) based on correlation, or dissimilarity functions based on the distance associated with a norm (which assess the level of drift or distortion between data). In this context, the objects to be compared do not have a model; instead, we want to compare small windows of the stereo images, acquired simultaneously, which represent the same physical structure observed from slightly different points of view.


The window WR in the image IR will then be located at the position (xR, yR) = (m + i, n + j), with resulting horizontal and vertical disparities, respectively, dx = xR − xL and dy = yR − yL. The correlation function between the window WL and the search area Rm,n(i, j) is indicated for simplicity with C(i, j; m, n). Once WL has been detected at the position (xL, yL) in the left image IL, for each pixel position (i, j) in Rm,n a correlation measure is calculated between the window WL (seen as a template window) and the window WR (which slides in R by varying the indices i, j), with the following relation:

C(i, j; m, n) = \sum_{k=-M}^{+M} \sum_{l=-M}^{+M} W_L(x_L + k, y_L + l) \cdot W_R(\underbrace{i + m}_{x_R} + k, \; \underbrace{j + n}_{y_R} + l)     (4.20)

with i, j = −(N − M − 1), ..., +(N − M − 1). The size of the square window WL is given by (2M + 1), with values of M = 1, 2, ... generating windows of size 3 × 3, 5 × 5, etc., respectively. The size of the square search area Rm,n, located at (m = xL + dminx, n = yL + dminy) in the right image IR, is given by (2N + 1), related to the size of the window WR with N = M + q, q = 1, 2, .... For each value of (i, j) at which WR is centered in the search region R we have a value of C(i, j; m, n), and to move the window WR in R it is sufficient to vary the indices i, j = −(N − M − 1), ..., +(N − M − 1) inside R, whose dimensions and position can be defined a priori in relation to the geometry of the stereo system, which allows a maximum and a minimum disparity. The accuracy of the correlation measurement depends on the variability of the gray levels between the two stereo images. To minimize this drawback, the correlation measurements C(i, j; m, n) can be normalized using the correlation coefficient¹⁰ r(i, j; m, n) as the new correlation estimate, given by

r(i, j; m, n) = \frac{\sum_{k=-M}^{+M} \sum_{l=-M}^{+M} W_L(k, l; x_L, y_L) \cdot W_R(k, l; i + m, j + n)}{\sqrt{\sum_{k=-M}^{+M} \sum_{l=-M}^{+M} W_L(k, l; x_L, y_L)^2 \; \sum_{k=-M}^{+M} \sum_{l=-M}^{+M} W_R(k, l; i + m, j + n)^2}}     (4.21)

where

W_L(k, l; x_L, y_L) = W_L(x_L + k, y_L + l) - \bar{W}_L(x_L, y_L) \qquad W_R(k, l; i + m, j + n) = W_R(i + m + k, j + n + l) - \bar{W}_R(i + m, j + n)     (4.22)

with i, j = −(N − M − 1), ..., +(N − M − 1) and (m, n) fixed as above. W̄L and W̄R are the mean values of the intensities in the two windows. The correlation coefficient is also known as Zero-mean Normalized Cross-Correlation (ZNCC). The numerator of (4.21) represents the covariance of the pixel intensities between the two windows, while the denominator is the product of the respective standard deviations. It can easily be deduced that the correlation coefficient r(i, j; m, n) takes scalar values in the range between −1 and +1, no longer depending on the variability of the intensity levels in the two stereo images. In particular, r = 1 corresponds to the exact equality of the elementary structures (homologous structures, up to a constant factor c, WR = cWL; that is, the two windows are highly correlated even if one has a uniformly brighter intensity than the other).

10 The correlation coefficient has been described in Sect. 1.4.2 and in this case it is used to evaluate the statistical dependence of the intensity of the pixels between the two windows, without knowing the nature of this statistical dependence.


r = 0 means that they are completely different, while r = −1 indicates that they are anticorrelated (i.e., the intensities of the corresponding pixels are equal but of opposite sign). The previous correlation measures may not be adequate for some applications, due to the noise present in the images and in particular when the search regions are very homogeneous, with little variability of the intensity values. This generates very uncertain or uniform correlation values C or r, with a consequent uncertainty in the estimation of the horizontal and vertical disparities (dx, dy). More precisely, if the windows WL and WR represent the intensities in two images obtained under different lighting conditions of a scene and the corresponding intensities are linearly correlated, a high similarity between the images will be obtained. Therefore, the correlation coefficient is suitable for determining the similarity between windows whose intensities are assumed to be linearly correlated. When, on the other hand, the images are acquired under different conditions (different sensors and nonuniform illumination), so that the corresponding intensities are correlated in a nonlinear way, two perfectly matched windows may not produce sufficiently high correlation coefficients, causing mismatches. Another drawback is the intensive computation required, especially when the size of the windows increases. In [25] an algorithm is described that optimizes the computational complexity of the problem of template matching between images.
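The following is a minimal sketch of ZNCC-based window matching along an epipolar line; rectified images are assumed, and the window half-size and disparity range are illustrative parameters, not prescriptions from the text.

import numpy as np

def zncc(w_left, w_right):
    # Zero-mean normalized cross-correlation (Eq. 4.21) between two windows.
    a = w_left - w_left.mean()
    b = w_right - w_right.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def best_match_zncc(I_left, I_right, x_l, y_l, M=3, d_min=0, d_max=32):
    # Search along the same row (epipolarity constraint) for the disparity
    # that maximizes the ZNCC between the left window and a right window.
    w_l = I_left[y_l - M:y_l + M + 1, x_l - M:x_l + M + 1].astype(float)
    best_d, best_r = None, -2.0
    for d in range(d_min, d_max + 1):
        x_r = x_l + d                # right image assumed shifted to the right
        if x_r + M >= I_right.shape[1]:
            break
        w_r = I_right[y_l - M:y_l + M + 1, x_r - M:x_r + M + 1].astype(float)
        r = zncc(w_l, w_r)
        if r > best_r:
            best_d, best_r = d, r
    return best_d, best_r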

4.7.2.2 Distance-Based Matching Function

An alternative is to consider a dissimilarity measure that evaluates how different the pixel intensities of two images (or of windows extracted from them) are. Such a dissimilarity measure can be associated with a metric, producing a high value to indicate a large difference between the two images. As highlighted above for the similarity metrics, similarity/dissimilarity metrics are not always effective, especially when the environmental lighting conditions change the intensities of the two images to be compared in a nonlinear way. In these cases, it may be useful to use nonmetric measures. If the stereo images are acquired under the same environmental conditions, the correspondence between the windows WL and WR can be evaluated with the simple dissimilarity measures known as the Sum of Squared Differences (SSD) or the Sum of Absolute Differences (SAD) of the intensities, instead of the products WL · WR. These dissimilarity measures are given by the following expressions:

$$C_{SSD}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\big[W_L(x_L+k,\,y_L+l)-W_R(i+m+k,\,j+n+l)\big]^2 \qquad (4.23)$$

$$C_{SAD}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\big|W_L(x_L+k,\,y_L+l)-W_R(i+m+k,\,j+n+l)\big| \qquad (4.24)$$

with i, j = −(N − M − 1), +(N − M − 1).
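The two dissimilarity measures of Eqs. (4.23) and (4.24) are straightforward to implement; the sketch below is an illustrative NumPy version (the function names are assumptions, not part of the text).

```python
import numpy as np

def ssd(wl: np.ndarray, wr: np.ndarray) -> float:
    """Sum of Squared Differences between two equal-size windows (Eq. 4.23)."""
    d = wl.astype(np.float64) - wr.astype(np.float64)
    return float(np.sum(d * d))

def sad(wl: np.ndarray, wr: np.ndarray) -> float:
    """Sum of Absolute Differences between two equal-size windows (Eq. 4.24)."""
    d = wl.astype(np.float64) - wr.astype(np.float64)
    return float(np.sum(np.abs(d)))

# The candidate window in the search region with the LOWEST cost is chosen as homologous.
```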


For these two dissimilarity measures CSSD(i, j; m, n) and CSAD(i, j; m, n) (based essentially on the L2 and L1 norms, respectively), the minimum value is chosen as the best match between the windows WL(xL, yL) and WR(xR, yR) = WR(i + m, j + n), which are taken as homologous local structures with disparity estimate (dx = xR − xL, dy = yR − yL) (see Fig. 4.56). The SSD metric is computationally less expensive than the correlation coefficient (4.21) and, like the latter, can be normalized to obtain equivalent results; several normalization methods exist in the literature. The SAD metric is the most widely used since it requires the least computational load. All the matching measures described are sensitive to geometric deformations (skewing, rotation, occlusions, ...) and radiometric distortions (vignetting, impulse noise, ...). The latter can be attenuated, also for the SSD and SAD metrics, by subtracting from each pixel the mean value $\bar{W}$ computed on the windows to be compared, as already done for the correlation coefficient (4.21). The two metrics become the Zero-mean Sum of Squared Differences (ZSSD) and the Zero-mean Sum of Absolute Differences (ZSAD) and their expressions, considering Eq. (4.22), are

$$C_{ZSSD}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\big[\tilde{W}_L(k,l;x_L,y_L)-\tilde{W}_R(k,l;i+m,j+n)\big]^2 \qquad (4.25)$$

$$C_{ZSAD}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\big|\tilde{W}_L(k,l;x_L,y_L)-\tilde{W}_R(k,l;i+m,j+n)\big| \qquad (4.26)$$

with i, j = −(N − M − 1), +(N − M − 1), where $\tilde{W}_L$ and $\tilde{W}_R$ denote the mean-subtracted windows of Eq. (4.22). In analogy with the normalization applied to the correlation coefficient, to make the ZSSD measure insensitive to the image contrast we can normalize the intensities of the two windows first with respect to their means and then with respect to their standard deviations, obtaining the Normalized Zero-mean Sum of Squared Differences (NZSSD). Like the correlation coefficient, this measure is therefore suitable for comparing images captured under different lighting conditions. When the images are affected by impulse noise, the ZSAD and ZSSD dissimilarity measures, based on the L1 and L2 norms, respectively, produce high distance values. In this case, to reduce the effect of impulse noise on the computed dissimilarity measure, the median of the absolute differences (MAD) or the median of the squared differences (MSD) can be used instead of the mean of the absolute or squared differences to measure the dissimilarity between the two windows WL and WR. Computing MAD involves taking the absolute intensity differences of the corresponding pixels in the two windows (for MSD, the squared intensity differences), sorting these differences, and then choosing the median value as the dissimilarity measure. The median filter is described in Sect. 9.12.4 Vol. I; in this context it is applied to the differences of the two windows, arranged in a one-dimensional vector, discarding half of the larger values (absolute differences for MAD, squared differences for MSD). In addition to handling impulse noise, MAD and MSD can be effective measures in determining the dissimilarity between windows containing occluded parts of the observed scene.

Returning to the size of the windows: besides influencing the computational load, it affects the precision with which homologous structures are located in the stereo images. A small window locates structures with greater precision but is more sensitive to noise; conversely, a large window is more robust to noise but reduces the localization accuracy on which the disparity value depends. Finally, a larger window tends to violate the disparity continuity constraint. Figure 4.57 shows the depth maps obtained by detecting elementary local structures (starting from 3 × 3 square windows) and using the similarity/dissimilarity functions described above to find homologous windows in the two stereo images.

Fig. 4.57 Calculation of the depth map from a pair of stereo images by detecting homologous local elementary structures using similarity functions. The first column shows the stereo images and the real depth map of the scene. The following columns show the depth maps obtained with windows of increasing size starting from 3 × 3, where the corresponding windows in the stereo images are matched with the correlation similarity function (first row), SSD (second row), and SAD (third row).
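To make the window-size trade-off concrete, the following sketch (an illustrative assumption, not taken from the text) computes a dense disparity map for a rectified stereo pair by sweeping a window along each scanline and minimizing the SAD cost of Eq. (4.24); the search range d_max and the half-size M are illustrative parameters, and the nested loops are kept simple rather than optimized.

```python
import numpy as np

def disparity_map_sad(IL: np.ndarray, IR: np.ndarray, M: int = 2, d_max: int = 32) -> np.ndarray:
    """Window-based horizontal disparity for a rectified stereo pair.

    IL, IR : grayscale images of equal shape (rectified: homologous points lie
             on the same row). The window is (2M+1) x (2M+1); d_max is the
             maximum disparity searched.
    """
    IL = IL.astype(np.float64)
    IR = IR.astype(np.float64)
    rows, cols = IL.shape
    disp = np.zeros((rows, cols), dtype=np.int32)
    for y in range(M, rows - M):
        for x in range(M + d_max, cols - M):          # leave room for the search
            wl = IL[y - M:y + M + 1, x - M:x + M + 1]
            best_d, best_cost = 0, np.inf
            for d in range(d_max + 1):                # candidate homologous window in IR
                wr = IR[y - M:y + M + 1, x - d - M:x - d + M + 1]
                cost = np.sum(np.abs(wl - wr))        # SAD, Eq. (4.24)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```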

4.7.2.3 Matching Function Based on the Rank Transform

An alternative dissimilarity measure, not a metric, known as the Rank Distance (RD), is based on the rank transform [26]. It is useful in the presence of significant radiometric variations and occlusions in stereo images.


The rank transform is applied to the two windows of the stereo images, where the intensity of each pixel is replaced by its rank Rank(i, j). For a given window W in the image, centered on the pixel p(i, j), the rank transform RankW(i, j) is defined as the number of pixels whose intensity is less than the value of p(i, j). For example, if

$$W=\begin{pmatrix} 79 & 42 & 51\\ 46 & 36 & 34\\ 37 & 30 & 28 \end{pmatrix}$$

then RankW(i, j) = 3, since there are three pixels with intensity less than the central pixel of value 36. Note that the values obtained are based on the relative order of the pixel intensities rather than on the intensities themselves; the position of the pixels inside the window is also lost. Using the preceding symbolism (see also Fig. 4.56), the rank distance dissimilarity measure RD(i, j), based on the rank transform Rank(i, j), is given by the following:

$$RD(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\big|Rank_{W_L}(x_L+k,\,y_L+l)-Rank_{W_R}(i+m+k,\,j+n+l)\big| \qquad (4.27)$$

In (4.27) the value of RankW for a window centered in (i, j) in the image is calculated as follows:

$$Rank_W(i,j)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} L(i+k,\,j+l) \qquad (4.28)$$

where L(k, l) is given by

$$L(k,l)=\begin{cases} 1 & \text{if } W(k,l)<W(i,j)\\ 0 & \text{otherwise} \end{cases} \qquad (4.29)$$

Equation (4.29) counts the number of pixels in the window W whose intensity is less than that of the central pixel (i, j). Once the rank has been calculated with (4.28) for the window WL(xL, yL) and for the windows WR(i + m, j + n), i, j = −(N − M − 1), +(N − M − 1), in the search region R(i, j) located at (m, n) in the image IR, the comparison between the windows is evaluated with the SAD method (sum of absolute differences) through (4.27). The rank distance is not a metric,11 like all other measures based on ordering. The dissimilarity measure based on the rank transform actually compresses the information content of the image (the information of a window is encoded in a single value), thus reducing the potential discriminating ability of the comparison between windows; the choice of the window size becomes even more important in this method. The computational complexity of the rank distance is reduced compared to correlation and ordering methods, being of the order of n log2 n, where n indicates the number of pixels of the window W.

11 The rank distance is not a metric because it does not satisfy the reflexivity property of metrics. If WL = WR, each corresponding pixel has the same intensity value and it follows that RD = 0. Conversely, RD = 0, being RD a sum of nonnegative numbers given by Eq. (4.27), would require |RankWL(i, j) − RankWR(i, j)| = 0 for each pair of corresponding pixels of the two windows. But the intensities of the pixels can differ by an offset or a scale factor and still produce RD = 0; therefore RD = 0 does not imply that WL = WR, thus violating the reflexivity property of a metric, which would require RD(WL, WR) = 0 ⇔ WL = WR.


Fig. 4.58 Calculation of the depth map, for the same stereo images of Fig. 4.57, obtained using the census and rank transform. a Transformed census applied to the left image; b map of disparity based on the census transform and Hamming distance; c transformed rank applied to the left image; d map of disparity based on the rank distance and the SAD matching method

Experimental results have shown that this method reduces ambiguities in the comparison of windows, in particular in stereo images with local intensity variations and impulse noise. Figure 4.58 shows the result of the rank transform (figure a) applied to the left image of Fig. 4.57 and the disparity map (figure b) based on the rank distance and the SAD matching method.
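A compact sketch of the rank transform (Eqs. 4.28 and 4.29) and of the rank distance (Eq. 4.27) is given below; it reproduces the 3 × 3 example of the text (the NumPy implementation and function names are assumptions).

```python
import numpy as np

def rank_transform(img: np.ndarray, M: int = 1) -> np.ndarray:
    """Rank transform (Eqs. 4.28-4.29): each pixel is replaced by the number of
    pixels in its (2M+1)x(2M+1) neighborhood with intensity lower than its own."""
    img = img.astype(np.float64)
    rows, cols = img.shape
    out = np.zeros_like(img)
    for y in range(M, rows - M):
        for x in range(M, cols - M):
            w = img[y - M:y + M + 1, x - M:x + M + 1]
            out[y, x] = np.sum(w < img[y, x])
    return out

def rank_distance(rank_wl: np.ndarray, rank_wr: np.ndarray) -> float:
    """Rank distance (Eq. 4.27): SAD between the rank-transformed windows."""
    return float(np.sum(np.abs(rank_wl - rank_wr)))

# The 3x3 example of the text: the central pixel (value 36) has rank 3
W = np.array([[79, 42, 51], [46, 36, 34], [37, 30, 28]])
print(rank_transform(W, M=1)[1, 1])   # -> 3.0
```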

4.7.2.4 Matching Function Based on the Hamming Distance of the Census Transform

A variant of the rank transform is the Census Transform (CT) [27], which preserves information on the spatial distribution of the pixels when assessing the rank, by generating a sequence of binary digits. The most widespread version of the census transform uses 3 × 3 windows to convert the pixel intensities of the input image into a binary encoding obtained by analyzing the pixels around the pixel under examination. The values of the surrounding pixels are compared with the central pixel of the current window: a 3 × 3 binary mask associated with the central pixel is produced by comparing it with its 8 neighbors, thus generating the 8-bit binary encoding of the pixel under examination. The census function that generates the bit string by comparing the central pixel of the window with the neighboring ones is

$$Census_W(i,j)=Bitstring_W\big[W(k,l)<W(i,j)\big]=\begin{cases} 1 & \text{if } W(k,l)<W(i,j)\\ 0 & \text{otherwise} \end{cases} \qquad (4.30)$$

For example, applying the census function to the same window used above, for the central pixel with value 36 the following binary encoding is obtained:

$$W=\begin{pmatrix} 79 & 42 & 51\\ 46 & 36 & 34\\ 37 & 30 & 28 \end{pmatrix} \;\Leftrightarrow\; \begin{pmatrix} 0 & 0 & 0\\ 0 & (i,j) & 1\\ 0 & 1 & 1 \end{pmatrix} \;\Leftrightarrow\; (0\,0\,0\,0\,1\,0\,1\,1)_2 \;\Leftrightarrow\; (11)_{10}$$

From the result of the comparison we obtain a binary mask whose bits, excluding the central one, are concatenated row by row, from left to right, to form the final 8-bit code that represents the central pixel of the window W. Finally, the decimal value representing this 8-bit code is assigned as the value of the central pixel of W. The same operation is performed on the window to be compared. The dissimilarity measure DCT, based on the CT, is evaluated by comparing the bit strings of the windows with the Hamming distance, which counts the number of differing bits. The dissimilarity measure DCT is calculated according to (4.30) as follows:

$$D_{CT}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} Dist_{Hamming}\big[Census_{W_L}(x_L+k,\,y_L+l),\;Census_{W_R}(i+m+k,\,j+n+l)\big] \qquad (4.31)$$

where CensusWL represents the census bit string of the window WL(xL, yL) in the left stereo image and CensusWR(i + m, j + n) represents the census bit string of the windows in the search area R of the right stereo image to be compared (see Fig. 4.56). The function (4.30) generates a census bit string whose length depends, as seen in the example, on the size of the window, i.e., Ls = (2M + 1)² − 1, the number of pixels in the window minus the central one (in the example Ls = 8). Compared to the rank transform there is a considerable increase in the dimensionality of the data, which grows with the size of the windows to be compared, with a consequent increase in the required computation. For this method, real-time implementations based on ad hoc hardware (FPGA, Field Programmable Gate Arrays) have been developed for processing the binary strings. Experimental results have shown that these last two dissimilarity measures, based on the rank and census transforms, are more effective than correlation-based methods in obtaining disparity maps from stereo images in the presence of occlusions and radiometric distortions. Figure 4.58 shows the result of the census transform (figure c) applied to the left image of Fig. 4.57 and the disparity map (figure d) obtained on the basis of the census transform and the Hamming distance.
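The census transform of Eq. (4.30) and the Hamming comparison can be sketched as follows (an illustrative implementation with assumed function names); it reproduces the 8-bit code of the example above.

```python
import numpy as np

def census_3x3(img: np.ndarray) -> np.ndarray:
    """Census transform with a 3x3 window (Eq. 4.30): each pixel becomes an
    8-bit code, one bit per neighbor, set to 1 where the neighbor is darker."""
    img = img.astype(np.float64)
    rows, cols = img.shape
    out = np.zeros((rows, cols), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            code = 0
            for dy, dx in offsets:                       # row by row, left to right
                code = (code << 1) | int(img[y + dy, x + dx] < img[y, x])
            out[y, x] = code
    return out

def hamming(a: int, b: int) -> int:
    """Hamming distance between two census codes (number of differing bits)."""
    return bin(a ^ b).count("1")

# The 3x3 example of the text: code (00001011)_2 = 11
W = np.array([[79, 42, 51], [46, 36, 34], [37, 30, 28]])
print(census_3x3(W)[1, 1])               # -> 11
print(hamming(0b00001011, 0b00001110))   # -> 2 differing bits
```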

4.7.2.5 Gradient-Based Matching Function

More reliable matching measures can be obtained by operating on the magnitude of the gradient of the images, replacing the intensity levels, which are more sensitive to noise [28]. For the calculation of depth maps from stereo images with small disparities it is possible to use an approach based on the optical flow, formulated through a differential equation that links the motion information to the intensity of the stereo images:

$$(\nabla_x I)\,\nu + I_t = 0 \qquad (4.32)$$

where $(\nabla_x I)$ indicates the horizontal component of the gradient of the image I, $I_t$ is the time derivative, which in this context refers to the intensity differences between the two stereo images, and ν is the translation between the two images. The optical flow functional (4.32) assumes that the lighting conditions of the scene do not change during the acquisition of the stereo images. A dense disparity map can be estimated iteratively with the least squares method applied to the system of differential equations generated by applying (4.32) to each pixel of the window centered on the pixel being processed, imposing the constraint that the disparity varies continuously over the pixels of the window itself. This minimization is repeated at each pixel of the image to estimate the disparity. This methodology is described in detail in Sect. 6.4.

Another way to use the gradient is to accumulate, for each pixel of the two windows WL and WR to be compared, the differences of the horizontal and vertical gradient components. The result is a measure of dissimilarity between the accumulated gradients of the two windows. We can consider dissimilarity measures DG_SSD and DG_SAD based, respectively, on the sum of the squared differences (SSD) and on the sum of the absolute differences (SAD) of the horizontal $(\nabla_x W)$ and vertical $(\nabla_y W)$ components of the gradient of the local structures to be compared. In this case the DG dissimilarity measures, based on the gradient components accumulated according to the SSD or SAD sums, are given, respectively, by the following functions:

$$D_{G_{SSD}}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\Big[\big(\nabla_x W_L(x_L+k,y_L+l)-\nabla_x W_R(i+m+k,j+n+l)\big)^2+\big(\nabla_y W_L(x_L+k,y_L+l)-\nabla_y W_R(i+m+k,j+n+l)\big)^2\Big]$$

$$D_{G_{SAD}}(i,j;m,n)=\sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\Big[\big|\nabla_x W_L(x_L+k,y_L+l)-\nabla_x W_R(i+m+k,j+n+l)\big|+\big|\nabla_y W_L(x_L+k,y_L+l)-\nabla_y W_R(i+m+k,j+n+l)\big|\Big]$$

using the same symbols defined in the previous paragraphs. In [29] a technique is proposed that uses the magnitudes of the gradient vectors |∇WL| and |∇WR| to be compared in order to derive a similarity measure (called evidence measure) as the weighted sum of two terms: the average of the gradient magnitudes, (|∇WL| + |∇WR|)/2, and the negated difference of the gradients. The evidence measure EG based on the two terms is therefore

$$E_G(i,j;m,n)=\frac{|\nabla W_L(x_L,y_L)|+|\nabla W_R(i+m,j+n)|}{2}-\alpha\,\big|\nabla W_L(x_L,y_L)-\nabla W_R(i+m,j+n)\big|$$

where α is the weight parameter that balances the two terms. A large value of EG implies a high similarity between the compared gradient vectors.
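As an illustration of gradient-based matching, the sketch below (an assumed implementation, using simple finite differences as the gradient operator) computes a DG_SAD-style dissimilarity; note that a uniform brightness offset between the two windows does not affect it, since the offset disappears in the gradients.

```python
import numpy as np

def grad_sad(wl: np.ndarray, wr: np.ndarray) -> float:
    """Gradient-based dissimilarity in the style of DG_SAD: accumulates the
    absolute differences of the horizontal and vertical gradient components
    instead of the raw intensities."""
    gly, glx = np.gradient(wl.astype(np.float64))   # vertical and horizontal components
    gry, grx = np.gradient(wr.astype(np.float64))
    return float(np.sum(np.abs(glx - grx)) + np.sum(np.abs(gly - gry)))

# A uniform brightness offset between the two windows leaves the measure unchanged.
WL = np.array([[10, 20, 30], [15, 25, 35], [20, 30, 40]], dtype=float)
WR = WL + 50.0
print(grad_sad(WL, WR))   # -> 0.0
```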

4.7.2.6 Global Methods for Correspondence

Methods that evaluate the correspondence on a global basis (i.e., analyzing the entire image to assign the disparity), as an alternative to local and point-like methods, have been proposed and are useful to further reduce the problems of local intensity discontinuity, occlusion, and uniform texture in stereo images. The solution to these problems requires global constraints that increase the computational complexity and cannot currently be implemented easily in real time. Precisely to reduce the computational load as much as possible, various methods have been proposed based on dynamic programming (assuming ordering and continuity constraints) and on the minimization of a global energy function formulated as the sum of evidence and compatibility terms. Methods based on dynamic programming attempt to reduce the complexity of the problem by splitting it into smaller and simpler subproblems, setting a functional cost at the various stages with appropriate constraints. The best known algorithms are: Graph Cuts [30], Belief Propagation [31], Intrinsic Curves [32], and Nonlinear Diffusion [33]. Further methods have been developed, known as hybrid or semiglobal methods, which essentially use approaches similar to the global ones but operate on parts of the image, such as line by line, to reduce the considerable computational load required by global methods.
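As a rough illustration of the line-by-line strategy mentioned above, the following sketch implements a generic scanline dynamic-programming matcher with an occlusion penalty; it is not one of the cited algorithms, and the matching cost and penalty value are illustrative assumptions.

```python
import numpy as np

def dp_scanline_disparity(left_row: np.ndarray, right_row: np.ndarray, occ: float = 20.0) -> np.ndarray:
    """Dynamic-programming matching of one rectified scanline pair.

    Builds a cost table in which each left pixel is either matched to a right
    pixel (squared intensity difference) or declared occluded (penalty occ),
    then backtracks to read out the disparity of each matched left pixel.
    """
    n = len(left_row)
    D = np.zeros((n + 1, n + 1))
    D[0, :] = occ * np.arange(n + 1)
    D[:, 0] = occ * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            match = D[i - 1, j - 1] + (float(left_row[i - 1]) - float(right_row[j - 1])) ** 2
            D[i, j] = min(match, D[i - 1, j] + occ, D[i, j - 1] + occ)
    # Backtrack: disparity = i - j where left pixel i is matched to right pixel j
    disp = np.zeros(n)
    i, j = n, n
    while i > 0 and j > 0:
        match = D[i - 1, j - 1] + (float(left_row[i - 1]) - float(right_row[j - 1])) ** 2
        if D[i, j] == match:
            disp[i - 1] = i - j
            i, j = i - 1, j - 1
        elif D[i, j] == D[i - 1, j] + occ:
            i -= 1          # left pixel occluded
        else:
            j -= 1          # right pixel occluded
    return disp
```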

4.7.3 Sparse Elementary Structures

So far we have generically considered point-like and local elementary structures, represented by windows WL and WR of size (2M + 1), in the stereo images, without specifying the criterion by which the candidate windows WL are selected for the search of the corresponding window WR in the right image. One can easily think of excluding homogeneous structures with little texture. The objective is to analyze the left image and identify a set of significant structures, point-like or local, that are probably also found in the right image. These structures are those detected with the algorithms described in Chap. 6 Vol. II, Points and descriptors of points of interest, or the Structures of Interest (SdI) (border elements, small or large areas with texture) computed at different positions (x, y) in the stereo images, without the constraints of epipolarity and knowledge of the parameters of the binocular system. That chapter also contains some examples of the search for homologous points of interest in stereo image pairs using the SIFT algorithm.

4.7.4 PMF Stereo Vision Algorithm

The PMF stereo vision algorithm (A stereo correspondence algorithm using a disparity gradient limit), proposed in 1985 by Pollard, Mayhew, and Frisby [28], is based on a computational model in which the problem of stereo correspondence is seen as intimately integrated with the problem of identifying and describing the candidate homologous elementary structures (primal sketch). This contrasts with the computational model proposed by Marr and Poggio, which sees stereo vision as separate modules for the identification of elementary structures and for the correspondence problem. The PMF algorithm differs mainly because it includes the constraint of the continuity of the visible surface (figural continuity). Structures that are candidates to be homologous must have a disparity value contained in a certain range and, if there are several candidate homologous structures, some will have to be eliminated by verifying whether structures in the vicinity of the candidate support the same relation of surface continuity (figural continuity).


Fig. 4.59 Cyclopic separation and disparity gradient

The constraint of figural continuity eliminates many false pairs of homologous structures when applied to natural and simulated images. In particular, PMF exploits what was observed by Burt and Julesz [34], who found that in human binocular vision the image fusion process is tolerated for homologous structures with a disparity gradient up to the value 1. The disparity gradient of two elementary structures A and B (points, contours, zero crossings, SdI, etc.) identified in the pair of stereo images is given by the ratio between the difference of their disparity values and their cyclopean separation (see Fig. 4.59 for the concept of disparity gradient). The disparity gradient measures the relative disparity of two pairs of homologous elementary structures. The authors of the PMF algorithm assert that the fusion process of binocular images can tolerate homologous structures within the unit value of the disparity gradient. In this way, false homologous structures are avoided, and the constraint of continuity of the visible surface introduced previously is implicitly satisfied. Now let us see how to calculate the disparity gradient by considering pairs of points (A, B) in the two stereo images (see Fig. 4.59). Let AL = (xAL, yAL) and BL = (xBL, yBL), AR = (xAR, yAR) and BR = (xBR, yBR) be the projections in the two stereo images of the points A and B of the visible 3D surface. The disparity values for this pair of points A and B are given by

$$d_A = x_{A_R} - x_{A_L} \qquad \text{and} \qquad d_B = x_{B_R} - x_{B_L} \qquad (4.33)$$

A cyclopean image (see Fig. 4.59) is obtained by projecting the considered points A and B, respectively, into the points Ac and Bc, whose coordinates are given by the average of the coordinates of the same points A and B projected in the stereo image pair. The coordinates of the cyclopean points Ac and Bc are

$$x_{A_c} = \frac{x_{A_L} + x_{A_R}}{2} \qquad \text{and} \qquad y_{A_c} = y_{A_L} = y_{A_R} \qquad (4.34)$$

$$x_{B_c} = \frac{x_{B_L} + x_{B_R}}{2} \qquad \text{and} \qquad y_{B_c} = y_{B_L} = y_{B_R} \qquad (4.35)$$


The cyclopean separation S is given by the Euclidean distance between the cyclopean points:

$$\begin{aligned} S(A,B) &= \sqrt{(x_{A_c}-x_{B_c})^2 + (y_{A_c}-y_{B_c})^2}\\ &= \sqrt{\left(\frac{x_{A_L}+x_{A_R}}{2}-\frac{x_{B_L}+x_{B_R}}{2}\right)^2 + (y_{A_c}-y_{B_c})^2}\\ &= \sqrt{\tfrac{1}{4}\big[(x_{A_L}-x_{B_L})+(x_{A_R}-x_{B_R})\big]^2 + (y_{A_c}-y_{B_c})^2}\\ &= \sqrt{\tfrac{1}{4}\big[d_x(A_L,B_L)+d_x(A_R,B_R)\big]^2 + (y_{A_c}-y_{B_c})^2} \end{aligned} \qquad (4.36)$$

where $d_x(A_L,B_L)$ and $d_x(A_R,B_R)$ are the horizontal distances of the points A and B projected in the two stereo images. The difference in disparity between the pairs of points $(A_L,A_R)$ and $(B_L,B_R)$ is given by

$$\begin{aligned} D(A,B) &= d_A - d_B\\ &= (x_{A_R}-x_{A_L}) - (x_{B_R}-x_{B_L})\\ &= (x_{A_R}-x_{B_R}) - (x_{A_L}-x_{B_L})\\ &= d_x(A_R,B_R) - d_x(A_L,B_L) \end{aligned} \qquad (4.37)$$

The disparity gradient G for the pair of homologous points $(A_L,A_R)$ and $(B_L,B_R)$ is given by the ratio between the disparity difference D(A, B) and the cyclopean separation S(A, B) of Eq. (4.36):

$$G(A,B) = \frac{D(A,B)}{S(A,B)} = \frac{d_x(A_R,B_R)-d_x(A_L,B_L)}{\sqrt{\tfrac{1}{4}\big[d_x(A_L,B_L)+d_x(A_R,B_R)\big]^2 + (y_{A_c}-y_{B_c})^2}} \qquad (4.38)$$
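A small numeric sketch of Eqs. (4.36)-(4.38) follows; the point coordinates are invented for illustration.

```python
import numpy as np

def disparity_gradient(aL, aR, bL, bR):
    """Disparity gradient G(A,B) of Eq. (4.38) for two pairs of homologous points.

    aL, aR, bL, bR are (x, y) image coordinates of points A and B in the left
    and right images (epipolar geometry: the y coordinates of homologous points coincide).
    """
    dxL = aL[0] - bL[0]                  # d_x(A_L, B_L)
    dxR = aR[0] - bR[0]                  # d_x(A_R, B_R)
    yAc, yBc = aL[1], bL[1]              # cyclopean y coordinates (Eqs. 4.34-4.35)
    D = dxR - dxL                        # disparity difference, Eq. (4.37)
    S = np.sqrt(0.25 * (dxL + dxR) ** 2 + (yAc - yBc) ** 2)   # Eq. (4.36)
    return D / S

# Illustrative points: A and B with slightly different disparities
A_left, A_right = (100.0, 50.0), (92.0, 50.0)     # d_A = -8
B_left, B_right = (120.0, 60.0), (110.0, 60.0)    # d_B = -10
print(disparity_gradient(A_left, A_right, B_left, B_right))   # well below the unit limit
```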

Given this definition of the disparity gradient G, the constraint of the disparity gradient limit is immediate, since Eq. (4.38) shows that G can never exceed the unit value. It follows that even small differences in disparity are not acceptable if the points A and B, considered in 3D space, are very close to each other. This is easy to understand and is supported by the physical evidence reported by the PMF authors. The PMF algorithm includes the following steps:

1. Identification, in the pair of stereo images, of the elementary candidate structures for which the correspondence is to be evaluated;
2. Calculation of the homologous structures under the conditions of epipolar geometry. The search for homologous structures takes place only by analyzing the pixels of the stereo images horizontally;
3. Assumption that the correspondence is unique, i.e., a structure in one image has only one homologous structure in the other image and vice versa. Obviously there will be situations, due to occlusion, in which a structure has more than one candidate structure in the other image. For each candidate homologous pair, a likelihood index is incremented according to the number of other homologous structures found that do not violate the chosen disparity gradient limit.


4. Choice of the homologous structures with the highest likelihood index. The uniqueness constraint removes the other, incorrect, homologous pairs, which are excluded from further consideration.
5. Return to step 2, and the indices are recomputed considering the homologous points derived so far.
6. The algorithm terminates when all possible pairs of homologous points have been extracted.

From the procedure described, it can be observed that the PMF algorithm assumes that a set of candidate homologous points is found in each stereo image and proposes to find the correspondence for pairs of points (A, B), i.e., for pairs of homologous points. The calculation of the correspondence of the single points is facilitated by the constraint of epipolar geometry, and the uniqueness of the correspondence is used in step 4 to prevent the same point from being used more than once in the calculation of the disparity gradient. The likelihood index reflects the fact that the more unlikely the correspondences are, the farther they are from the limit value of the disparity gradient; in fact, candidate pairs of homologous points are those that have a disparity gradient close to the unit value. It is reasonable to consider only pairs that fall within a circular area of radius seven, although this value depends on the geometry of the vision system and of the scene. This means that small values of the disparity difference D are easily detected and discarded when caused by points that are very close together in 3D space. The PMF algorithm has been successfully tested on several natural and artificial scenes. When the constraints of uniqueness, epipolarity, and the disparity gradient limit are violated, the results are not good. In these latter conditions, it is possible to use algorithms that calculate the correspondence for a number of points greater than two. For example, we can organize the Structures of Interest (SdI) as a set of nodes related to each other in topological terms. Their representation can be organized as a graph H(V, E), where V = {A1, A2, . . . , An} is the set of Structures of Interest Ai and E = {e1, e2, . . . , en} is the set of arcs that constitute the topological relations between nodes (for example, based on the Euclidean distance). In this case, the matching process reduces to finding the structures of interest SdIL and SdIR in the pair of stereo images, organizing them in the form of graphs, and subsequently performing the comparison of graphs or subgraphs (see Fig. 4.60). In graph theory, the comparison of graphs or subgraphs is called graph or subgraph isomorphism. The problem, however, is that many graphs HL and HR can be generated from the set of potentially homologous points, and such graphs can never be identical because of the differences between the two stereo images. The problem becomes tractable when it is posed as evaluating the similarity of the graphs or subgraphs. In the literature, several algorithms are proposed for the graph comparison problem [35–38].


Fig. 4.60 Topological organization between structures of interest

References

1. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 211 (1981), pp. 151–180
2. D. Marr, E. Hildreth, Theory of edge detection, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 207 (1167) (1980), pp. 187–217
3. S.W. Kuffler, Discharge patterns and functional organization of mammalian retina. J. Neurophysiol. 16(1), 37–68 (1953)
4. C. Enroth-Cugell, J.G. Robson, The contrast sensitivity of retinal ganglion cells of the cat. J. Neurophysiol. 187(3), 517–552 (1966)
5. F.W. Campbell, J.G. Robson, Application of Fourier analysis to the visibility of gratings. J. Physiol. 197, 551–566 (1968)
6. D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160(1), 106–154 (1962)
7. V. Bruce, P. Green, Visual Perception: Physiology, Psychology, and Ecology, 4th edn. (Lawrence Erlbaum Associates, 2003). ISBN 1841692387
8. H. von Helmholtz, Handbuch der physiologischen Optik, vol. 3 (Leopold Voss, Leipzig, 1867)
9. R.K. Olson, F. Attneave, What variables produce similarity grouping? J. Physiol. 83, 1–21 (1970)
10. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
11. M. Minsky, A framework for representing knowledge, in The Psychology of Computer Vision, ed. by P. Winston (McGraw-Hill, New York, 1975), pp. 211–277
12. D. Marr, H. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes. Proc. R. Soc. Lond. 200, 269–294 (1987)
13. I. Biederman, Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94, 115–147 (1987)
14. J.E. Hummel, I. Biederman, Dynamic binding in a neural network for shape recognition. Psychol. Rev. 99(3), 480–517 (1992)


15. T. Poggio, C. Koch, Ill-posed problems in early vision: from computational theory to analogue networks, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 226 (1985), pp. 303–323
16. B. Julesz, Foundations of Cyclopean Perception (The MIT Press, 1971). ISBN 9780262101134
17. L. Ungerleider, M. Mishkin, Two cortical visual systems, in Analysis of Visual Behavior, ed. by D.J. Ingle, M.A. Goodale, R.J.W. Mansfield (MIT Press, Cambridge MA, 1982), pp. 549–586
18. D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, 1st edn. (The MIT Press, 2010). ISBN 978-0262514620
19. W.E.L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual System, 4th edn. (MIT Press, Cambridge, Massachusetts, 1981). ISBN 9780262571852
20. D. Marr, T. Poggio, A computational theory of human stereo vision, in Proceedings of the Royal Society of London. Series B, vol. 204 (1979), pp. 301–328
21. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cambridge, Massachusetts, 1996)
22. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge, 2003)
23. R.Y. Tsai, A versatile camera calibration technique for 3D machine vision. IEEE J. Robot. Autom. 4, 323–344 (1987)
24. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
25. A. Goshtasby, S.H. Gage, J.F. Bartholic, A two-stage cross correlation approach to template matching. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 374–378 (1984)
26. W. Zhang, K. Hao, Q. Zhang, H. Li, A novel stereo matching method based on rank transformation. Int. J. Comput. Sci. Issues 2(10), 39–44 (2013)
27. R. Zabih, J. Woodfill, Non-parametric local transforms for computing visual correspondence, in Proceedings of the 3rd European Conference on Computer Vision (1994), pp. 150–158
28. S.B. Pollard, J.E.W. Mayhew, J.P. Frisby, PMF: a stereo correspondence algorithm using a disparity gradient limit. Perception 14, 449–470 (1985)
29. D. Scharstein, Matching images by comparing their gradient fields, in Proceedings of the 12th International Conference on Pattern Recognition, vol. 1 (1994), pp. 572–575
30. Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 11(23), 1222–1239 (2001)
31. J. Sun, N.N. Zheng, H.Y. Shum, Stereo matching using belief propagation, in Proceedings of the European Conference on Computer Vision (2002), pp. 510–524
32. C. Tomasi, R. Manduchi, Stereo matching as a nearest-neighbor problem. IEEE Trans. Pattern Anal. Mach. Intell. 20, 333–340 (1998)
33. D. Scharstein, R. Szeliski, Stereo matching with nonlinear diffusion. Int. J. Comput. Vis. 2(28), 155–174 (1998)
34. P. Burt, B. Julesz, A disparity gradient limit for binocular fusion. Science 208, 615–617 (1980)
35. N. Ayache, B. Faverjon, Efficient registration of stereo images by matching graph descriptions of edge segments. Int. J. Comput. Vis. 2(1), 107–131 (1987)
36. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160
37. R. Horaud, T. Skordas, Stereo correspondence through feature grouping and maximal cliques. IEEE Trans. Pattern Anal. Mach. Intell. 11(11), 1168–1180 (1989)
38. A. Branca, E. Stella, A. Distante, Feature matching by searching maximum clique on high order association graph, in International Conference on Image Analysis and Processing (1999), pp. 642–658

5

Shape from Shading

5.1 Introduction

With Shape from Shading, in the field of computer vision, we intend to reconstruct the shape of the visible 3D surface using only the brightness variation information, i.e., the gray-level shades present in the image. It is well known that an artist is able to represent the geometric shape of the objects of the world in a painting (black/white or color) by creating shades of gray or color. Looking at the painting, the human visual system analyzes these shades of brightness and can perceive the shape of 3D objects even though they are represented in the two-dimensional painting. The author's skill consists in projecting the 3D scene onto the 2D plane of the painting, creating for the observer, through the shades of gray (or color) level, the impression of a 3D view of the scene. The Shape from Shading approach poses essentially the analogous problem: from the variation of luminous intensity in the image we intend to reconstruct the visible surface of the scene. In other words, the inverse problem of reconstructing the shape of the visible surface from the brightness variations present in the image is known as the Shape from Shading problem. In Chap. 2 Vol. I, the fundamental aspects of radiometry involved in the image formation process were examined, culminating in the definition of the fundamental formula of radiometry. These aspects will have to be considered to solve the problem of Shape from Shading, finding solutions based on reliable physical-mathematical foundations, also considering the complexity of the problem. The phrase reconstruction of the visible surface must not be understood strictly as a 3D reconstruction of the surface. We know, in fact, that from a single point of observation of the scene a monocular vision system cannot estimate a distance measure between observer and visible object.1 Horn [1] in 1970 was the first to introduce the Shape from Shading paradigm, formulating a solution based on the knowledge of the light source (direction and distribution), the scene reflectance model, the observation point, and the geometry of the visible surface, which together contribute to the process of image formation. In other words, Horn derived the relations between the values of the luminous intensity of the image and the geometry of the visible surface (in terms of the orientation of the surface, point by point) under certain lighting conditions and reflectance models. To understand the Shape from Shading paradigm, it is necessary to introduce two concepts: the reflectance map and the gradient space.

5.2 The Reflectance Map

The basic concept of the reflectance map is to determine a function that relates the orientation of the surface to the brightness of each point of the scene. The fundamental relation Eq. (2.34) of the image formation process, described in Chap. 2 Vol. I, allows us to evaluate the brightness value of a generic point a of the image, generated by the luminous radiance reflected from a point A of the object.2 We also know that the process of image formation is conditioned by the characteristics of the optical system (focal length f and diameter of the lens d) and by the reflectance model considered. We recall that the fundamental equation linking the image irradiance E to the scene radiance is given by the following:

$$E(a,\psi) = \frac{\pi}{4}\left(\frac{d}{f}\right)^2 \cos^4\psi \cdot L(A,\theta) \qquad (5.1)$$

where L is the radiance of the object's surface and θ is the angle formed with the optical axis by the incident light ray coming from a generic point A of the object (see Fig. 5.1). It should be noted that the image irradiance E is linearly related to the radiance of the surface, is proportional to the area of the lens (defined by the diameter d), is inversely proportional to the square of the distance between lens and image plane (dependent on the focal length f), and decreases as the angle ψ between the optical axis and the line of sight increases.
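A minimal numeric sketch of Eq. (5.1) follows; the lens parameters and the radiance value are illustrative assumptions.

```python
import numpy as np

def image_irradiance(L: float, d: float, f: float, psi: float) -> float:
    """Image irradiance from scene radiance, Eq. (5.1):
    E = (pi/4) * (d/f)^2 * cos^4(psi) * L."""
    return (np.pi / 4.0) * (d / f) ** 2 * np.cos(psi) ** 4 * L

# Illustrative values: 50 mm lens with 25 mm aperture, point 10 degrees off-axis
E_on_axis = image_irradiance(L=100.0, d=0.025, f=0.050, psi=0.0)
E_off_axis = image_irradiance(L=100.0, d=0.025, f=0.050, psi=np.deg2rad(10.0))
print(E_on_axis, E_off_axis)   # off-axis irradiance is lower (cos^4 falloff)
```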

1 While with the stereo approach we have a quantitative measure of the depth of the visible surface, with shape from shading we have a nonmetric but qualitative (ordinal) reconstruction of the surface.

2 In this paragraph, we will indicate with A the generic point of the visible surface and with a its projection in the image plane, instead of, respectively, P and p as indicated in the radiometry chapter, to avoid confusion with the gradient coordinates that we will indicate in the next paragraph with (p, q).


Fig. 5.1 Relationship between the radiance of the object and the irradiance of the image


Fig. 5.2 Diagram of the components of a vision system for the shape from shading. The surface element receives in A the radiant flow L i from the source S with direction s = (θi , φi ) and a part of it is emitted in relation to the typology of material. The brightness of the I (x, y) pixel depends on the reflectance properties of the surface, on its shape and orientation defined by the angle θi between the normal n and the vector s (source direction), from the properties of the optical system and from the exposure time (which also depends on the type of sensor)

We also rewrite (with the symbols according to Fig. 5.2) Eq. (2.16) of the image irradiance, described in Chap. 2 Vol. I, Radiometric model, given by

$$I(x,y) \cong L_e(X,Y,Z) = L_e(A,\theta_e,\phi_e) = F(\theta_i,\phi_i;\theta_e,\phi_e)\cdot E_i(\theta_i,\phi_i) \qquad (5.2)$$

where we recall that F is the BRDF (Bidirectional Reflectance Distribution Function), Ei(θi, φi) is the incident irradiance associated with the radiance Li(θi, φi) coming from the source S in the direction (θi, φi), and Le(θe, φe) is the radiance emitted by the surface at A in the direction of the observer due to the incident irradiance Ei. It is assumed that the radiant energy Le(θe, φe) from A is completely projected into a in the image (under the ideal lens assumption), generating the intensity I(x, y) (see Fig. 5.2). The reflectance model described by Eq. (5.2) tells us that it depends on the direction of the incident light coming from a source (point-like or extended), on the direction of observation, on the geometry of the visible surface (so far described by a surface element centered in A(X, Y, Z)), and on the BRDF function (diffuse or specular reflection), which in this case has been indicated with the symbol F to distinguish it from the symbol f of the focal length of the lens. With the introduction of the reflectance map, we want to formalize explicitly how the irradiance at the point a of the image is influenced simultaneously by three factors3:

1. Reflectance model. A part of the light energy is absorbed (for example, converted into heat), a part can be transmitted by the illuminated object itself (by refraction and scattering), and the rest is reflected into the environment; the latter is the radiating component that reaches the observer.
2. Light source and direction.
3. Geometry of the visible surface, i.e., its orientation with respect to the observer's reference system (also called viewer-centered).

Figure 5.2 schematizes all the components of the image formation process, highlighting the orientation of the surface in A and of the lighting source. The orientation of the surface in A(X, Y, Z) is given by the vector n, which indicates the direction of the normal to the tangent plane to the visible surface passing through A. Recall that the BRDF function defines the relationship between Le(θe, φe), the radiance reflected by the surface in the direction of the observer, and Ei(θi, φi), the irradiance incident on the object at the point A coming from the source with known radiant flux Li(θi, φi) (see Sect. 2.3 of Vol. I). Under the Lambertian model, the image irradiance IL(x, y) is given by Eq. (2.20), described in Chap. 2 Vol. I, Radiometric model, which we rewrite as

$$I_L(x,y) \cong E_i \cdot \frac{\rho}{\pi} = \frac{\rho}{\pi} L_i \cos\theta_i \qquad (5.3)$$

where 1/π is the BRDF of a Lambertian surface, θi is the angle formed by the vector n normal to the surface in A, and the vector s = (θi , φi ) which represents the direction of the irradiance incident E i in A generated by the source S which emits radiant flux incident in the direction (θi , φi ). ρ indicates the albedo of the surface, seen as reflectance coefficient which expresses the ability of the surface to reflect/absorb incident irradiance at any point. The albedo has values in the range 0 ≤ ρ ≤ 1 and for energy conservation, the missing part is due to absorption. Often a surface with a

3 In this context, the influence of the optical system, the exposure time, and the characteristics of the capture sensor is excluded.

uniform reflectance coefficient is assumed, with ρ = 1 (ideal reflector, while ρ = 0 indicates an ideal absorber), even if in physical reality this hardly ever happens. From Eq. (5.3) it is possible to derive the expression of the Lambertian reflectance map, according to the three points above, given by

$$R(A,\theta_i) = \frac{\rho}{\pi} L_i \cos\theta_i \qquad (5.4)$$

which relates the radiance reflected by the surface element in A (whose direction with respect to the incident radiant flux is given by θi) with the irradiance of the image observed in a. Equation (5.4) specifies the brightness of a surface element taking into consideration the orientation of the normal n of the surface element (with respect to which the incident radiant flux is evaluated), the radiance of the source, and the Lambertian reflectance model. In these conditions the fundamental radiometry Eq. (2.37), described in Chap. 2 Vol. I, is valid, which expresses the equivalence between the radiance L(A) of the object and the irradiance E(a) of the image (symbols according to note 2):

$$I(a) = I(x,y) = I(i,j) \cong E(a,\psi) = L(A,\theta) \qquad (5.5)$$

In this context, we can denote by R(A, θi) the radiance of the object expressed by (5.4); substituting it into (5.5) we get

$$E(a) = R(A,\theta_i) \qquad (5.6)$$

which is the fundamental equation of Shape from Shading. Equation (5.6) directly links the luminous intensity, i.e., the irradiance E(a) of the image to the orientation of the visible surface at the point A, given by the angle θi between the surface normal vector n and the vector s of incident light direction (see Fig. 5.2). We summarize the properties of this simple Lambertian radiometric model: (a) The radiance of a surface element ( patch) A is independent of the point of observation. (b) The brightness of a pixel in a is proportional to the radiance of the corresponding surface element A of the scene. (c) The radiance of a surface element is proportional to the cosine of the angle θi formed between the normal to the patch and the direction of the source. (d) It follows that the brightness of a pixel is proportional to the cosine of this angle (Lambert’s cosine law).

5.2.1 Gradient Space

The fundamental equation of Shape from Shading given by (5.6) is conveniently expressed in the coordinates defined with respect to the reference system of the image. To do this, it is convenient to introduce the concept of gradient space and



Fig. 5.3 Graphical representation of the gradient space. Planes parallel to the plane x − y (for example, the plane Z = 1, whose normal has components p = q = 0) have zero gradients in both the x and y directions. For a generic patch (not parallel to the xy plane) of the visible surface, the orientation of the normal vector (p, q) is given by (5.10) and the equation of the plane containing the patch is Z = px + qy + k. In the gradient space p − q the orientation of each patch is represented by a point indicated by the gradient vector (p, q). The direction of the source, given by the vector s, is also reported in the gradient space and is represented by the point (ps, qs)

to review both the reference system of the visible surface and that of the source, which becomes the reference system of the image plane (x, y) as shown in Figs. 5.2 and 5.3. It can be observed how, under the hypothesis of a convex visible surface, each of its generic points A has a tangent plane, and its normal outgoing from A indicates the attitude (or orientation) of the 3D surface element in space represented by the point A(X, Y, Z). The projection of A in the image plane identifies the point a(x, y), which can be obtained from the perspective equations (see Sect. 3.6 Vol. II, Perspective Transformation):

$$x = f\frac{X}{Z} \qquad y = f\frac{Y}{Z} \qquad (5.7)$$

remembering that f is the focal length of the optical system. If the distance of the object from the vision system is very large, the geometric projection model can be simplified by assuming the following orthographic projection:

$$x = X \qquad y = Y \qquad (5.8)$$

which means projecting the points of the surface through parallel rays and, less than a scale factor, the horizontal and vertical coordinates of the reference system of the image plane (x, y) and of the reference system (X, Y, Z ) of the world, coincide. With f → ∞ implies that Z → ∞ for which Zf becomes the unit that justifies the (5.8) for the orthographic geometric model.


Under these conditions, for the geometric reconstruction of the visible surface, i.e., determining the distance Z of each point from the observer, Z can be thought of as a function of the coordinates of the same point A projected in the image plane in a(x, y). Reconstructing the visible surface then means finding the function

$$z = Z(x,y) \qquad (5.9)$$

The fundamental equation of Shape from Shading (5.6) cannot compute the distance function Z(x, y), but through the reflectance map R(A, θn) the orientation of the surface can be reconstructed, point by point, obtaining the so-called orientation map. In other words, this involves calculating, for each point A of the visible surface, the normal vector n, i.e., the local slope of the visible surface, which expresses how the tangent plane passing through A is oriented with respect to the observer. The local slopes at each point A(X, Y, Z) are estimated by evaluating the partial derivatives of Z(x, y) with respect to x and y. The gradient of the surface Z(x, y) at the point A(X, Y, Z) is given by the vector (p, q) obtained with the following partial derivatives:

$$p = \frac{\partial Z(x,y)}{\partial x} \qquad q = \frac{\partial Z(x,y)}{\partial y} \qquad (5.10)$$

where p and q are, respectively, the components of the surface gradient along the x-axis and the y-axis. Having computed, with the gradient of the surface, the orientation of the tangent plane, the gradient vector (p, q) is linked to the normal n of the surface element centered in A by the following relationship:

$$\mathbf{n} = (p, q, 1)^T \qquad (5.11)$$

which shows how the orientation of the surface in A can be expressed by the normal vector n, of which we are interested only in the direction and not in the module. More precisely, (5.11) tells us that for unit variations of the distance Z, the variations x and y in the image plane, around the point a(x, y), must be p and q, respectively. To obtain a unit normal vector, it is necessary to divide the normal n to the surface by its length:

$$\mathbf{n}_N = \frac{\mathbf{n}}{|\mathbf{n}|} = \frac{(p,q,1)}{\sqrt{1+p^2+q^2}} \qquad (5.12)$$

The pair (p, q) constitutes the gradient space, which represents the orientation of each element of the surface (see Fig. 5.3). The gradient space expressed by (5.11) can be seen as a plane parallel to the plane X-Y placed at the distance Z = 1. The geometric characteristics of the visible surface can be specified as a function of the coordinates of the image plane x-y, while the coordinates p and q of the gradient space have been defined to specify the orientation of the surface. The map of the orientation of the visible surface is given by the following equations:

$$p = p(x,y) \qquad q = q(x,y) \qquad (5.13)$$

The origin of the gradient space is given by the normal vector (0, 0, 1), which is normal to the image plane, that is, with the visible surface parallel to the image plane. The more the normal vectors move away from the origin of the gradient space, the larger the inclination of the visible surface is compared to the observer. In the gradient space is also reported the direction of the light source expressed by the gradient components ( ps , qs ). The shape of the visible surface, as an alternative to the gradient space ( p, q), can also be expressed by considering the angles (σ, τ ) which are, respectively, the angle between the normal n and the Z -axis (reference system of the scene) which is the direction of the observer, and the angle between the projection of the normal n in the image plane and the x-axis of the image plane.
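A small sketch, not from the text, of how the gradient components (p, q) of Eq. (5.10) and the unit normals of Eq. (5.12) can be computed numerically from a synthetic depth map; the test surface and grid size are arbitrary assumptions.

```python
import numpy as np

# Synthetic smooth surface Z(x, y); np.gradient approximates the partial
# derivatives p = dZ/dx and q = dZ/dy of Eq. (5.10).
x = np.linspace(-1.0, 1.0, 64)
y = np.linspace(-1.0, 1.0, 64)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))            # a smooth bump as the visible surface

q, p = np.gradient(Z, y, x)           # gradient along rows (y) and columns (x)

# Unit normals n_N = (p, q, 1) / sqrt(1 + p^2 + q^2), Eq. (5.12)
norm = np.sqrt(1.0 + p**2 + q**2)
nx, ny, nz = p / norm, q / norm, 1.0 / norm

print(p[32, 32], q[32, 32])                # near the top of the bump the gradient is ~0
print(nx[32, 32], ny[32, 32], nz[32, 32])  # normal close to (0, 0, 1)
```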

5.3 The Fundamental Relationship of Shape from Shading for Diffuse Reflectance

Knowing the reflectance map R(A, θi) and the orientation of the source (ps, qs), it is possible to deal with the problem of Shape from Shading, that is, to calculate (p, q) from the pixel intensity at position (x, y). We must actually find the relation that links the luminous intensity E(x, y) in the image plane with the orientation of the surface expressed in the gradient space (p, q). This relation must be expressed in the reference system of the image plane (viewer centered), with respect to which the normal vector n is given by (p, q, 1) and the vector s indicating the direction of the source is given by (ps, qs, 1). Recalling that the cosine of the angle formed by two vectors is given by the inner product of the two vectors divided by the product of their lengths, for the two vectors corresponding to the normal n and to the direction of the source s the cosine of the angle of incidence θn, according to (5.12), is the following:

$$\cos\theta_n = \mathbf{n}\cdot\mathbf{s} = \frac{(-p,-q,1)}{\sqrt{1+p^2+q^2}}\cdot\frac{(-p_s,-q_s,1)}{\sqrt{1+p_s^2+q_s^2}} = \frac{1+p_s p+q_s q}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}} \qquad (5.14)$$

The reflectance map (i.e., the reflected radiance Le(A, θn) of the Lambertian surface) expressed by Eq. (5.4) can be rewritten in the coordinates p and q of the gradient space; substituting (5.14), it becomes

$$R(A,\theta_n) = \frac{\rho}{\pi} L_i \frac{1+p_s p+q_s q}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}} \qquad (5.15)$$

This equation represents the starting point for applying Shape from Shading, under the following conditions:

1. The geometric model of image formation is orthographic;
2. The reflectance model is Lambertian (diffuse reflectance);
3. The optical system has a negligible impact and the visible surface of the objects is illuminated directly by the source with incident radiant flux Li;
4. The optical axis coincides with the Z-axis of the vision system and the visible surface Z(x, y) is described at each point in terms of the orientation of the normal in the gradient space (p, q);
5. The local point source is very far from the scene.

The second member of Eq. (5.15), which expresses the radiance L(A, θn) of the surface, is called the reflectance map R(p, q), which when rewritten becomes

$$R(p,q) = \frac{\rho}{\pi} L_i \frac{1+p_s p+q_s q}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}} \qquad (5.16)$$

The reflectance map can be calculated for a given type of material and for a defined type of illumination for all possible orientations p and q of the surface, producing the reflectance map R(p, q), whose values can be normalized (maximum value 1) to be invariant with respect to the variability of the acquisition conditions. Finally, assuming the invariance of the radiance of the visible surface, namely that the radiance L(A) expressed by (5.15) is equal to the irradiance of the image E(x, y), as already expressed by Eq. (5.5), we finally obtain the following image irradiance equation:

$$E(x,y) = R_{l,s}\big(p(x,y), q(x,y)\big) \qquad (5.17)$$

where l and s indicate the Lambertian reflectance model and the source direction, respectively. This equation tells us that the irradiance (or luminous intensity) in the image plane in the location (x, y) is equal to the value of the reflectance map R( p, q) corresponding to the orientation ( p, q) of the surface of the scene. If the reflectance map is known (computable with the 5.16), for a given position of the source, the reconstruction of the visible surface z = Z (x, y) is possible in terms of orientation ( p, q) of the same, for each point (x, y) of the image. Let us remember that the   ∂Z ∂Z orientation of the surface is given by the gradient p = ∂ x , q = ∂ y . Figure 5.4 shows two examples of reflectance map, represented graphically by iso-brightness curves, under the Lambertian reflectance model conditions and with point source illuminating in the direction ( ps = 0.7 and qs = 0.3) and ( ps = 0.0 and qs = 0.0) a sphere. In the latter condition, the incident light is adjacent to the observer, that is, source and observer see the visible surface from the same direction. Therefore, the alignment of the normal n to the source vector s implies that θn = 0◦ with cos θn = 1 and consequently for (5.4) has the maximum reflectance value R( p, q) = πρ L i . When the two vectors are orthogonal, the surface in A is not illuminated having cos θn = π/2 = 0 with reflectance value R( p, q) = 0. The iso-brightness curves represent the set of points that in the gradient space have the different orientations ( p, q) but derive from points that in the image plane (x, y) have the same brightness. In the two previous figures, it can be observed how these curves are different in the two reflectance maps for having simply changed the


Fig. 5.4 Examples of reflectance maps. a Lambertian spherical surface illuminated by a source placed in the position ( ps , qs ) = (0.7, 0.3); b iso-brightness curves (according to (5.14) where R( p, q) = constant, that is, points of the gradient space that have the same brightness) calculated for the source of the example of figure (a) which is the general case. It is observed that for R( p, q) = 0 the corresponding curve is reduced to a line, for R( ps , qs ) = 1 the curve is reduced to a point where there is the maximum brightness (levels of normalized brightness between 0 and 1), while in the intermediate brightness values the curves are ellipsoidal, parabolic, and then hyperbole until they become asymptotically a straight line (zero brightness). c Lambertian spherical surface illuminated by a source placed at the position ( ps , qs ) = (0, 0) that is at the top in the same direction as the observer. In this case, the iso-brightness curves are concentric circles with respect to the point of maximum brightness (corresponding to the origin (0, 0) of the gradient space). Considering the image irradiance Eq. (5.16), for the different values of the intensity E i , the equations of isobrightness circumferences are given by p 2 +q 2 = (1/E i −1) by virtue of (5.14) for ( ps , qs ) = (0, 0)

direction of illumination, oblique to the observer in a case and coinciding with the direction of the observer in the other case. In particular, two points in the gradient space that lie on the same curve indicate two different orientations of the visible surface that reflect the same amount of light and are, therefore, perceived with the same brightness by the observer even if the local surface is oriented differently in the space. The iso-brilliance curves suggest the non-linearity of Eq. (5.16), which links the radiance to the orientation of the surface. It should be noted that using a single image I (x, y), despite knowing the direction of the source ( ps , qs ) and the reflectance model Rl,s ( p, q), the SfS problem is not solved because with (5.17) it is not possible to calculate the orientation ( p, q) of each surface element in a unique way. In fact, with a single image each pixel has only one value, the luminous intensity E(x, y) while, the orientation of the corresponding patch is defined by the two components p and q of the gradient according to Eq. (5.17) (we have only one equation and two unknowns p and q). Figure 5.4a and b shows how a single intensity of a pixel E(x, y) corresponds different orientations ( p, q) belonging to the same iso-brilliance curve in the map of reflectance. In the following paragraphs, some solutions to the problem of SfS are reported.
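The Lambertian reflectance map of Eq. (5.16) can be sampled over the gradient space with a few lines of code; the sketch below is illustrative (the clipping of negative values, i.e., surface elements facing away from the source, is an added assumption and is not part of Eq. (5.16)).

```python
import numpy as np

def reflectance_map(p, q, ps, qs, rho=1.0, Li=1.0):
    """Lambertian reflectance map R(p, q) of Eq. (5.16) for a source (ps, qs).
    Negative cosines (surface facing away from the light) are clipped to 0 (assumption)."""
    num = 1.0 + ps * p + qs * q
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return (rho / np.pi) * Li * np.clip(num / den, 0.0, None)

# Sample R over the gradient space: the iso-brightness curves are the level sets of R
p, q = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
R = reflectance_map(p, q, ps=0.7, qs=0.3)
print(R.max(), R[100, 100])   # maximum where (p, q) is aligned with (ps, qs)
```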


5.4 Shape from Shading-SfS Algorithms

Several researchers [2,3] have proposed solutions to the SfS problem inspired by the visual perception of the 3D surface. In particular, in [3] it is reported that the brain retrieves shape information not only from shading, but also from contours, elementary features, and visual knowledge of objects. The SfS algorithms developed in computer vision use ad hoc solutions very different from those hypothesized for human vision. Indeed, the proposed solutions use minimization methods [4] of an energy function, propagation methods [1] which extend the shape information from a set of surface points to the whole image, and local methods [5] which derive the shape from the luminous intensity assuming a locally spherical surface. Let us now return to the fundamental equation of Shape from Shading, Eq. (5.17), by which we propose to reconstruct the orientation (p, q) of each visible surface element (i.e., calculate the orientation map of the visible surface), given the irradiance E(x, y), that is, the energy reemitted at each point of the image in relation to the reflectance model. We are faced with an ill-conditioned problem, since for each pixel E(x, y) we have a single equation with two unknowns p and q. This problem can be solved by imposing additional constraints, for example, the condition that the visible surface is continuous, in the sense that it has a minimal geometric variability (smooth surface). This constraint implies that, in the gradient space, p and q vary little locally, in the sense that nearby points in the image plane presumably represent orientations whose positions in the gradient space are also very close to each other. It follows that the conditions of continuity of the visible surface will be violated only where large geometric variations occur, which normally happen at edges and contours. A strategy to solve the image irradiance Eq. (5.17) consists in not finding an exact solution but defining a function to be minimized that includes a term representing the error of the image irradiance equation e_I and a term that controls the constraint of geometric continuity e_c. The first term e_I is given by the difference between the irradiance of the image E(x, y) and the reflectance function R(p, q):

$$e_I = \iint \left[E(x,y) - R(p,q)\right]^2 dx\,dy \qquad (5.18)$$

The second term e_c, based on the constraint of geometric continuity of the surface, is derived from the condition that the directional gradients p and q vary very slowly (and, to a greater extent, their partial derivatives) with respect to the directions x and y, respectively. The error e_c due to geometric continuity is then defined by minimizing the integral of the sum of the squares of such partial derivatives, and is given by

$$e_c = \iint \left(p_x^2 + p_y^2 + q_x^2 + q_y^2\right) dx\,dy \qquad (5.19)$$


The total error function e_T to be minimized, which combines the two previous error terms, is given by:

$$e_T = e_I + \lambda e_c = \iint \left\{\left[E(x,y) - R(p,q)\right]^2 + \lambda\left(p_x^2 + p_y^2 + q_x^2 + q_y^2\right)\right\} dx\,dy \qquad (5.20)$$

where λ is the positive parameter that weighs the influence of the geometric continuity error e_c with respect to that of the image irradiance. A possible solution to minimize the total error function is given by the variational calculus, which through an iterative process determines the minimum acceptable solution of the error. It should be noted that the function to be minimized (5.20) depends on the surface orientations p(x, y) and q(x, y), which are functions of the variables x and y of the image plane. Recall from (5.5) that the irradiance E(x, y) can be represented by the digital image I(i, j), where i and j are the row and column coordinates, respectively, locating a pixel at position (i, j) that contains the observed light intensity coming from the surface element whose orientation is denoted in the gradient space by p_ij and q_ij. Having defined these new symbols for the digital image, the procedure to solve the problem of Shape from Shading, based on the minimization method, is the following:

1. Orientation initialization. For each pixel I(i, j), initialize the orientations p_ij^0 and q_ij^0.
2. Equivalence constraint between image irradiance and reflectance map. The luminous intensity I(i, j) for each pixel must be very similar to that produced by the reflectance map, derived analytically under Lambertian conditions or evaluated experimentally knowing the optical properties of the surface and the orientation of the source.
3. Constraint of geometric continuity. Calculation of the partial derivatives of the reflectance map $\left(\frac{\partial R}{\partial p}, \frac{\partial R}{\partial q}\right)$, analytically when R(p_ij, q_ij) is Lambertian (by virtue of Eqs. 5.14 and 5.17), or estimated numerically from the reflectance map obtained experimentally.
4. Calculation by iterations of the gradient estimate (p, q). Iterative process based on the Lagrange multiplier method which minimizes the total error e_T, defined with (5.20), through the following update rules that find p_ij and q_ij to reconstruct the unknown surface z = Z(x, y) (a minimal implementation sketch is given after Eq. (5.22)):

$$p_{ij}^{n+1} = \bar{p}_{ij}^{\,n} + \lambda\left[I(i,j) - R(\bar{p}_{ij}^{\,n}, \bar{q}_{ij}^{\,n})\right]\frac{\partial R}{\partial p} \qquad q_{ij}^{n+1} = \bar{q}_{ij}^{\,n} + \lambda\left[I(i,j) - R(\bar{p}_{ij}^{\,n}, \bar{q}_{ij}^{\,n})\right]\frac{\partial R}{\partial q} \qquad (5.21)$$


Fig. 5.5 Examples of orientation maps calculated from images acquired with Lambertian illumination with source placed at the position ( ps , qs ) = (0, 0). a Map of the orientation of the spherical surface of Fig. 5.4c. b Image of real objects with flat and curved surfaces. c Orientation map relative to the image of figure b

where $\bar{p}_{ij}$ and $\bar{q}_{ij}$ denote the mean values of $p_{ij}$ and $q_{ij}$ calculated over the four local neighbors (around the location (i, j) in the image plane) with the following equations:

$$\bar{p}_{ij} = \frac{p_{i+1,j} + p_{i-1,j} + p_{i,j+1} + p_{i,j-1}}{4} \qquad \bar{q}_{ij} = \frac{q_{i+1,j} + q_{i-1,j} + q_{i,j+1} + q_{i,j-1}}{4} \qquad (5.22)$$
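The following Python sketch illustrates one possible implementation of the update rules (5.21)–(5.22) for a Lambertian reflectance map with a known source direction. It is a minimal illustration, not the authors' code: the function names, the numerical estimation of ∂R/∂p and ∂R/∂q by central differences, and the fixed number of iterations are assumptions made here for brevity.

```python
import numpy as np

def lambertian_R(p, q, ps, qs):
    # Reflectance map of Eq. (5.14), clamped to zero for self-shadowed orientations.
    r = (1 + ps * p + qs * q) / (np.sqrt(1 + p**2 + q**2) * np.sqrt(1 + ps**2 + qs**2))
    return np.clip(r, 0.0, None)

def sfs_minimization(I, ps, qs, lam=0.1, n_iter=500, eps=1e-3):
    """Iterative SfS: returns the gradient maps (p, q) for an image I scaled in [0, 1]."""
    p = np.zeros_like(I)
    q = np.zeros_like(I)
    for _ in range(n_iter):
        # Local averages of Eq. (5.22) over the 4-neighborhood (replicated borders).
        pad_p = np.pad(p, 1, mode="edge")
        pad_q = np.pad(q, 1, mode="edge")
        p_bar = (pad_p[2:, 1:-1] + pad_p[:-2, 1:-1] + pad_p[1:-1, 2:] + pad_p[1:-1, :-2]) / 4
        q_bar = (pad_q[2:, 1:-1] + pad_q[:-2, 1:-1] + pad_q[1:-1, 2:] + pad_q[1:-1, :-2]) / 4
        # Irradiance error and numerical partial derivatives of R (central differences).
        err = I - lambertian_R(p_bar, q_bar, ps, qs)
        dR_dp = (lambertian_R(p_bar + eps, q_bar, ps, qs) - lambertian_R(p_bar - eps, q_bar, ps, qs)) / (2 * eps)
        dR_dq = (lambertian_R(p_bar, q_bar + eps, ps, qs) - lambertian_R(p_bar, q_bar - eps, ps, qs)) / (2 * eps)
        # Update rules of Eq. (5.21).
        p = p_bar + lam * err * dR_dp
        q = q_bar + lam * err * dR_dq
    return p, q
```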

The iterative process (step 4) continuously updates p and q until the total error e_T reaches a reasonable minimum value after several iterations, or stabilizes. It should be noted that, although in the iterative process the estimates of p and q are evaluated locally, a global consistency of the surface orientation is achieved through the propagation of constraints (2) and (3) over many iterations. Other minimization procedures can be considered to improve the convergence of the iterative process. The procedure described above to solve the problem of Shape from Shading is very simple but presents many difficulties when one tries to apply it concretely to real cases. This is due to the imperfect knowledge of the reflectance characteristics of the materials and to the difficulty of controlling the lighting conditions of the scene. Figure 5.5 shows two examples of orientation maps calculated from images acquired with Lambertian illumination with the source placed at the position (p_s, q_s) = (0, 0). Figure 5.5a refers to the sphere of Fig. 5.4c, while Fig. 5.5c refers to a more complex real scene, constituted by overlapping opaque objects whose dominant surfaces (cylindrical and spherical) present a good geometric continuity except for the contour zones and the shadow areas (which contribute to errors in reconstructing the shape of the surface), where the intensity levels vary abruptly. Even if the reconstruction of the orientation of the surface does not appear perfect in every pixel of the image, overall, the results of the algorithm of


Shape from Shading are acceptable for the purposes of perceiving the shape of the visible surface. For a better visual effect, Fig. 5.5 graphically displays the gradient information (p, q) in terms of the orientation of the normals (representing the orientation of a surface element) with respect to the observer, drawn as oriented segments (perceived as oriented needles). In the literature, such orientation maps are also called needle maps. The orientation map, together with the depth map (when known and obtained by other methods, for example, with stereo vision), becomes essential for the 3D reconstruction of the visible surface, for example, by combining the depth and orientation information through an interpolation process (a problem that we will describe in the following paragraphs).

5.4.1 Shape from Stereo Photometry with Calibration

With stereo photometry, we want to recover the 3D surface of the observed objects through the orientation map obtained from different images acquired from the same point of view, but illuminated by known light sources with known directions. We have already seen above that it is not possible to derive the 3D shape of a surface using a single image with the SfS approach, since there is an indefinite number of orientations that can be associated with the same intensity value (see Fig. 5.4a and b). Moreover, remembering (5.17), the intensity E(x, y) of a pixel has only one degree of freedom while the orientation of the surface has two, p and q. Therefore, additional information is needed to calculate the orientation of a surface element. One solution is given by the stereo photometry approach [6], which calculates the (p, q) orientation of a patch using different images of the same scene, acquired from the same point of view but illuminating the scene from different directions, as shown in Fig. 5.6. The figure shows three different positions of the light source with the same observation point; images of the scene with different shading are acquired in succession. For each image acquisition, only one lamp is on. The different lighting directions lead to different reflectance maps. Now let us see how the stereo photometry approach solves the problem of the poor conditioning of the SfS approach. Figure 5.7a shows two superimposed reflectance maps obtained as expected by stereo photometry. For clarity, only the iso-brightness curve 0.4 (in red) of the second reflectance map R_2 is superimposed. We know from (5.14) that the Lambertian reflectance function is not linear, as shown in the figure by the iso-brightness curves. The latter represent the different (p, q) orientations of the surface with the same luminous intensity I(i, j), related to each other by (5.17), which we rewrite as follows:

$$I(i,j) = E(x,y) = R_{l,s}(p,q) \qquad (5.23)$$


Fig. 5.6 Acquisition system for the stereo photometry approach. In this experimental setup, three lamps are used, positioned at the same height and arranged 120° apart on the base of an inverted cone, with the objects to be acquired placed at the apex of the cone

The intensity of the pixels I(i, j), for the images acquired with stereo photometry, varies with the different local orientation of the surface and with the different orientation of the sources (see Fig. 5.6). Therefore, if we consider two stereo photometry images I_1(i, j) and I_2(i, j), according to the Lambertian reflectance model (5.14), we will have two different reflectance maps R_1 and R_2 associated with the two different orientations s_1 and s_2 of the sources. Therefore, applying (5.23) to the two images, we will have

$$I_1(i,j) = R_{1\,l,s_1}(p,q) \qquad I_2(i,j) = R_{2\,l,s_2}(p,q) \qquad (5.24)$$

where l indicates the Lambertian reflectance model. The orientation of each element of the visible surface, instead, is not modified with respect to the observer. Changing the position of the source in each acquisition changes the angle between the vectors s and n (which, respectively, indicate the orientation of the source and that of the normal to the patch), and this leads to a different reflectance map according to (5.15). It follows that the acquisition system (the observer remains stationary while acquiring the images I_1 and I_2) sees the same surface element with orientation (p, q) but with two different values of luminous intensity, respectively, I_1(i, j) and I_2(i, j), due only to the different lighting conditions. In the gradient space, the two reflectance maps R_{1 l,s_1}(p, q) and R_{2 l,s_2}(p, q), given by Eq. (5.24), which establish a relationship between the pair of intensity values of the pixel (i, j) and the orientation of the corresponding surface element, can be superimposed. In this gradient space (see Fig. 5.7a), with the overlap of the two reflectance


Fig. 5.7 Principle of stereo photometry. The orientation of a surface element is determined through multiple reflectance maps obtained from images acquired with different orientations of the lighting source (assuming Lambertian reflectance). a Considering two stereo photometry images I_1 and I_2, they are associated with the corresponding reflectance maps R_1(p, q) and R_2(p, q), which establish a relationship between the pair of intensity values of the pixel (i, j) and the orientation of the corresponding surface element. In the gradient space where the reflectance maps are superimposed, the orientation (p, q) associated with the pixel (i, j) can be determined by the intersection of the two iso-brightness curves which in the corresponding maps represent the respective intensities I_1(i, j) and I_2(i, j). Two curves can intersect in one or two points, thus generating two possible orientations. b To obtain a single value (p(i, j), q(i, j)) of the orientation of the patch, it is necessary to acquire at least another photometric stereo image I_3(i, j) and superimpose in the gradient space the corresponding reflectance map R_3(p, q). A unique orientation (p(i, j), q(i, j)) is obtained with the intersection of the third curve corresponding to the value of the intensity I_3(i, j) in the map R_3

maps, the orientation (p, q) associated with the pixel (i, j) can be determined by the intersection of the two iso-brightness curves which in the corresponding maps represent the respective intensities I_1(i, j) and I_2(i, j). The figure graphically shows this situation where, for simplicity, several iso-brightness curves for normalized intensity values between 0 and 1 have been plotted only for the map R_{1 l,s_1}(p, q), while for the reflectance map R_{2 l,s_2}(p, q) only the curve associated with the luminous intensity I_2(i, j) = 0.4 of the same pixel (i, j) is plotted. Two curves can intersect in one or two points, thus generating two possible orientations for the same pixel (i, j) (due to the non-linearity of Eq. 5.15). The figure shows two gradient points P(p_1, q_1) and Q(p_2, q_2), the intersections of the two iso-brightness curves corresponding to the intensities I_1(i, j) = 0.9 and I_2(i, j) = 0.4, both candidates as possible orientations of the surface corresponding to the pixel (i, j).


To obtain a single value (p(i, j), q(i, j)) of the orientation of the patch, it is necessary to acquire at least another stereo photometry image I_3(i, j), always from the same point of observation but with a different orientation s_3 of the source. This involves the calculation of a third reflectance map R_{3 l,s_3}(p, q), obtaining a third image irradiance equation:

$$I_3(i,j) = R_{3\,l,s_3}(p,q) \qquad (5.25)$$

The superposition in the gradient space of the corresponding reflectance map R_{3 l,s_3}(p, q) leads to a unique orientation (p(i, j), q(i, j)), given by the intersection of the third curve corresponding to the value of the intensity I_3(i, j) = 0.5 in the map R_3. Figure 5.7b shows the gradient space that also includes the iso-brightness curves corresponding to the third image I_3(i, j); the solution gradient point is highlighted, i.e., the one resulting from the intersection of the three iso-brightness curves associated with the set of intensity values (I_1, I_2, I_3) detected for the pixel (i, j). This method of stereo photometry directly estimates, as an alternative to the previous SfS procedure, the orientation of the surface by observing the scene from the same position but acquiring at least three images under three different lighting conditions. The name stereo photometry derives from the use of several different positions of the light source (and only one observation point), in analogy with stereo vision, which instead observes the scene from different points but under the same lighting conditions. Before generalizing the procedure of stereo photometry, we recall the image irradiance equation in the Lambertian context (5.17), made explicit in terms of the reflectance map from (5.16), which also includes the factor that takes into account the reflectivity characteristics of the various materials. The rewritten equation is

$$I(i,j) = R(p,q) = \frac{\rho(i,j)}{\pi}\,L_i\,\frac{1 + p_s p + q_s q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}} \qquad (5.26)$$

where the term ρ, which varies between 0 and 1, indicates the albedo, i.e., the coefficient that controls at each pixel (i, j) the reflecting power of a Lambertian surface for various types of materials. In particular, the albedo takes into account, in relation to the type of material, how much of the incident luminous energy is reflected toward the observer.4 Now, let us again consider the Lambertian reflectance model, where a source with diffuse lighting is assumed. In these conditions, if L_S is the radiance of the source,


the radiance received by a patch A of the surface, that is, its irradiance, is given by I_A = π L_S (considering the visible hemisphere). Considering the Lambertian model of surface reflectance ($BRDF_l = \frac{1}{\pi}$), the brightness of the patch, i.e., its radiance L_A, is given by

$$L_A = BRDF_l \cdot I_A = \frac{1}{\pi}\,\pi L_S = L_S$$

Therefore, a Lambertian surface emits the same radiance as the source and its luminosity does not depend on the observation point, but may vary point to point through the multiplicative factor of the albedo ρ(i, j). Under the Lambertian reflectance conditions, (5.26) suggests that the variables become the orientation of the surface (p, q) and the albedo ρ. The unit vector of the normal n to the surface is given by Eq. (5.12), while the orientation vector of a source is known, expressed by s = (S_X, S_Y, S_Z). The mathematical formalism of stereo photometry involves the application of the irradiance Eq. (5.26) to the 3 images I_k(i, j), k = 1, 2, 3, acquired successively for the 3 different orientations S_k = (S_{k,X}, S_{k,Y}, S_{k,Z}), k = 1, 2, 3, of the Lambertian diffuse light sources. Assuming the albedo ρ constant and remembering Eq. (5.14) of the reflectance map for a Lambertian surface illuminated from a generic orientation s_k, the image irradiance equation for I_k(i, j) becomes

$$I_k(i,j) = \frac{L_i}{\pi}\,\rho(i,j)\,(\mathbf{s}_k \cdot \mathbf{n})(i,j) = \frac{L_i}{\pi}\,\rho(i,j)\cos\theta_k \qquad k = 1,2,3 \qquad (5.27)$$

where θ_k represents the angle between the source vector s_k and the normal vector n of the surface. Equation (5.27) reiterates that the radiance reflected toward the observer, generated by a source of incident radiance L_i in the context of diffused (Lambertian) light, is proportional to the cosine of the angle formed between the vector of the incident light and the unit vector of the surface normal, i.e., the light intensity that reaches the observer is proportional to the inner product of these two unit vectors, assuming the albedo ρ constant.

4 From a theoretical point of view, the quantification of the albedo is simple: it would be enough to measure the reflected and the incident radiation on a body with an instrument and take the ratio of such measurements. In physical reality, the measurement of the albedo is complex for various reasons: 1. the incident radiation does not come from a single source but normally comes from different directions; 2. the energy reflected by a body is never unidirectional but multidirectional, the reflected energy is not uniform in all directions, and a portion of the incident energy can be absorbed by the body itself; 3. the measured reflected energy is only partial due to the limited angular aperture of the detector sensor. Therefore, the reflectance measurements are to be considered as samples of the BRDF function. The albedo is considered as a coefficient of global average reflectivity of a body. With the BRDF function, it is possible to model instead the directional distribution of the energy reflected from a body associated with a solid angle.


Equation (5.27) of stereo photometry can be expressed in matrix terms to write the system of three equations as follows:

$$\begin{bmatrix} I_1(i,j)\\ I_2(i,j)\\ I_3(i,j)\end{bmatrix} = \frac{L_i}{\pi}\,\rho(i,j)\begin{bmatrix} S_{11} & S_{12} & S_{13}\\ S_{21} & S_{22} & S_{23}\\ S_{31} & S_{32} & S_{33}\end{bmatrix}\begin{bmatrix} n_X(i,j)\\ n_Y(i,j)\\ n_Z(i,j)\end{bmatrix} \qquad (5.28)$$

and in compact form we have

$$\mathbf{I} = \frac{L_i}{\pi}\,\rho(i,j)\,\mathbf{S}\cdot\mathbf{n} \qquad (5.29)$$

So we have a system of linear equations to calculate, at each pixel (i, j), from the images I_k(i, j), k = 1, 2, 3 (the triad of stereo photometry measurements), the orientation of the unit vector n(i, j) normal to the surface, knowing the matrix S which includes the three known directions of the sources (fixed once the stereo photometry acquisition system has been configured) and assuming a constant albedo ρ(i, j) at any point on the surface. Solving (5.29) with respect to the normal n(i, j), we obtain the following system of linear equations:

$$\mathbf{n} = \frac{\pi}{\rho L_i}\,\mathbf{S}^{-1}\mathbf{I} \qquad\Longrightarrow\qquad \rho\,\mathbf{n} = \frac{\pi}{L_i}\,\mathbf{S}^{-1}\mathbf{I} \qquad (5.30)$$

Recall that the solution of the linear system (5.30) exists only if S is nonsingular, that is, the source vectors s_k are linearly independent (the vectors are not coplanar), so that the inverse matrix S^{-1} exists. It is also assumed that the incident radiant flux of each source has a constant intensity value L_i. In summary, the stereo photometry approach, with at least three images captured with three different source directions, can compute for each pixel (i, j), with (5.30), the surface normal vector n = (n_x, n_y, n_z)^T, estimated with the expression S^{-1}I. Recalling (5.10), the gradient components (p, q) of the surface are then calculated as follows:

$$p = -\frac{n_x}{n_z} \qquad q = -\frac{n_y}{n_z} \qquad (5.31)$$

and the resulting surface normal vector is n = (−p, −q, 1)^T. Finally, we can also calculate the albedo ρ with Eq. (5.30); considering that the normal n is a unit vector, we have

$$\rho = \frac{\pi}{L_i}\left|\mathbf{S}^{-1}\mathbf{I}\right| \qquad\Longrightarrow\qquad \rho = \sqrt{n_x^2 + n_y^2 + n_z^2} \qquad (5.32)$$

If we have more than 3 stereo photometric images, we have an overdetermined system in which the source matrix S has size m × 3 with m > 3. In this case, the system of Eq. (5.28) in matrix terms becomes

$$\underset{m\times 1}{\mathbf{I}} = \frac{1}{\pi}\,\underset{m\times 3}{\mathbf{S}}\cdot\underset{3\times 1}{\mathbf{b}} \qquad (5.33)$$


where, to simplify, b(i, j) = ρ(i, j)n(i, j) indicates the unit normal vector scaled by the albedo ρ(i, j) for each pixel of the image and, similarly, the source direction vectors s_k are scaled (s_k L_i) by the intensity factor of the incident radiance L_i (we assume that the sources have identical radiant intensity). The calculation of the normals with the overdetermined system (5.33) can be done with the least squares method, which finds a solution b that minimizes the squared 2-norm of the residual r:

$$\min_{\mathbf{b}} \|\mathbf{r}\|_2^2 = \min_{\mathbf{b}} \|\mathbf{I} - \mathbf{S}\mathbf{b}\|_2^2$$

By developing and calculating the gradient ∇_b(r²) and setting it to zero,5 we get

$$\mathbf{b} = (\mathbf{S}^T\mathbf{S})^{-1}\mathbf{S}^T\mathbf{I} \qquad (5.34)$$

Equations (5.33) and (5.34) are known as the normal equations, which, when solved, lead to the solution of the least squares problem if the matrix S has rank equal to the number of columns (3 in this case) and the problem is well conditioned.6 Once the system is solved, the normals and the albedo for each pixel are calculated with equations analogous to (5.31) and (5.32), using (b_x, b_y, b_z) instead of (n_x, n_y, n_z), as follows:

$$\rho = \pi\left|(\mathbf{S}^T\mathbf{S})^{-1}\mathbf{S}^T\mathbf{I}\right| \qquad\Longrightarrow\qquad \rho = \sqrt{b_x^2 + b_y^2 + b_z^2} \qquad (5.35)$$

In the case of color images, for a given kth source there are three image irradiance equations for every pixel, one for each RGB color component, characterized by the albedo ρ_c:

$$I_{k,c} = \rho_c\,(\mathbf{s}_{k,c} \cdot \mathbf{n})(i,j) \qquad (5.36)$$

where the subscript c indicates the color component R, G, or B. Consequently, we would have a system of Eq. (5.34) for each color component:

$$\mathbf{b}_c = \rho_c\,\mathbf{n} = (\mathbf{S}^T\mathbf{S})^{-1}\mathbf{S}^T\mathbf{I}_c \qquad (5.37)$$

and once b_c is calculated we obtain, as before, with (5.32) the value of the albedo ρ_c relative to the color component considered. Normally, the surface of the objects is

5 In fact, we have

$$r(\mathbf{b})^2 = \|\mathbf{I} - \mathbf{S}\mathbf{b}\|_2^2 = (\mathbf{I} - \mathbf{S}\mathbf{b})^T(\mathbf{I} - \mathbf{S}\mathbf{b}) = \mathbf{b}^T\mathbf{S}^T\mathbf{S}\mathbf{b} - 2\mathbf{b}^T\mathbf{S}^T\mathbf{I} + \mathbf{I}^T\mathbf{I}$$

and the gradient ∇_b(r²) = 2S^TSb − 2S^TI, which, set equal to zero and solved with respect to b, produces (5.34) as the minimum of the residual.

6 We recall that a problem is well conditioned if small perturbations on the measurements (the images in this context) generate small variations of the same order on the quantities to be calculated (the orientations of the normals in this context).


reconstructed using gray-level images, and it is rarely of interest to work with RGB images for a color reconstruction of the surface.
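As a concrete illustration of Eqs. (5.30)–(5.35), the following Python sketch estimates, pixel by pixel, the albedo-scaled normal b = ρn by least squares from m ≥ 3 images and known source directions, and then derives the albedo and the gradient components (p, q) of Eq. (5.31). It is only a sketch under the Lambertian assumptions stated above; the constant factors (π, L_i) are assumed absorbed in the data scaling, and the function name and the handling of degenerate pixels are choices made here, not prescriptions from the text.

```python
import numpy as np

def photometric_stereo(images, sources):
    """Calibrated photometric stereo under the Lambertian model.

    images : array (m, H, W) of gray-level intensities.
    sources: array (m, 3) of light direction vectors (one row per image).
    Returns albedo (H, W), unit normals (H, W, 3) and the gradients p, q.
    """
    m, H, W = images.shape
    I = images.reshape(m, -1)                        # m x (H*W), one column per pixel
    # Least squares solution of S b = I for every pixel at once, Eq. (5.34).
    B, *_ = np.linalg.lstsq(sources, I, rcond=None)  # 3 x (H*W)
    b = B.T.reshape(H, W, 3)                         # albedo-scaled normals
    albedo = np.linalg.norm(b, axis=2)               # Eq. (5.35), up to a constant factor
    n = b / np.maximum(albedo[..., None], 1e-8)      # unit normals (guard against zeros)
    p = -n[..., 0] / np.clip(n[..., 2], 1e-8, None)  # Eq. (5.31)
    q = -n[..., 1] / np.clip(n[..., 2], 1e-8, None)
    return albedo, n, p, q
```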

5.4.2 Uncalibrated Stereo Photometry

Several research activities have demonstrated the use of stereo photometry to extract the map of the normals of objects even in the absence of knowledge of the lighting conditions (without calibration of the light sources, that is, without knowing their orientation s_k and intensity L_i) and without knowing the real reflectance model of the objects themselves. Several results based on the Lambertian reflectance model are reported in the literature. The reconstruction of the observed surface is feasible, albeit with residual ambiguity, starting from the map of normals. It is also assumed that images are acquired with orthographic projection and with a linear sensor response. In this situation, uncalibrated stereo photometry implies that in the system of Eq. (5.33) the unknowns are the source direction matrix S and the vector of normals given by

$$\mathbf{b}(i,j) \equiv \rho(i,j)\,\mathbf{n}(i,j)$$

from which, we recall, it is possible to derive the albedo as

$$\rho(i,j) = \|\mathbf{b}(i,j)\|_2$$

A solution to the problem is that of matrix factorization [7], which consists in the decomposition into singular values, SVD (see Sect. 2.11 Vol. II). In this formulation, it is useful to consider the matrix I = (I_1, I_2, ..., I_m) of the m images, of size m × k (with m ≥ 3), which organizes the pixels of image I_i, i = 1, ..., m, in the ith row of the global image matrix I, each row of size k equal to the total number of pixels per image (pixels organized in lexicographic order). Therefore, the matrix Eq. (5.33), arranged for the factoring of the intensity matrix, is reformulated as follows:

$$\underset{m\times k}{\mathbf{I}} = \underset{m\times 3}{\mathbf{S}}\cdot\underset{3\times k}{\mathbf{B}} \qquad (5.38)$$

where S is the unknown matrix of the sources, whose direction for the ith image is reported in the ith row (s_{ix}, s_{iy}, s_{iz}), while the matrix of normals B organizes by column the components (b_{jx}, b_{jy}, b_{jz}) for the jth pixel. With this formalism, the irradiance equation under Lambertian conditions for the jth pixel of the ith image results in the following:

$$\underset{1\times 1}{I_{i,j}} = \underset{1\times 3}{\mathbf{S}_i}\cdot\underset{3\times 1}{\mathbf{B}_j} \qquad (5.39)$$


For any matrix of size m × k there always exists its decomposition into singular values, according to the SVD theorem, given by

$$\underset{m\times k}{\mathbf{I}} = \underset{m\times m}{\mathbf{U}}\cdot\underset{m\times k}{\boldsymbol{\Sigma}}\cdot\underset{k\times k}{\mathbf{V}^T} \qquad (5.40)$$

where U is an orthogonal unitary matrix (U^TU = I, the identity matrix) whose columns are the orthonormal eigenvectors of II^T, V is an orthogonal unitary matrix whose columns are the orthonormal eigenvectors of I^TI, and Σ is the diagonal matrix with positive real diagonal elements (σ_1 ≥ σ_2 ≥ ··· ≥ σ_t), with t = min(m, k), known as the singular values of I. If I has rank k, all singular values are positive; if it has rank t < k, the singular values from the (t+1)th onward are null. With the SVD method, it is possible to consider only the first three singular values of the image matrix I, obtaining an approximate decomposition of rank 3 of this matrix by taking only the first 3 columns of U, the first 3 rows of V^T, and the first 3 × 3 submatrix of Σ. The rank-3 approximation Î of the image matrix results in the following:

$$\underset{m\times k}{\hat{\mathbf{I}}} = \underset{m\times 3}{\mathbf{U}'}\cdot\underset{3\times 3}{\boldsymbol{\Sigma}'}\cdot\underset{3\times k}{\mathbf{V}'^T} \qquad (5.41)$$

where the submatrices are renamed with the addition of the prime symbol to distinguish them from the originals. With (5.41), we get the best rank-3 approximation of the original image matrix expressed by (5.40), which uses all the singular values of the complete decomposition. In an ideal context, with noise-free images, the SVD can represent the original image matrix well with a few singular values. With the SVD it is possible to evaluate, from the analysis of the singular values, the acceptability of the approximation based on the first three singular values (in the presence of noise the rank of Î is greater than three). Using the first three singular values of Σ, and the corresponding columns (singular vectors) of U and V from (5.41), we can define the pseudo-matrices Ŝ and B̂, respectively, of the sources and of the normals, as follows:

$$\underset{m\times 3}{\hat{\mathbf{S}}} = \mathbf{U}'\cdot\sqrt{\boldsymbol{\Sigma}'} \qquad \underset{3\times k}{\hat{\mathbf{B}}} = \sqrt{\boldsymbol{\Sigma}'}\cdot\mathbf{V}'^T \qquad (5.42)$$

and (5.38) can be rewritten as

$$\hat{\mathbf{I}} = \hat{\mathbf{S}}\hat{\mathbf{B}} \qquad (5.43)$$

The decomposition obtained with (5.43) is not unique. In fact, if A is an arbitrary invertible matrix of size 3 × 3, the matrices ŜA^{-1} and AB̂ are still a valid decomposition of the approximate image matrix Î, such that

$$\forall \mathbf{A} \in GL(3): \qquad \hat{\mathbf{S}}\mathbf{A}^{-1}\mathbf{A}\hat{\mathbf{B}} = \hat{\mathbf{S}}\underbrace{(\mathbf{A}^{-1}\mathbf{A})}_{\text{identity}}\hat{\mathbf{B}} = (\hat{\mathbf{S}}\mathbf{A}^{-1})(\mathbf{A}\hat{\mathbf{B}}) = \bar{\mathbf{S}}\bar{\mathbf{B}} \qquad (5.44)$$


where GL(3) is the group of all invertible matrices of size 3 × 3. In essence, with (5.44) we establish an equivalence relation in the space of the solutions T = IR^{m×3} × IR^{3×k}, where IR^{m×3} represents the space of all possible matrices Ŝ of source directions and IR^{3×k} the space of all possible matrices of scaled normals. The ambiguity generated by the factorization Eq. (5.44) of Î in Ŝ and B̂ can be managed by considering the matrix A associated with a linear transformation such that

$$\bar{\mathbf{S}} = \hat{\mathbf{S}}\mathbf{A}^{-1} \qquad \bar{\mathbf{B}} = \mathbf{A}\hat{\mathbf{B}} \qquad (5.45)$$

Equation (5.45) tells us that the two solutions (Ŝ, B̂) ∈ T and (S̄, B̄) ∈ T are equivalent if there exists a matrix A ∈ GL(3). With the SVD, through Eq. (5.42), the matrix of the stereo photometric images I selects an equivalence class T(I). This class contains the matrix S of the true source directions and the matrix B of the true scaled normals, but it is not possible to distinguish (S, B) from other members of T(I) based only on the content of the images assembled in the image matrix I. The matrix A can be determined with at least 6 pixels with the same or known reflectance, or by considering that the intensity of at least six sources is constant or known [7]. It is shown [8] that by imposing the integrability constraint7 the ambiguity can be reduced by introducing the Generalized Bas-Relief (GBR) transformations, which satisfy the integrability constraint. A GBR transforms a surface z(x, y) into a new surface ẑ(x, y) combining a flattening operation (or a scale change) along the z-axis with the addition of a plane:

$$\hat{z}(x,y) = \lambda z(x,y) + \mu x + \nu y \qquad (5.46)$$

where λ ≠ 0 and μ, ν ∈ IR are the parameters that represent the group of the GBR transformations. The matrix A which solves Eq. (5.45) is given by the matrix G associated with the group of GBR transformations, in the form:

$$\mathbf{A} = \mathbf{G} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ \mu & \nu & \lambda \end{bmatrix} \qquad (5.47)$$

7 The constraint of integrability requires that the normals estimated by stereo photometry correspond to a curved surface. Recall that from the orientation map the surface z(x, y) can be reconstructed by integrating the gradient information {p(x, y), q(x, y)}, i.e., the partial derivatives of z(x, y), along any path between two points in the image plane. For a curved surface, the constraint of integrability [9] means that the chosen path does not matter in obtaining an approximation of ẑ(x, y). Formally, this means

$$\frac{\partial^2 \hat{z}}{\partial x\,\partial y} = \frac{\partial^2 \hat{z}}{\partial y\,\partial x}$$

having already estimated the normals b̂(x, y) = (b̂_x, b̂_y, b̂_z)^T with first-order partial derivatives

$$\frac{\partial \hat{z}}{\partial x} = \frac{\hat{b}_x(x,y)}{\hat{b}_z(x,y)}; \qquad \frac{\partial \hat{z}}{\partial y} = \frac{\hat{b}_y(x,y)}{\hat{b}_z(x,y)}.$$


The GBR transformation defines the pseudo-orientations of the sources S̄ and the pseudo-normals B̄ according to Eq. (5.45), replacing the matrix A with the matrix G. Solving then with respect to Ŝ and B̂, we have

$$\hat{\mathbf{S}} = \bar{\mathbf{S}}\mathbf{G} \qquad \hat{\mathbf{B}} = \mathbf{G}^{-1}\bar{\mathbf{B}} = \frac{1}{\lambda}\begin{bmatrix} \lambda & 0 & 0\\ 0 & \lambda & 0\\ -\mu & -\nu & 1 \end{bmatrix}\bar{\mathbf{B}} \qquad (5.48)$$

Thus, the problem is reduced from the 9 unknowns of the matrix A of size 3 × 3 to the 3 parameters of the matrix G associated with the GBR transformation. Furthermore, the GBR transformation has the unique property of not altering the shading configuration of a surface z(x, y) (with Lambertian reflectance) illuminated by any source s with respect to that of the surface ẑ(x, y) obtained from the GBR transformation with G and illuminated by the source whose direction is given by ŝ = G^{-T}s. In other words, when the orientations of both surface and source are transformed with the matrix G, the shading configurations are identical in the image of the original surface and in that of the transformed surface.
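A minimal sketch of the factorization step of uncalibrated stereo photometry described above (rank-3 SVD of the image matrix, Eqs. (5.40)–(5.43)) could look as follows in Python; the returned pseudo-sources and pseudo-normals are defined only up to an unknown invertible 3 × 3 matrix A (and, after imposing integrability, up to a GBR transformation G). Function and variable names are illustrative choices, not taken from the text.

```python
import numpy as np

def uncalibrated_factorization(images):
    """Rank-3 factorization of the image matrix for uncalibrated photometric stereo.

    images: array (m, H, W), m >= 3 images under unknown distant lights.
    Returns S_hat (m, 3) pseudo-source directions and B_hat (3, H*W) pseudo-normals,
    valid up to an unknown A in GL(3): I ~ (S_hat A^-1)(A B_hat).
    """
    m, H, W = images.shape
    I = images.reshape(m, -1)                 # m x k, k = H*W (lexicographic pixel order)
    U, s, Vt = np.linalg.svd(I, full_matrices=False)
    # Keep only the three largest singular values (Eq. 5.41) and split them
    # symmetrically between the two factors (Eq. 5.42).
    sqrt_S3 = np.diag(np.sqrt(s[:3]))
    S_hat = U[:, :3] @ sqrt_S3                # m x 3 pseudo-sources
    B_hat = sqrt_S3 @ Vt[:3, :]               # 3 x k pseudo-normals (albedo-scaled)
    return S_hat, B_hat
```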

5.4.3 Stereo Photometry with Calibration Sphere

The reconstruction of the orientation of the visible surface can be carried out experimentally using a look-up table (LUT), which associates the orientation of the normal with a triad of luminous intensities (I_1, I_2, I_3) measured after appropriately calibrating the acquisition system. Consider as stereo photometry system the one shown in Fig. 5.6, which provides the three sources arranged on the base of a cone, 120° apart, a single acquisition station located at the center of the cone base, and the work plane placed at the apex of the inverted cone, where the objects to be acquired are located. Initially, the system is calibrated to account for the reflectivity of the material of the objects and to acquire all the possible intensity triads to be stored in the LUT, each associated with a given orientation of the surface normal. To be in Lambertian reflectance conditions, objects made of opaque material are chosen, for example, objects made of PVC plastic. Therefore, the calibration of the system is carried out by means of a sphere of PVC material (analogous to the material of the objects), for which the three images I_1, I_2, I_3 provided by stereo photometry are acquired. The images are assumed to be registered (aligned with each other), that is, during the three acquisitions, object and observer are stationary while the corresponding sources are turned on in succession. The calibration process associates, with each pixel (i, j) of the image, a set of luminous intensity values (I_1, I_2, I_3) (the three measurements of stereo photometry), as measured by the camera (single observation point) in the three successive acquisitions while a single source is operative, and the orientation value (p, q) of the calibration surface, derived from the knowledge of the geometric description of the sphere.


Fig. 5.8 Stereo photometry: calibration of the acquisition system using a sphere of material with identical reflectance properties of the objects

In essence, the calibration sphere is chosen as the only solid whose visible surface exhibits all the possible orientations of a surface element in the space of the visible hemisphere. In Fig. 5.8, two generic surface elements centered at the points R and T, visible by the camera and projected in the image plane at the pixels located at (i_R, j_R) and (i_T, j_T), are indicated on the calibration sphere. The orientation of the normals corresponding to these surface elements of the sphere is given by n_R(p_R, q_R) and n_T(p_T, q_T), calculated analytically knowing the parametric equation of the sphere, assumed of unit radius. Once the projections of these points of the sphere in the image plane are known, after the acquisition of the three stereo photometry images {I_1, I_2, I_3}_sphere, it is possible to associate with the surface elements considered, R and T, their normal orientations (knowing the geometry of the sphere) and the triad of luminous intensity measurements determined by the camera. These associations are stored in the LUT of dimensions (3 × 2 × m), using as pointers the triples of luminous intensity measurements, to which the values of the two orientation components (p, q) are made to correspond. The number of triples m depends on the level of discretization of the sphere, i.e., on the resolution of the images. In the example shown in Fig. 5.8, the associations for the surface elements R and T are, respectively, the following:

$$\left\{I_1(i_R, j_R),\, I_2(i_R, j_R),\, I_3(i_R, j_R)\right\} \Longrightarrow \mathbf{n}_R(p_R, q_R)$$
$$\left\{I_1(i_T, j_T),\, I_2(i_T, j_T),\, I_3(i_T, j_T)\right\} \Longrightarrow \mathbf{n}_T(p_T, q_T)$$

Once the system has been calibrated, the type of material with Lambertian reflectance characteristics chosen, the lamps appropriately positioned, and all the associations between orientations (p, q) and measured triples (I_1, I_2, I_3) stored using the calibration sphere, the latter is removed from the acquisition plane and replaced with the objects to be acquired.


Fig. 5.9 Stereo photometric images of a real PVC object with the final orientation map obtained by applying the calibrated stereo photometry approach with the PVC sphere

The three stereo photometry images of the objects are acquired in the same conditions in which those of the calibration sphere were acquired, that is, making sure that the image I_k is acquired keeping only the source S_k on. For each pixel (i, j) of the image, the triad (I_1, I_2, I_3) is used as a pointer into the LUT, this time used not to store but to retrieve the orientation (p, q) to be associated with the surface element corresponding to the pixel located at (i, j). Figure 5.9 shows the results of the stereo photometry process, which builds the orientation map (needle map) of an object of the same PVC material as the calibration sphere. Figure 5.10 instead shows another experiment based on stereo photometry to detect the best-placed object in a stack of similar objects (identical to the one in the previous example) for automatic gripping (known as the bin picking problem) in the context of robotic cells. In this case, once the orientation map of the stack has been obtained, it is segmented to isolate the object best placed for gripping and to estimate the attitude of the isolated object (normally, the one at the top of the stack).
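The calibration and lookup procedure just described can be sketched in Python as follows. This is only an illustration of the idea: the quantization of the intensity triads into LUT keys, the dictionary-based table, and the default orientation for missing triads are implementation choices made here, not details given in the text.

```python
import numpy as np

def build_lut(sphere_images, sphere_pq, valid_mask, levels=64):
    """Build the LUT from the three calibration-sphere images.

    sphere_images: array (3, H, W) of intensities; sphere_pq: array (H, W, 2) of known (p, q);
    valid_mask: boolean (H, W) selecting pixels that lie on the sphere.
    Keys are quantized intensity triads, values the associated orientation (p, q).
    """
    lut = {}
    keys = np.floor(np.clip(sphere_images, 0, 1) * (levels - 1)).astype(int)
    for i, j in zip(*np.nonzero(valid_mask)):
        lut[tuple(keys[:, i, j])] = sphere_pq[i, j]
    return lut

def lookup_orientation(object_images, lut, levels=64):
    """Assign an orientation (p, q) to every pixel of the object images via the LUT."""
    _, H, W = object_images.shape
    keys = np.floor(np.clip(object_images, 0, 1) * (levels - 1)).astype(int)
    pq = np.zeros((H, W, 2))
    for i in range(H):
        for j in range(W):
            pq[i, j] = lut.get(tuple(keys[:, i, j]), (0.0, 0.0))  # default: flat patch
    return pq
```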

5.4.3.1 Calculation of the Direction of the Sources with Reverse Stereo Photometry

The stereo photometric images of the calibration sphere, with Lambertian reflectance and uniform albedo, can be used to derive the direction of the three sources s_i by applying stereo photometry in reverse with Eq. (5.34). In this case, the unknowns are the source direction vectors s_i, while the scaled normal directions b are known, from the geometry of the sphere and the 3 stereo photometric images of it. The sphere normals are calculated for different pixels in the image plane included


Fig. 5.10 Stereo photometric images of a stack of objects identical to that of the previous example. In this case, with the same calibration data, the calculated orientation map is used not for the purpose of reconstructing the surface but to determine the attitude of the object higher up in the stack

in the iso-brightness curves (see Fig. 5.11). Once the direction of the sources is estimated, the value of the albedo can be estimated according to (5.32).

5.4.4 Limitations of Stereo Photometry

The traditional stereo photometry approach is strongly conditioned by the assumption of the Lambertian reflectance model. In nature, it is difficult to find ideally Lambertian materials; real materials normally have a specular reflectance component. Another aspect not considered by Eq. (5.28) of stereo photometry is that of the shadows present in the images. One way to mitigate these problems is to use a number M > 3 of stereo photometric images when possible, making the system overdetermined. This involves a greater computational load, especially for large images, considering that the calculation of the albedo and the normals must be done for each pixel. Variants of the traditional stereo photometry approach have been proposed using multiple sources to handle non-Lambertian surfaces. In [10,11], we find a method based on the assumption that highly luminous spots due to specular reflectance do not overlap in the photometric stereo images. In the literature, several works on uncalibrated stereo photometry are reported which consider other optimization models to estimate the direction of sources and normals in the context of the Lambertian and


Fig. 5.11 Stereo photometric image of the calibration sphere with iso-intensity curves

non-Lambertian reflectance model, considering the presence in stereo photometric images of the effects of the specular reflectance model and the problem of shadows including those generated by mutual occlusion between observed objects.

5.4.5 Surface Reconstruction from the Orientation Map

Once the map of the orientations (unit normals) of the surface has been obtained with stereo photometry for each pixel of the image, it is possible to reconstruct the surface z = Z(x, y), that is, to determine the depth map through a data integration algorithm. In essence, this requires a transition from the gradient space (p, q) to the depth map to recover the surface. The problem of surface reconstruction, starting from the discrete gradient space with noisy data (the surface continuity constraint imposed by stereo photometry is often violated), is an ill-posed problem. In fact, the estimated surface normals do not faithfully reproduce the local curvature (slope) of the surface itself. It can happen that several surfaces with different height values have the same gradients. A check on the acceptability of the estimated normals can be done with the integrability test (see Note 7), which evaluates at each point the value $\frac{\partial p}{\partial y} - \frac{\partial q}{\partial x}$, which should theoretically be zero, although small values are acceptable. Once this test has been passed, the reconstruction of the surface is possible up to an additive constant of the heights and within an adequate depth error. One approach to constructing the surface is to consider the gradient information (p(x, y), q(x, y)), which gives the height increments between adjacent points of the surface in the directions of the x- and y-axes. Therefore, the surface is constructed by


adding these increments starting from a point and following a generic path. In the continuous case, by imposing the integrability constraint, integration along different paths would lead to the same value of the estimated height for a generic point (x, y), starting from the same initial point (x_0, y_0) with the same arbitrary height Z_0. This reconstruction approach is called the local integration method. A global integration method is based on a cost function C that minimizes the quadratic error between the ideal gradient (Z_x, Z_y) and the estimated one (p, q):

$$C = \iint_\Omega \left(|Z_x - p|^2 + |Z_y - q|^2\right) dx\,dy \qquad (5.49)$$

where Ω represents the domain of all the points (x, y) of the map of the normals N(x, y), while Z_x and Z_y are the partial derivatives of the ideal surface Z(x, y) with respect to the respective axes x and y. This function is invariant when a constant value is added to the height function of the surface Z(x, y). The optimization problem posed by (5.49) can be solved with the variational approach, with the direct discretization method, or with the expansion methods. The variational approach [12] uses the Euler–Lagrange equation as the necessary condition to reach a minimum. The numerical solution to minimize (5.49) is realized with the conversion process from continuous to discrete. The expansion methods instead are set up by expressing the function Z(x, y) as a linear combination of a set of basis functions.

where  represents the domain of all the points (x, y) of the map of the normals N(x, y), while Z x and Z y are the partial derivatives of the ideal surface Z (x, y) with respect to the respective axes x and y. This function is invariant when a constant value is added to the function of the height’s surface Z (x, y). The optimization problem posed by (5.49) can be solved with the variational approach, with the direct discretization method or with the expansion methods. The variational approach [12] uses the Euler–Lagrange equation as the necessary condition to reach a minimum. The numerical solution to minimize (5.49) is realized with the conversion process from continuous to discrete. The expansion methods instead are set by expressing the function Z (x, y) as a linear combination of a set of basic functions.

5.4.5.1 Local Method Mediating Gradients This local surface reconstruction method is based on the average of the gradients between adjacent normals. From the normal map, we consider a 4-point grid and indicate the normals and the respective surface gradients as follows: nx,y = [ p(x, y), q(x, y)]

nx+1,y = [ p(x + 1, y), q(x + 1, y)]

nx,y+1 = [ p(x, y + 1), q(x, y + 1)]nx+1,y+1 = [ p(x + 1, y + 1), q(x + 1, y + 1)] Now let’s consider the normals of the second column of the grid (along the x-axis). The line connecting these points z[x, y +1, Z (x, y +1)] e z[x +1, y +1, Z (x +1, y + 1)] of the surface is approximately perpendicular to the normal average between these two points. It follows that the inner product between the vector (slope) of this line and the average normal vector is zero. This produces the following: 1 Z (x + 1, y + 1) = Z (x, y + 1) + [ p(x, y + 1) + p(x + 1, y + 1)] 2

(5.50)

Similarly, considering the adjacent points of the second row of the grid (along the y-axis), that is, the line connecting the points z[x + 1, y, Z (x + 1, y)] and z[x + 1, y + 1, Z (x + 1, y + 1)] of the surface, we obtain the relation: 1 Z (x + 1, y + 1) = Z (x + 1, y) + [q(x + 1, y) + q(x + 1, y + 1)] 2

(5.51)


Adding the two relations member by member and dividing by 2, we have

$$Z(x+1, y+1) = \frac{1}{2}\left[Z(x, y+1) + Z(x+1, y)\right] + \frac{1}{4}\left[p(x, y+1) + p(x+1, y+1) + q(x+1, y) + q(x+1, y+1)\right] \qquad (5.52)$$

Essentially, (5.52) estimates the value of the height of the surface at the point (x+1, y+1) by adding, to the average of the heights of the diagonal points (x, y+1) and (x+1, y) of the grid considered, the height increments expressed by the gradients averaged in the directions of the x- and y-axes. Now consider a surface that is discretized with a gradient map of N_r × N_c points (consisting of N_r rows and N_c columns). Let Z(1, 1) and Z(N_r, N_c) be the arbitrary initial values of the heights at the extreme points of the gradient map; then a two-scan process of the gradient map can determine the values of the heights along the x- and y-axes, discretizing with the constraint of integrability in terms of forward differences, as follows:

$$Z(x, 1) = Z(x-1, 1) + p(x-1, 1) \qquad Z(1, y) = Z(1, y-1) + q(1, y-1) \qquad (5.53)$$

where x = 2, ..., N_r, y = 2, ..., N_c, and the map is scanned vertically using the local increments defined with Eq. (5.52). The second scan starts from the other end of the gradient map, the point (N_r, N_c), and calculates the values of the heights as follows:

$$Z(x-1, N_c) = Z(x, N_c) - p(x, N_c) \qquad Z(N_r, y-1) = Z(N_r, y) - q(N_r, y) \qquad (5.54)$$

and the map is scanned horizontally with the following recursive equation:

$$Z(x-1, y-1) = \frac{1}{2}\left[Z(x-1, y) + Z(x, y-1)\right] - \frac{1}{4}\left[p(x-1, y) + p(x, y) + q(x, y-1) + q(x, y)\right] \qquad (5.55)$$

The height map thus estimated has values influenced by the choice of the arbitrary initial values. Therefore, it is useful to perform a final step taking the average of the values of the two scans to obtain the final map of the surface heights. Figure 5.12 shows the height map obtained starting from the map of the normals of the visible surface acquired with calibrated stereo photometry.
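A simplified Python sketch of this local integration idea is given below. For brevity it implements only the first scan (Eqs. (5.53) and (5.52)); a second scan from the opposite corner with Eqs. (5.54)–(5.55), followed by averaging of the two results, would complete the procedure. Array indexing starts at 0 instead of 1, and the function name is an arbitrary choice.

```python
import numpy as np

def integrate_forward(p, q, z00=0.0):
    """Forward scan of the local integration method.

    p, q: gradient maps (Nr, Nc); z00: arbitrary height of the starting corner.
    Returns the height map Z estimated from the top-left corner.
    """
    Nr, Nc = p.shape
    Z = np.zeros((Nr, Nc))
    Z[0, 0] = z00
    # First column and first row, Eq. (5.53).
    for x in range(1, Nr):
        Z[x, 0] = Z[x - 1, 0] + p[x - 1, 0]
    for y in range(1, Nc):
        Z[0, y] = Z[0, y - 1] + q[0, y - 1]
    # Interior points, Eq. (5.52): average of the two already-known neighbors
    # plus the averaged gradient increments.
    for x in range(1, Nr):
        for y in range(1, Nc):
            Z[x, y] = 0.5 * (Z[x - 1, y] + Z[x, y - 1]) + 0.25 * (
                p[x - 1, y] + p[x, y] + q[x, y - 1] + q[x, y]
            )
    return Z
```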

5.4.5.2 Local Method Based on Least Squares

Another method of surface reconstruction [13] starts from the normal map and considers the gradient values (p, q) of the surface given by Eq. (5.31), which can be expressed in terms of the partial derivatives (5.10) of the height map Z(x, y) along the x- and y-axes. Such derivatives can be approximated by forward differences as follows:

$$p(x,y) \approx Z(x+1, y) - Z(x, y) \qquad q(x,y) \approx Z(x, y+1) - Z(x, y) \qquad (5.56)$$


Fig. 5.12 Results of the reconstruction of the surface starting from the orientation map obtained from the calibrated stereo photometry

It follows that for each pixel (x, y) of the height map, a system of equations can be defined by combining (5.56) with the Z(x, y) surface derivatives represented with the gradient according to (5.31), obtaining

$$n_z(x,y)\,Z(x+1,y) - n_z(x,y)\,Z(x,y) = -n_x(x,y)$$
$$n_z(x,y)\,Z(x,y+1) - n_z(x,y)\,Z(x,y) = -n_y(x,y) \qquad (5.57)$$

where n(x, y) = (n_x(x, y), n_y(x, y), n_z(x, y)) is the normal vector at the point (x, y) of the normal map in 3D space. If the map includes M pixels, the complete equation system (5.57) consists of 2M equations. To improve the estimate of Z(x, y), for each pixel of the map the system (5.57) can be extended by also considering the adjacent pixels, respectively, the one on the left (x−1, y) and the one above (x, y−1) with respect to the pixel (x, y) being processed. In that case, the previous system extends to 4M equations. The system (5.57) can be solved as an overdetermined linear system. It should be noted that Eqs. (5.57) are valid for points not belonging to the edges of the objects, where the component n_z → 0.
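The overdetermined system (5.57) can be assembled and solved, for example, as a sparse least-squares problem. The sketch below uses SciPy's sparse solver; the assembly of only the 2M basic equations (without the 4M extension) and the anchoring of the mean height to zero are simplifications chosen here for illustration, not details prescribed by the text.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def heights_from_normals(n):
    """Solve Eq. (5.57) in the least-squares sense.

    n: array (H, W, 3) of unit normals. Returns the height map Z (H, W),
    defined up to an additive constant.
    """
    H, W = n.shape[:2]
    idx = np.arange(H * W).reshape(H, W)        # linear index of each unknown Z(x, y)
    A = lil_matrix((2 * H * W, H * W))
    b, rows = [], 0
    for x in range(H):
        for y in range(W):
            nx, ny, nz = n[x, y]
            if x + 1 < H:                        # nz*Z(x+1,y) - nz*Z(x,y) = -nx
                A[rows, idx[x + 1, y]] = nz
                A[rows, idx[x, y]] = -nz
                b.append(-nx)
                rows += 1
            if y + 1 < W:                        # nz*Z(x,y+1) - nz*Z(x,y) = -ny
                A[rows, idx[x, y + 1]] = nz
                A[rows, idx[x, y]] = -nz
                b.append(-ny)
                rows += 1
    A = A[:rows].tocsr()
    z = lsqr(A, np.asarray(b))[0]
    return (z - z.mean()).reshape(H, W)
```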

5.4.5.3 Global Method Based on the Fourier Transform

We have previously introduced the function (5.49) as a global integration method based on a cost function C(Z) that minimizes the quadratic error between the ideal gradient (Z_x, Z_y) and the estimated one (p, q), to obtain the height map Z(x, y) starting from the orientation map (expressed in the gradient space (p, q)) derived by the SfS method. To improve the accuracy of the surface height map, an additional constraint is added to the function to be minimized (5.49), which strengthens the relationship between the height values Z(x, y) to be estimated and the gradient values (p(x, y), q(x, y)).


This additional constraint imposes the equality of the second partial derivatives, Z_xx = p_x and Z_yy = q_y, and the cost function to be minimized becomes

$$C(Z) = \iint_\Omega \left[(|Z_x - p|^2 + |Z_y - q|^2) + \lambda_0(|Z_{xx} - p_x|^2 + |Z_{yy} - q_y|^2)\right] dx\,dy \qquad (5.58)$$

where Ω represents the domain of all the points (x, y) of the map of the normals N(x, y) = (p(x, y), q(x, y)), and λ_0 > 0 controls the balance between the curvature of the surface and the variability of the acquired gradient data. The constraint of integrability still remains p_y = q_x ⇔ Z_xy = Z_yx. An additional constraint can be added with a smoothing (smoothness) term, and the new function becomes

$$C(Z) = \iint_\Omega (|Z_x - p|^2 + |Z_y - q|^2)\,dx\,dy + \lambda_0 \iint_\Omega (|Z_{xx} - p_x|^2 + |Z_{yy} - q_y|^2)\,dx\,dy + \lambda_1 \iint_\Omega (|Z_x|^2 + |Z_y|^2)\,dx\,dy + \lambda_2 \iint_\Omega (|Z_{xx}|^2 + 2|Z_{xy}|^2 + |Z_{yy}|^2)\,dx\,dy \qquad (5.59)$$







where λ_1 and λ_2 are two additional nonnegative parameters that control the smoothing level of the surface and its curvature, respectively. The cost function C(Z) to be minimized, which estimates the unknown surface Z(x, y), is solvable using two algorithms, both based on the Fourier transform. The first algorithm [12] is set up as a minimization problem expressed by the function (5.49) with the constraint of integrability. The proposed method uses the theory of projection8 onto convex sets. In essence, the gradient map of the normals N(x, y) = (p(x, y), q(x, y)) is projected onto the gradient space that can be integrated in the least squares sense, then using the Fourier transform for the optimization in the frequency domain. Consider the surface Z(x, y) represented by the functions φ(x, y, ω) as follows:

$$Z(x,y) = \sum_{\omega\in\Omega} K(\omega)\,\phi(x,y,\omega) \qquad (5.60)$$

where ω is a 2D map of indexes associated with a finite set , and the coefficients K (ω) that minimize the function (5.49) can be expressed as K (ω) =

8 In

Px (ω)K 1 (ω) + Py (ω)K 2 (ω) Px (ω) + Py (ω)

(5.61)

mathematical analysis, the projection theorem, also known as projection theorem in Hilbert spaces, which descends from the convex analysis, is often used in functional analysis, establishing that for every point x in a space of Hilbert H and for each convex set closed C ⊂ H there exists a single value y ∈ C such that the distance  x − y  assumes the minimum value on C. In particular, this is true for any closed subspace M of H . In this case, a necessary and sufficient condition for y is that the vector x − y is orthogonal to M.

5.4 Shape from Shading-SfS Algorithms

445

  where Px (ω) = |φx (x, y, ω)|2 and Py (ω) = |φ y (x, y, ω)|2 . The Fourier derivatives of the basic functions φ can be expressed as follows: φx = jωx φ

where $P_x(\omega) = \iint |\phi_x(x,y,\omega)|^2\,dx\,dy$ and $P_y(\omega) = \iint |\phi_y(x,y,\omega)|^2\,dx\,dy$. The Fourier derivatives of the basis functions φ can be expressed as follows:

$$\phi_x = j\omega_x\,\phi \qquad \phi_y = j\omega_y\,\phi \qquad (5.62)$$

y x (ω) and K 2 (ω) = jω . Expanding the surface Z (x, y) with the Fourier K 1 (ω) = Kjω x y basic functions, then the function (5.49) is minimized with

K (ω) =

jωx K x (ω) − jω y K y (ω) ω2x + ω2y

(5.63)

where K x (ω) and K y (ω) are the Fourier coefficients of the heights of the reconstructed surface. These Fourier coefficients can be calculated from the following relationships: K x (ω) = F { p(x, y)}

K y (ω) = F {q(x, y)}

(5.64)

where F represents the Fourier transform operator applied to gradient maps. The results of the integration and the height map associated with the map of the normals N (x, y) is obtained by applying the inverse Fourier transform given by Z (x, y) = F −1 {K (ω)}

(5.65)

The method described, although based on theoretical foundations by virtue of the projection theorem, the surface reconstruction is sensitive to the noise present in the normal map acquired with stereo photometry. In particular, the reconstruction has errors in the discontinuity areas of the map where the integrability constraint is violated. The second algorithm [14] in addition to the integrability constraint adds with the second derivatives the surface smoothing constraints minimizing the C(Z ) function given by (5.59) using the Discrete Fourier Transform (DFT). The latter, applied to the surface function Z (x, y) is defined by N  c −1 N r −1  1 − j2π u Nxc +v Nyr Z (x, y)e Z(u, v) = √ Nr Nc x=0 y=0

(5.66)

and the inverse transform is given as follows: Nc −1 N  r −1 1  − j2π x Nuc +y Nvr Z(u, v)e Z (x, y) = Nr Nc

(5.67)

u=0 v=0

where √ the transform is calculated for each point on the normal map ((x, y) ∈ ), j = −1 is the imaginary unit, and u and v represent the frequencies in the Fourier domain. We now report the derivatives of the function Z (x, y) in the spatial and

446

5 Shape from Shading

frequency domain by virtue of the properties of the Fourier transform: Z x (x, y) ⇔ juZ(u, v) Z y (x, y) ⇔ jvZ(u, v) Z x x (x, y) ⇔ −u 2 Z(u, v) Z yy (x, y) ⇔ −v2 Z(u, v) Z x y (x, y) ⇔ −uvZ(u, v)

(5.68)

We also consider the Rayleigh theorem9 : 1 Nr Nc





|Z (x, y)|2 =

(x,y)∈

|Z(u, v)|2

(5.69)

(u,v)∈

which establishes the equivalence of two representations (the spatial one and the frequency domain) of the function Z (x, y) from the energetic point of view useful in this case to minimize the energy of the function (5.59). Let P(u, v) and Q(u, v) be the Fourier transform of the gradients p(x, y) and q(x, y), respectively. Applying the Fourier transform to the function (5.59) and considering the energy theorem (5.69), we obtain the following: 

| juZ(u, v) − P(u, v)|2 + | jvZ(u, v) − Q(u, v)|2 (u,v)∈

+ λ0



| − u 2 Z(u, v) − juP(u, v)|2 + | − v2 Z(u, v) − jvQ(u, v)|2

(u,v)∈

+ λ1



| juZ(u, v)|2 + | jvZ(u, v)|2

(u,v)∈

+ λ2



| − u 2 Z(u, v)|2 + 2| − uvZ(u, v)|2 + | − v2 Z(u, v)|2

(u,v)∈

=⇒ minimum

fact, Rayleigh’s theorem is based on Parseval’s theorem. If x1 (t) and x2 (t) are two real signals, X1 (u) and X2 (u) are the relative Fourier transforms, for Parseval’s theorem proves that:  +∞  +∞ x1 (t) · x2∗ (t)dt = X1 (u) · X∗2 (u)du

9 In

−∞

−∞

If x1 (t) = x2 (t) = x(t) then we have the Rayleigh theorem or energy theorem:  +∞  +∞ E= |x(t)|2 dt = |X(u)|2 (u)du −∞

−∞

the asterisk indicates the complex conjugate operator. Often used to calculate the energy of a function (or signal) in the frequency domain.

5.4 Shape from Shading-SfS Algorithms

447

By expanding the previous expressions, we have

\sum_{(u,v)\in\Omega} \left[ u^2\mathcal{Z}\mathcal{Z}^* - ju\mathcal{Z}P^* + ju\mathcal{Z}^*P + PP^* + v^2\mathcal{Z}\mathcal{Z}^* - jv\mathcal{Z}Q^* + jv\mathcal{Z}^*Q + QQ^* \right]
+ \lambda_0 \sum_{(u,v)\in\Omega} \left[ u^4\mathcal{Z}\mathcal{Z}^* - ju^3\mathcal{Z}P^* + ju^3\mathcal{Z}^*P + u^2 PP^* + v^4\mathcal{Z}\mathcal{Z}^* - jv^3\mathcal{Z}Q^* + jv^3\mathcal{Z}^*Q + v^2 QQ^* \right]
+ \lambda_1 \sum_{(u,v)\in\Omega} (u^2 + v^2)\,\mathcal{Z}\mathcal{Z}^*
+ \lambda_2 \sum_{(u,v)\in\Omega} (u^4 + 2u^2v^2 + v^4)\,\mathcal{Z}\mathcal{Z}^*

where the asterisk denotes the complex conjugate operator. By differentiating the latter expression with respect to \mathcal{Z}^* and setting the result to zero, it is possible to impose the necessary condition for a minimum of the function (5.59) as follows:

(u^2\mathcal{Z} + juP + v^2\mathcal{Z} + jvQ) + \lambda_0(u^4\mathcal{Z} + ju^3 P + v^4\mathcal{Z} + jv^3 Q) + \lambda_1(u^2 + v^2)\mathcal{Z} + \lambda_2(u^4 + 2u^2v^2 + v^4)\mathcal{Z} = 0

and reordering this last equation we have

\left[ \lambda_0(u^4 + v^4) + (1 + \lambda_1)(u^2 + v^2) + \lambda_2(u^2 + v^2)^2 \right] \mathcal{Z}(u, v) + j(u + \lambda_0 u^3)P(u, v) + j(v + \lambda_0 v^3)Q(u, v) = 0

Solving the above equation, except for (u, v) = (0, 0), we finally get

\mathcal{Z}(u, v) = \frac{-j(u + \lambda_0 u^3)P(u, v) - j(v + \lambda_0 v^3)Q(u, v)}{\lambda_0(u^4 + v^4) + (1 + \lambda_1)(u^2 + v^2) + \lambda_2(u^2 + v^2)^2}        (5.70)

Therefore, with (5.70), we have arrived at the Fourier transform of the heights of an unknown surface starting from the Fourier transforms P(u, v) and Q(u, v) of the gradient maps p(x, y) and q(x, y) calculated with stereo photometry. The details of the complete algorithm are reported in [15].
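Extending the previous sketch to the regularized closed form of (5.70), a minimal NumPy version could look as follows; the values of the weights λ0, λ1, λ2 are purely illustrative and not those of [15].

import numpy as np

def integrate_gradients_regularized(p, q, lam0=0.5, lam1=0.0, lam2=0.02):
    """Frequency-domain surface reconstruction with integrability and
    smoothness constraints, following the closed form of Eq. (5.70)."""
    rows, cols = p.shape
    u = np.fft.fftfreq(cols) * 2.0 * np.pi
    v = np.fft.fftfreq(rows) * 2.0 * np.pi
    U, V = np.meshgrid(u, v)

    P = np.fft.fft2(p)
    Q = np.fft.fft2(q)

    num = -1j * (U + lam0 * U**3) * P - 1j * (V + lam0 * V**3) * Q
    den = lam0 * (U**4 + V**4) + (1 + lam1) * (U**2 + V**2) + lam2 * (U**2 + V**2)**2
    den[0, 0] = 1.0                      # exclude (u, v) = (0, 0)
    Zf = num / den
    Zf[0, 0] = 0.0                       # height defined up to a constant

    return np.real(np.fft.ifft2(Zf))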

5.5 Shape from Texture

When a scene is observed, the image captures, in addition to the information on the variation of light intensity (shading), the texture information, when present. With Shape From Texture—SFT, we mean the vision paradigm that analyzes the texture information present in the image and produces a qualitative or quantitative reconstruction of the 3D surface of the observed objects.


Fig. 5.13 Images of objects with a textured surface

For the purpose of surface reconstruction, we are interested in a particular kind of texture information: micro- and macro-repetitive structures (texture primitives) with strong geometric and radiometric variations present on the surface of objects. The task of SFT algorithms consists precisely in the automatic search, in the image, for texture primitives, normally characterized by shape (for example, ellipses, rectangles, circles, etc.), size, density (number of primitives present in an area of the image), and orientation. The goal is to reconstruct the 3D surface, or calculate its orientation with respect to the observer, by analyzing the characteristic information of the texture primitives. This is in accordance with Gibson's theory of human perception, which suggests that the perception of scene surfaces is used to extract depth information and the 3D perception of objects. Gibson emphasizes that the presence of regular texture in the environment is fundamental for human perception. Figure 5.13 shows the images of a flat and a cylindrical surface, with the same texture primitive, in the form of replicated disks, present in both images. It is observed that, in relation to the configuration of the vision system (position of the camera, source, and attitude of the visible surface), the texture primitives are projected in the acquired image undergoing a systematic distortion (due to the perspective projection), which also depends on the attitude of the visible surface with respect to the observer. From the analysis of these detected geometric distortions of the primitives, it is possible to evaluate the structure of the observed surface or perform the 3D reconstruction of the surface itself. From Fig. 5.13 we observe how, in the case of regular primitives (in the example, disks and ellipses), their geometric variations (shape and size) in the image depend on their distance from the observer and on the orientation of the surface containing them. Gibson claims that invariant scene information is captured in the 2D image. An example of invariant information is given by the ratio between the horizontal and vertical dimension of a replicated primitive, which remains constant even though, as the primitives move away from the observer, they become smaller in size until they disappear at the horizon. The goal is to determine this invariant information from the acquired 2D image. Let us now see which parameters must be considered to characterize the texture primitives present in the images. These parameters are linked to the following factors:


Perspective projection. It introduces distortions to the geometry of the primitives by changing their height and width. In particular, as the texture primitives move away from the observer, they appear smaller and smaller in the image (the well-known effect of railway tracks appearing to converge at the horizon). In the hypothesis of disk-shaped primitives, these appear more and more as ellipses as they move away from the observer.

Surface inclination. Typically, when observing a flat surface inclined with respect to the observer, the texture primitives it contains appear foreshortened to the observer, that is, compressed in the direction of inclination. For example, a circle that is not parallel to the image plane is seen foreshortened, that is, it is projected as an ellipse in the image plane.

Any method of Shape From Texture must therefore evaluate the geometric parameters of the texture primitives characterized by these two distortions, which are essential for the reconstruction of the surface and the calculation of its structure. The orientation of a plane must be estimated starting from the knowledge of the geometry of the texture, from the possibility of extracting the primitives without ambiguity, and by appropriately estimating the invariant parameters of the geometry of the primitives, such as the ratio between horizontal and vertical dimension, variations of areas, etc. In particular, by extracting all the primitives present, it is possible to evaluate invariant parameters such as the texture gradient, which indicates how rapidly the density of the primitives changes with the distance from the observer. In other words, the texture gradient in the image provides a continuous metric of the scene, obtained by analyzing the geometry of the primitives, which appear smaller and smaller as they move away from the observer. The information measured with the texture gradient allows humans to perceive the orientation of a flat surface, the curvature of a surface, and depth. Figure 5.14 shows some images illustrating how the texture gradient information gives the perception of the depth of the primitives on a flat surface receding from the observer, and how the local appearance of the visible surface changes with the change of the texture gradient. Other information considered is the perspective gradient and the compression gradient, defined, respectively, by the change in the width and height of the projections of the texture primitives in the image plane.

Fig. 5.14 Texture gradient: depth perception and surface orientation


Fig. 5.15 Geometry of the projection model between image plane and local orientation of the defined surface in terms of angle of inclination σ and rotation τ of the normal n to the surface projected in the image plane


As the distance between the observer and the points of the visible surface increases, the perspective and compression gradients decrease. This perspective and compression gradient information has been widely used in computer graphics to give a good perception of the 3D surface observed on a monitor or 2D screen. In the context of Shape From Texture, it is usual to define the structure of the flat surface to be reconstructed, with respect to the observer, through the slant angle σ, which indicates the angle between the normal vector of the flat surface and the z-axis (coinciding with the optical axis), and through the tilt angle τ, which indicates the angle between the X-axis and the projection, in the image plane, of the normal vector n (see Fig. 5.15). The figure shows a configuration in which the slant angle is such that the textured flat surface is inclined with respect to the observer so that its upper part is further away, while the tilt angle is zero and, consequently, all the texture primitives arranged horizontally are at the same distance from the observer. A general algorithm of Shape From Texture includes the following essential steps (a minimal sketch follows the list):
1. Define the texture primitives to be considered for the given application (lines, disks, ellipses, rectangles, curved lines, etc.).
2. Choose the invariant parameters (texture, perspective, and compression gradients) appropriate for the texture primitives defined in step 1.
3. Use the invariant parameters of step 2 to calculate the attitude of the textured surface.
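As an illustration of these three steps, the following is a minimal, hypothetical sketch in Python (assuming OpenCV 4 and NumPy) for the simple case of disk-shaped primitives on a planar surface viewed under roughly orthographic projection; the function name, threshold values, and the use of the ellipse aspect ratio as invariant parameter are illustrative choices, not the book's algorithm.

import cv2
import numpy as np

def estimate_plane_attitude(gray):
    """Sketch: assume the scene plane carries circular primitives (disks).
    Each disk projects approximately as an ellipse; minor/major ~ cos(slant)
    and the minor-axis direction roughly indicates the tilt."""
    # Step 1: extract candidate primitives (dark blobs on a light background).
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    slants, tilts = [], []
    for c in contours:
        if len(c) < 5:                      # fitEllipse needs at least 5 points
            continue
        (cx, cy), (d1, d2), angle = cv2.fitEllipse(c)
        major, minor = max(d1, d2), min(d1, d2)
        if major < 10:                      # discard tiny, noisy blobs
            continue
        # Step 2: invariant parameter (aspect ratio) of each primitive.
        slants.append(np.degrees(np.arccos(np.clip(minor / major, 0.0, 1.0))))
        # Ellipse orientation as a rough tilt estimate (OpenCV angle convention).
        tilts.append(angle % 180.0)

    # Step 3: combine the per-primitive estimates into a single plane attitude.
    return np.median(slants), np.median(tilts)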


Fig. 5.16 Illustration of the triangulation geometry between projector, camera, and 3D object


5.6 Shape from Structured Light

A depth map can be obtained with a range imaging system¹⁰, where the object to be reconstructed is illuminated by so-called structured lighting, whose projected geometric structures have known geometry. In essence, recalling the binocular vision system, one camera is replaced by a projector of luminous patterns, and the correspondence problem is solved (in a simpler way) by searching for the (known) patterns in the camera that captures the scene with the superimposed light patterns. Figure 5.16 shows the functional scheme of a range acquisition device based on structured light. The scene is illuminated by a projector (for example, based on low-power lasers) with known patterns of light (structured light); projector and observer (camera) are separated by a distance L, and the distance measurement (range) can be calculated from a single image (the scene with superimposed light patterns) by triangulation, in a similar way to the binocular stereo system. Normally the scene can be illuminated by a luminous spot, by a thin lamina of light (a vertical light plane perpendicular to the scene), or with more complex luminous patterns (for example, a rectangular or square luminous grid, binary or gray-level luminous strips; Microsoft's Kinect is a low-cost device that projects a scattered pattern in the infrared with a laser scanner and uses an infrared-sensitive camera). The relation between the coordinates (X, Y, Z) of a point P of the scene and those (x, y) of its projection in the image plane is linked to the calibration parameters of the capture system, such as the focal length f of the camera's optical system, the separation distance L between projector and camera, the angle of inclination α of the projector with respect to the X-axis, and the projection angle β of the object point P illuminated by the light spot (see Fig. 5.16).

¹⁰ Indicates a set of techniques used to produce a 2D image in which the distance of scene points from a specific point, normally associated with a particular sensing device, is computed. The pixels of the resulting image, known as a depth image, carry the information from which to extract the distances between points of the object and the sensing device. If the sensor used to produce the depth image is correctly calibrated, the pixel values are used to estimate the distance information as in a binocular stereo device.


Fig. 5.17 Illustration in the 3D extension of triangulation geometry between projector, camera, and 3D object


In the hypothesis of the 2D projection of a single light spot, this relation is determined to calculate the position of P by triangulation, considering the triangle OPQ and applying the law of sines:

\frac{d}{\sin\alpha} = \frac{L}{\sin\gamma}

from which follows:

d = \frac{L \cdot \sin\alpha}{\sin\gamma} = \frac{L \cdot \sin\alpha}{\sin[\pi - (\alpha + \beta)]} = \frac{L \cdot \sin\alpha}{\sin(\alpha + \beta)}        (5.71)

The angle β (given by β = arctan(f/x)) is determined by the projection geometry of the point P in the image plane, located at p(x, y), considering the focal length f of the optical system and only the horizontal coordinate x. Once the angle β is determined, and the parameters L and α are known from the system configuration, the distance d is calculated with Eq. (5.71). Considering the triangle OPS, the polar coordinates (d, β) of the point P in the plane (X, Z) are converted into Cartesian coordinates (X_P, Z_P) as follows¹¹:

X_P = d \cdot \cos\beta \qquad Z_P = h = d \cdot \sin\beta        (5.72)

The extension to the 3D projection of P is immediate considering the pinhole model, so that from the similarity of the triangles generated by the projection of P in the image plane we have

\frac{x}{X} = \frac{y}{Y} = \frac{f}{Z}        (5.73)

11 Obtained according to the trigonometric formulas of the complementary angles (their sum is a right angle) where in this case we have the complementary angle ( π2 − β).


Considering the right triangle with base (L − X) on the baseline OQ (see Fig. 5.17), we get

\tan\alpha = \frac{Z}{L - X}        (5.74)

From Eqs. (5.73) and (5.74) we can derive:

Z = \frac{fX}{x} = (L - X) \cdot \tan\alpha \quad \Longrightarrow \quad X\left(\frac{f}{x} + \tan\alpha\right) = L \cdot \tan\alpha        (5.75)

Therefore, considering the equality of the ratios in (5.73) and the last expression of (5.75), we get the 3D coordinates of P given by

[X\ \ Y\ \ Z] = \frac{L \cdot \tan\alpha}{f + x \cdot \tan\alpha}\,[x\ \ y\ \ f]        (5.76)

It should be noted that the resolution of the depth measurement Z given by (5.76) is related to the accuracy with which α is measured and the coordinates (x, y) determined for each point P of the scene (illuminated) projected in the image plane. It is also observed that, to calculate the distance of P, the angle γ was not considered (see Fig. 5.17). This depends on the fact that the projected structured light is a vertical light plane (not a ray of light) perpendicular to the X Z plane and forms an angle α with the X -axis. To calculate the various depth points, it is necessary to project the light spot in different areas of the scene to obtain a 2D depth map by applying (5.76) for each point. This technique using a single mobile light spot (varying α) is very slow and inadequate for dynamic scenes. Normally, systems with structured light are used, consisting of a vertical light lamina (light plane) that scans the scene by tilting that lamina with respect to the Y -axis as shown in Fig. 5.18. In this case, the projection angle of the  plane of laser light gradually changed to capture the entire scene in amplitude. As before,


Fig. 5.18 Illustration of the triangulation geometry between laser light plane and outgoing ray from the optical center of the calibrated camera that intersects a point of the illuminated 3D object in P



the camera-projector system is calibrated geometrically in order to recover the depth map from the captured images, where in each image the projection of the light lamina is shifted and appears as a luminous silhouette depending on the shape and orientation of the objects in the scene relative to the camera. Essentially, at the intersection of the vertical light lamina with the surface of the objects, the camera sees the light lamina as a broken curve with various orientations. Points of the 3D objects are determined in the image plane, and their 3D reconstruction is obtained by calculating the intersection between the light planes (whose spatial position is known) and the ray that starts at the center of the camera and passes through the corresponding pixel p of the image (see Fig. 5.18). Scanning of the complete scene can also be done by rotating the objects and leaving the light plane source fixed. In 3D space the light plane is expressed by the equation of the plane in the form AX + BY + CZ + D = 0. From the equalities of the ratios (5.73) and considering the light plane equation, we can derive the fundamental equations of the pinhole projection model and calculate the 3D coordinates of the points of the scene illuminated by the light plane, as follows:

X = \frac{Zx}{f} \qquad Y = \frac{Zy}{f} \qquad Z = \frac{-D\,f}{Ax + By + Cf}        (5.77)
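As a minimal illustration of Eq. (5.77), the following Python sketch (NumPy only) computes the 3D point seen at pixel (x, y) when the light plane AX + BY + CZ + D = 0 and the focal length f are known from calibration; the function name and the assumption that x, y, and f are expressed in the same (pixel) units are illustrative.

import numpy as np

def triangulate_light_plane(x, y, f, plane):
    """Intersect the camera ray through pixel (x, y) with the light plane
    AX + BY + CZ + D = 0, following Eq. (5.77)."""
    A, B, C, D = plane
    denom = A * x + B * y + C * f
    if np.isclose(denom, 0.0):
        raise ValueError("Ray is parallel to the light plane")
    Z = -D * f / denom
    return np.array([Z * x / f, Z * y / f, Z])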

As an alternative to the projection of rays or light planes, structured light consisting of static 2D light patterns (for example, a square grid of known geometry or stripe patterns) can be projected onto the objects of the scene, and then the image of the whole scene with the superimposed light grid is acquired by the camera (see Fig. 5.19). In this way, points of interest are detected in the image from the grid patterns deformed by the curved surface of the objects of the scene. With a single acquisition of the scene, it is possible to calculate the depth map from the detected points of interest of the deformed grid. The calculation of the depth map in this case depends on the accuracy with which the projected patterns of interest (points or light strips) necessary for triangulation are determined.

5.6.1 Shape from Structured Light with Binary Coding

The most used techniques are those based on the sequential projection of coded light patterns (binary, gray level, or color) to eliminate the ambiguity in identifying patterns associated with the surface of objects at different depths. It is, therefore, necessary to uniquely determine the patterns of the multiple strips of light seen by the camera and projected onto the image plane, comparing them with those of the original pattern. The process that compares the projected patterns (for example, binary light strips) with the corresponding original patterns (known a priori) is known as pattern decoding, the equivalent of the correspondence search process in binocular vision. In essence, the decoding of the patterns consists in locating them in the image and finding their correspondence in the plane of the projector, for which it is known how they were coded.

Fig. 5.19 Light grid, with known geometry, projected onto a 3D object whose deformation points of interest are detected to reconstruct the shape of the observed curved surface


Binary light pattern projection techniques involve projecting light planes onto the objects, where each light plane is encoded with appropriate binary patterns [16]. These binary patterns are uniquely encoded by black and white strips (bands) for each plane, so that when projected in a time sequence (the strips increase their width over time) each point on the surface of the objects is associated with a single binary code distinct from the codes of other points. In other words, each point is identified by the intensity sequence it receives. If the patterns are n (i.e., the number of planes to be projected), then 2^n strips can be coded (that is, 2^n regions are identified in the image). Each strip represents a specific angle α of the projected light plane (which can be vertical or horizontal or both, depending on the type of scan). Figure 5.20 shows a set of luminous planes encoded with binary strips to be projected in a temporal sequence on the scene to be reconstructed. In the figure, the number of patterns is 5 and the coding of each plane represents the binary configuration of patterns 0 and 1, indicating light off and on, respectively. The figure also shows the temporal sequence of the patterns with the binary coding used to uniquely associate the code (lighting code) to each of the 2^5 = 32 strips. Each acquired image relative to the projected pattern is in fact a bit-plane, and together they form a bit-plane block. This block contains the n-bit sequences that establish the correspondence between all the points of the scene and their projection in the image plane (see Fig. 5.20).
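As a rough sketch of this decoding step (not the specific procedure of [16]), assuming NumPy and a stack of n images captured while projecting the n binary pattern planes, each pixel's stripe code can be recovered by thresholding and packing the bits; the reference images white and black are an illustrative thresholding choice.

import numpy as np

def decode_binary_patterns(images, white, black):
    """Decode a stack of n images (one per projected binary pattern plane)
    into a per-pixel stripe code in [0, 2**n - 1]."""
    threshold = (white.astype(np.float32) + black.astype(np.float32)) / 2.0
    code = np.zeros(images[0].shape, dtype=np.int32)
    for bit_plane in images:                      # most significant bit first
        bit = (bit_plane.astype(np.float32) > threshold).astype(np.int32)
        code = (code << 1) | bit
    return code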



Fig. 5.20 3D reconstruction of the scene by projecting in time sequence 5 pattern planes with binary. The observed surface is partitioned into 32 regions and each pixel is encoded in the example by a unique 5-digit binary code

5.6.2 Gray Code Structured Lighting

Binary coding provides two levels of light intensity encoded with 0 and 1. Binary coding can be made more robust by using the concept of the Gray code¹², where each band is encoded in such a way that two adjacent bands differ by one bit, which is the maximum possibility of error in the encoding of the bands (see Fig. 5.21). The number of images with the Gray code is the same as with the binary code, and each image is a bit-plane of the Gray code that represents the luminous pattern plane to be projected. The transformation algorithm from binary code to Gray code is a simple recursive procedure (see Algorithm 24). The inverse procedure that transforms a Gray code into a binary sequence is shown in Algorithm 25. Once the images are acquired with the patterns superimposed on the surface of the objects, the 2^n bands are uniquely codified through segmentation, and finally it is possible to calculate the relative 3D coordinates with a triangulation process and

¹² Named Gray code after Frank Gray, a researcher at Bell Laboratories, who patented it in 1953. Also known as Reflected Binary Code (RBC), it is a binary coding method where two successive values differ only by one bit (binary digit). RBC was originally designed to prevent spurious errors in various electronic devices and is today widely used in digital transmission. Basically, the Gray code is based on the Hamming distance (in this case 1), which evaluates the number of digit substitutions needed to make two strings of the same length equal.



Fig. 5.21 Example of a 5 − bit Gray code that generates 32 bands with the characteristic that adjacent bands only differ by 1 bit. It can be seen from Fig. 5.20 the comparison with structured light planes with binary coding always at 5bit

Algorithm 24 Pseudo-code to convert a binary number B to a Gray code G.

Bin2Gray(B)
  n ← length(B)
  G(1) ← B(1)
  for i ← 2 to n do
    G(i) ← B(i − 1) xor B(i)
  end for
  return G

Algorithm 25 Pseudo-code to convert a Gray code G to a binary number B.

Gray2Bin(G)
  n ← length(G)
  B(1) ← G(1)
  for i ← 2 to n do
    B(i) ← B(i − 1) xor G(i)
  end for
  return B
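The two procedures can also be written compactly on integer codes; the following sketch (plain Python, illustrative) encodes and decodes an integer stripe index, which is convenient when the per-pixel code of the previous subsection is stored as an integer.

def int_to_gray(n):
    """Convert a binary-coded integer to its Gray code (cf. Algorithm 24)."""
    return n ^ (n >> 1)

def gray_to_int(g):
    """Convert a Gray-coded integer back to binary (cf. Algorithm 25)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Example: the 32 stripe indices of a 5-bit pattern and their Gray codes.
codes = [int_to_gray(i) for i in range(32)]
assert all(gray_to_int(c) == i for i, c in enumerate(codes))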

obtain a depth map. The coordinates (X, Y, Z) of each pixel (along the 2^n horizontal bands) are calculated from the intersection between the plane passing through the vertical band and the optical projection center, and the straight line passing through the optical center of the calibrated camera and the points of the band (see Fig. 5.20), according to Eq. (5.77). The segmentation algorithm required is simple, since the binary bars are normally well contrasted on the surface of the objects and, except in shadow areas, the projected luminous pattern plane does not optically interfere with the surface itself. However, to obtain an adequate spatial resolution, several pattern planes must be projected. For example, to have a band resolution of 1024 you need log₂ 1024 = 10 pattern planes to project and then process 10 bit-plane images. Overall, the method


has the advantage of producing depth maps with high resolution and accuracy on the order of µm, and is reliable thanks to the Gray code. The limits are related to the static nature of the scene and the considerable computational time when a high spatial resolution is required.

5.6.3 Pattern with Gray Level

To improve the 3D resolution of the acquired scene and at the same time reduce the number of pattern planes, it is useful to project light pattern planes at gray levels [17] (or colored) [18]. In this way, the code base is increased with respect to binary coding. If m is the number of gray (or color) levels and n is the number of pattern planes (known as n-ary codes), we will have m^n bands, and each band is seen as a point in an n-dimensional space. For example, with n = 3 and using only m = 4 gray levels, we would have 4^3 = 64 unique codes to characterize the bands, against the 6 pattern planes required with binary coding.

5.6.4 Pattern with Phase Modulation

We have previously considered patterns based on binary coding, Gray code, and n-ary coding, which have the advantage of encoding individual pixel regions without spatially depending on neighboring pixels. A limitation of these methods is their poor spatial resolution. A completely different approach is based on Phase Shift Modulation [19,20], which consists of projecting different periodic light patterns modulated with a constant phase shift in each projection. In this way, we have a high-resolution spatial analysis of the surface with the projection of sinusoidal luminous patterns (fringe patterns) with constant phase shift (see Fig. 5.22).


Fig. 5.22 The phase shift based method involves the projection of 3 planes of sinusoidal light patterns (on the right, an image of one of the luminous fringe planes projecting into the scene is displayed) modulated with phase shift


If we consider an ideal model of image formation, every point of the scene receives the luminous fringes perfectly in focus and is not conditioned by other light sources. Therefore, the intensity of each pixel (x, y) of the images I_k, k = 1, 2, 3, acquired by projecting three planes of sinusoidal luminous fringes with a constant shift of the phase angle θ, is given by the following:

I_1(x, y) = I_o(x, y) + I_a(x, y) \cos[\phi(x_p, y_p) - \theta]
I_2(x, y) = I_o(x, y) + I_a(x, y) \cos[\phi(x_p, y_p)]
I_3(x, y) = I_o(x, y) + I_a(x, y) \cos[\phi(x_p, y_p) + \theta]        (5.78)

where I_o(x, y) is an offset that includes the contribution of other light sources in the environment, I_a(x, y) is the amplitude of the modulated light signal,¹³ and φ(x_p, y_p) is the phase of the luminous pixel of the projector which illuminates the point of the scene projected at the point (x, y) in the image plane. The phase φ(x_p, y_p) provides the matching information in the triangulation process. Therefore, to calculate the depth of the observed surface, it is necessary to recover the phase of each pixel (a process known as wrapped phase) relative to the three projections of sinusoidal fringes, starting from the three images I_k. The recovered phases are then combined to obtain a unique, unambiguous phase φ through a procedure known as phase unwrapping.¹⁴ Phase unwrapping is a trivial operation if the context of the wrapped phases is ideal. However, in real measurements various factors (e.g., presence of shadows, low-modulation fringes, nonuniform reflectivity of the object's surface, fringe discontinuities, noise) influence the phase unwrapping process. As we shall see, it is possible to use a heuristic solution to the phase unwrapping problem, which attempts to use continuity data on the measured surface to correct values that have obviously crossed the period boundary, even though it is not an ideal solution and does not completely manage the discontinuities.

¹³ The value of I_a(x, y) is conditioned by the BRDF of the scene point, by the response of the camera sensor, by the orientation of the tangent plane at that point of the scene (as seen foreshortened by the camera), and by the intensity of the projector.
¹⁴ It is known that the phase of a periodic signal is univocally defined in the main interval (−π, π). As shown in the figure, fringes with sinusoidal intensity are repeated for several periods to cover the entire surface of the objects. But this creates ambiguity (for example, 20° is equal to 380° and 740°): from the gray levels of the acquired images (5.78), the phase is calculable only up to multiples of 2π, which is known as the wrapped phase. The recovery of the original phase values from the values in the main interval is a classic problem in signal processing known as phase unwrapping. Formally, phase unwrapping means that, given the wrapped phase ψ ∈ (−π, π), we need to find the true phase φ, which is related to ψ as follows:

\psi = \mathcal{W}(\phi) = \phi - 2\pi\left\lfloor \frac{\phi}{2\pi} \right\rceil

where \mathcal{W} is the phase wrapping operator and ⌊•⌉ rounds its argument to the nearest integer. It is shown that phase unwrapping is generally a mathematically ill-posed problem and is usually solved through algorithms based on heuristics that give acceptable solutions.


In this context, the phase unwrapping process must derive the absolute (true) phase φ from the wrapped phase ψ based on the observed intensities given by the images I_k, k = 1, 2, 3 of light fringes, that is, Eq. (5.78). It should be noted that in these equations the terms I_o and I_a are not known (we will see later that they are removed), while the phase angle φ is the unknown. According to the algorithm proposed by Huang and Zhang [19], the wrapped phase is given by combining the intensities I_k as follows:

\frac{I_1(x, y) - I_3(x, y)}{2I_2(x, y) - I_1(x, y) - I_3(x, y)} = \frac{\cos(\phi - \theta) - \cos(\phi + \theta)}{2\cos(\phi) - \cos(\phi - \theta) - \cos(\phi + \theta)}
= \frac{2\sin(\phi)\sin(\theta)}{2\cos(\phi)[1 - \cos(\theta)]} \quad \text{(sum/difference trigonometric formulas)}
= \frac{\tan(\phi)\sin(\theta)}{1 - \cos(\theta)} = \frac{\tan(\phi)}{\tan(\theta/2)} \quad \text{(tangent half-angle formula)}        (5.79)

from which follows the removal of the dependence on the terms I_o and I_a. Considering the final result¹⁵ reported by (5.79), the phase angle, expressed in relation to the observed intensities, is obtained as follows:

\psi(x, y) = \arctan\left[\sqrt{3}\,\frac{I_1(x, y) - I_3(x, y)}{2I_2(x, y) - I_1(x, y) - I_3(x, y)}\right]        (5.80)

where θ = 120° is considered, for which tan(θ/2) = √3. Equation (5.80) gives the phase angle of the pixel in the local period from the intensities. To remove the ambiguity of the discontinuity of the arctangent function in 2π, we need to add or subtract multiples of 2π to the calculated phase angle ψ, which means finding the phase unwrapping (see Note 14 and Fig. 5.23) given by

\phi(x, y) = \psi(x, y) + 2\pi k(x, y) \qquad k \in \{0, \ldots, N - 1\}        (5.81)
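A minimal sketch of the wrapped-phase computation of Eq. (5.80), assuming NumPy and three fringe images acquired with θ = 120°; using arctan2 is an implementation choice that maps the result onto a full 2π range.

import numpy as np

def wrapped_phase(I1, I2, I3):
    """Per-pixel wrapped phase from three fringe images with 120-degree
    phase shifts, Eq. (5.80)."""
    num = np.sqrt(3.0) * (I1.astype(np.float64) - I3.astype(np.float64))
    den = 2.0 * I2.astype(np.float64) - I1 - I3
    psi = np.arctan2(num, den)          # in (-pi, pi]
    return np.mod(psi, 2.0 * np.pi)     # remapped to [0, 2*pi)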

where k is an integer representing the projection period and N is the number of projected fringes. It should be noted that the applied phase unwrapping process provides only the relative phase φ and not the absolute one needed to reconstruct the depth of each pixel. To estimate the 3D coordinates of the pixel, it is necessary to calculate a reference phase φ_ref with respect to which the relative phase φ of each pixel is determined by triangulation with projector and camera. Figure 5.24 shows the triangulation scheme with the reference plane and the surface of the objects onto which the fringe patterns are projected, with respect to which the phase unwrapping φ_ref is obtained together with the one relative to the object, φ. Considering the similar triangles of the figure, the following relation is obtained:

\frac{z}{Z - z} = \frac{d}{L} \qquad \text{from which} \qquad z = \frac{Z - z}{L}\,d        (5.82)

¹⁵ Obtained considering the tangent half-angle formula in the version tan(θ/2) = (1 − cos θ)/sin θ, valid for θ ≠ k · 180°.






Fig. 5.23 Illustration of the phase unwrapping process. The graph on the left shows the conversion of the phase angle φ(x, y) with module 2π, while the graph on the right shows the result of the unwrapped phase

Fig. 5.24 Calculation of the depth z by triangulation and the value of the phase difference with respect to the reference plane


where z is the height of a pixel with respect to the reference plane, L is the separation distance between projector and camera, Z is the perpendicular distance between the reference plane and the segment joining the optical centers of camera and projector, and d is the separation distance between the projection points of P (point of the object surface) on the reference plane obtained from the optical rays (of the projector and of the camera) passing through P (see Fig. 5.24). Considering Z ≫ z, (5.82) can be simplified as follows:

z \approx \frac{Z}{L}\,d \propto \frac{Z}{L}\,(\phi - \phi_{ref})        (5.83)

where the phase unwrapping φ_ref is obtained by projecting and acquiring the fringe patterns on the reference plane in the absence of the object, while φ is obtained by repeating the scan with the object present. In essence, the heights (depth) of the object's surface, once the scanning system has been calibrated (with known L and Z) and d has been determined by triangulation (a sort of disparity of the phase unwrapping), are calculated with Eq. (5.83).
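Once φ and φ_ref are available (after unwrapping), the conversion of Eq. (5.83) is a one-liner; the sketch below assumes the calibrated quantities L and Z, with an illustrative constant k folding the phase-to-distance scaling of d.

def phase_to_height(phi, phi_ref, Z, L, k=1.0):
    """Approximate height with respect to the reference plane, Eq. (5.83):
    z is proportional to (Z / L) * (phi - phi_ref)."""
    return k * (Z / L) * (phi - phi_ref)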


5.6.5 Pattern with Phase Modulation and Binary Code Previously, we have highlighted the problem of ambiguity with the use of the method based on phase shift modulation and the need to find a solution for phase unwrapping that does not resolve the absolute phase unequivocally. This ambiguity can be resolved by combining this method that projects periodic patterns with that of the Gray code pattern projection described above. For example, even projecting 3 binary patterns would have the surface of the object divided into 8 regions while projecting the periodic patterns would have an increase in spatial resolution with a more accurate reconstruction of the depth map. In fact, once the phase of a given pixel has been calculated, the period of the sinusoid where the pixel lies is obtained from the region of belonging associated with the binary code. Figure 5.25 gives an example [21] which combines the binary code method (Gray code) and the one with phase shift modulation. There are 32 binary code sequences to partition the surface, determining the phase interval unambiguously, while phase shift modulation reaches a subpixel resolution beyond the number of split regions expected by the binary code. As shown in the figure, the phase modulation is achieved by approximating the sine function with on/off intensity of the patterns generated by a projector. These patterns are then translated into steps of π/2 for a total of 4 translations. These last two methods have the advantage of operating independently from the environmental lighting conditions but have the disadvantage of requiring different light projection patterns and not suitable for scanning dynamic objects.

5.6.6 Methods Based on Colored Patterns

Methods based on the projection of sequential patterns have the problem of being unsuitable for acquiring depth maps in the context of dynamic scenes (such as moving


Fig. 5.25 Method that combines the projection of 4 Gy binary pattern planes with phase shift modulation achieved by approximating the sine function with 4 angular translations with a step of π/2 thus obtaining a sequence of 32 coded bands


people or animals, for example). In these contexts, real-time 3D image acquisition systems are used based on colored pattern information or on the projection of light patterns with a unique encoding scheme that require a single acquisition (full frame, i.e., in each pixel of the image, the distance between the camera and the point of the scene is accurately evaluated for 3D reconstruction of the scene surface in terms of coordinates (X, Y, Z ). The Rainbow 3D Camera [22] illuminates the surface to be reconstructed with light with spatially variable wavelength and establishes a one-to-one correspondence between the projection angle θ of a light plane and the particular spectral wavelength λ realizing thus a simple identification of the light patterns on each point of the surface. It basically solves the problem of correspondence that we have with the binocular system. In fact, note the baseline L and the angle of view α, the distance value corresponding to each pixel is calculated using the triangulation geometry, and the full frame image is obtained with a single acquisition at the speed of a video signal (30 f rame/sec or higher) also in relation to the spatial resolution of the sensor (for example, 1024×1300 pixels/frame). To solve the problem of occlusions in the case of complex surfaces systems have been developed that scan the surface by projecting indexed colored strips [23]. Using RGB color components with 8-bit channels, we get up to 224 different colors. With this approach, by projecting patterns of indexed colored strips, the ambiguity that occurs with the phase shift modulation method or with the method that projects multiple monochromatic strip patterns is attenuated. Another approach is to consider strips that can be distinguished from each other if they are made with pattern segments of various lengths [24]. This technique can be applied to continuous and curved and not very complex 3D surfaces otherwise it would be difficult to identify the uniqueness of the pattern segment.

5.6.7 Calibration of the Camera-Projector Scanning System As with all 3D reconstruction systems, even with structured light approaches, a camera and projection system calibration phase is provided. In the literature, there are various methods [25]. The objective is to estimate the intrinsic and extrinsic parameters of the camera and projection system with appropriate calibration according to the resolution characteristics of the camera and the projection system itself. The camera is calibrated, assuming a perspective projection model (pinhole), looking at different angles a white-black checkerboard reference plane (with known 3D geometry) and establishing a nonlinear relationship between the spatial coordinates (X, Y, Z ) of 3D points of the scene and the coordinates (x, y) of the same points projected in the image plane. The calibration of the projector depends on the scanning technology used. In the case of projection of a pattern plane with known geometry, the calibration is done by calculating the homography matrix (see Sect. 3.5 Vol. II) which establishes a relationship between points of the plane of the patterns projected on a plane and the same points observed by the camera considering known the separation distance between the projector and the camera, and known the intrinsic parameters

464

5 Shape from Shading

(focal, sensor resolution, center of the sensor plane, ...) of the camera itself. Estimated the homography matrix, it can be established, for each projected point, a relationship between projector plane coordinates and those of the image plane. The calibration of the projector is twofold providing for the calibration of the active light source that projects the light patterns and the geometric calibration seen as a normal reverse camera. The calibration of the light source of the projector must ensure the stability of the contrast through the analysis of the intensity curve providing for the projection of light patterns acquired by the camera and establishing a relationship between the intensity of the projected pattern and the corresponding values of the pixels detected from the camera sensor. The relationship between the intensities of the pixels and that of the projected patterns determines the function to control the linearity of the lighting intensity. The geometric calibration of the projector consists of considering it as a reverse camera. The optical model of the projector is the same as that of the camera (pinhole model) only the direction changes. With the inverse geometry, it is necessary to solve the difficult problem of detecting in the plane of the projector a point of the image plane, projection of a 3D point of the scene. In essence, the homography correspondence between points of the scene seen simultaneously by the camera and the projector must be established. Normally the camera is first calibrated with respect to a calibration plane for which a homography relation is established H between the coordinates of the calibration plane and those of the image plane. Then the light patterns of known geometry calibration are projected onto the calibration plane and acquired by the camera. With the homography transformation H, the known geometry patterns projected in the calibration plane are known in the reference system of the camera, that is, they are projected homographically in the image plane. This actually accomplishes the calibration of the projector with respect to the camera having established with H the geometrical transformation between points of the projector pattern plane (via the calibration plane) and with the inverse transform H −1 the mapping between image plane and pattern plane (see Sect. 3.5 Vol. II). The accuracy of the geometric calibration of the projector is strictly dependent on the initial calibration of the camera itself. The calculated depth maps and in general the 3D surface reconstruction technologies with a shape from structured light approach are widely used in the industrial applications of vision, where the lighting conditions are very variable and a passive binocular vision system would be inadequate. In this case, structured light systems can be used to have a well-controlled environment as required, for example, for robotized cells with the movement of objects for which the measurements of 3D shapes of the objects are to be calculated at time intervals. They are also applied for the reconstruction of parts of the human body (for example, facial reconstruction, dentures, 3D reconstruction in plastic surgery interventions) and generally in support of CAD systems.
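The homography step described above can be prototyped with standard tools; the following sketch, assuming OpenCV 4 and NumPy and that corresponding points between the projected calibration pattern and the camera image have already been detected, estimates H and maps projector-plane points into the image plane. It is only an outline of the idea, not the calibration procedure of [25].

import cv2
import numpy as np

def estimate_projector_homography(pattern_pts, image_pts):
    """Estimate the homography H that maps points of the projected pattern
    plane (projector coordinates) to their observations in the camera image.
    Both inputs are Nx2 arrays of corresponding points (N >= 4)."""
    H, _ = cv2.findHomography(np.asarray(pattern_pts, dtype=np.float32),
                              np.asarray(image_pts, dtype=np.float32),
                              cv2.RANSAC, 3.0)
    return H

def map_pattern_to_image(H, pts):
    """Apply H to pattern-plane points (Nx2) and return image coordinates."""
    pts = np.asarray(pts, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)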


5.7 Shape from (de)Focus This technique is based on the depth of field of optical systems that is known to be finite. Therefore, only objects that are in a given depth interval that depends on the distance between the object and the observer and the characteristics of the optics used are perfectly in focus in the image. Outside this range, the object in the image is blurred in proportion to its distance from the optical system. Remember to have used (see Sect. 9.12.6 Vol. I) as a tool for blurring the image the convolution process by means of appropriate filters (for example Gaussian filter, binomial, etc.) and that the same process of image formation is modelable as a process of convolution (see Chap. 1 Vol. I) which intrinsically introduces blurring into the image. By proceeding in the opposite direction, that is, from the estimate of blurring observed in the image, it is possible to estimate a depth value knowing the parameters of the acquisition system (focal length, aperture of the lens, etc.) and the transfer function with which it is possible to model the blurring (for example, convolution with Gaussian filter). This technique is used when one wants to obtain qualitative information of the depth map or when one wants to integrate the depth information with that obtained with other techniques (data fusion integrating, for example, with depth maps obtained from stereo vision and stereo photometry). Depth information is estimated with two possible strategies: 1. Shape from Focus (SfF): It requires the acquisition of a sequence of images of the scene by varying the acquisition parameters (object–optical-sensor distances) thus generating images from different blurring levels up to the maximum sharpness. The objective is to search in the sequence of images for maximum sharpness and, taking note of the current parameters of the system, to evaluate the depth information for each point on the scene surface. 2. Shape from Defocus (SfD): The depth information is estimated by capturing at least two blurred images and by exploring blurring variation in the images acquired with different settings of optical-sensor system parameters.

5.7.1 Shape from Focus (SfF)

Figure 5.26 shows the basic geometry of the image formation process on which the shape from focus approach proposed in [26] is based. The light reflected from a point P of the scene is refracted by the lens and converges at the point Q in the image plane. From the Gaussian law of a thin lens (see Sect. 4.4.1 Vol. I), we have the relation between the distance p of the object from the lens, the distance q of the image plane from the lens, and the focal length f of the lens, given by

\frac{1}{p} + \frac{1}{q} = \frac{1}{f}        (5.84)


According to this law, points of the object plane are projected into the image plane (where the sensor is normally placed) and appear as well-focused luminous points, thus forming in this plane the image I_f(x, y) of the scene, which results perfectly in focus. If the plane of the sensor does not coincide with the image plane but is shifted by a distance δ (before or after the focused image plane; in the figure it is translated after), the light coming from the point P of the scene and refracted by the lens undergoes a dispersion, and in the sensor plane the projection of P in Q is blurred due to the dispersion of light and appears as a blurred circular luminous spot, assuming a circular aperture of the lens.¹⁶ This physical blurring process occurs at all points in the scene, resulting in a blurred image I_s(x, y) in the sensor plane. Using similar triangles (see Fig. 5.26), it is possible to derive a formula that establishes the relationship between the radius of the blurred disk r and the displacement δ of the sensor plane from the focal plane, obtaining

\frac{r}{R} = \frac{\delta}{q} \qquad \text{from which} \qquad r = \frac{\delta R}{q}        (5.85)

where R is the radius of the lens (or aperture). From Fig. 5.26, we observe that the displacement of the sensor plane from the focused image plane is given by δ = i − q. It is pointed out that the intrinsic parameters of the optics and camera system are (i, f, and R). The dispersion function that models point blurring in the sensor plane can be modeled in physical optics.¹⁷ The approximation of the physical model of point blurring can be achieved with the two-dimensional Gaussian function, in the hypothesis of limited diffraction and incoherent illumination.¹⁸ Thus, the blurred image I_s(x, y) can be obtained through the convolution of the image in focus I_f(x, y) with the Gaussian PSF h(x, y), as follows:

I_s(x, y) = I_f(x, y) \ast h(x, y)        (5.86)

¹⁶ This

circular spot is also known as confusion circle or confusion disk in photography or blur circle, blur spot in image processing. 17 Recall from Sect. 5.7 Vol. I that in the case of circular openings the light intensity distribution occurs according to the Airy pattern, a series of concentric rings that are always less luminous due to the diffraction phenomenon. This distribution of light intensity on the image (or sensor) plane is known as the dispersion function of a luminous point (called PSF—Point Spread Function). 18 Normally, the formation of images takes place in conditions of illumination from natural (or artificial) incoherent radiation or from (normally extended) non-monochromatic and unrelated sources where diffraction phenomena are limited and those of interference cancel each other out. The luminous intensity in each point is given by the sum of the single radiations that are incoherent with each other or that do not maintain a constant phase relationship. The coherent radiations are instead found in a constant phase relation between them (for example, the light emitted by a laser).

Fig. 5.26 Basic geometry in the image formation process with a convex lens

with

h(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma_h^2}\, e^{-\frac{x^2 + y^2}{2\sigma_h^2}}        (5.87)

where the symbol ∗ indicates the convolution operator and σ_h is the dispersion parameter (constant for each point P of the scene, assuming the convolution is a spatially invariant linear transformation) that controls the level of blurring, corresponding to the standard deviation of the 2D Gaussian Point Spread Function (PSF); it is assumed to be proportional [27,28] to the radius r. Blurred image formation can also be analyzed in the frequency domain, where it is characterized by the Optical Transfer Function (OTF), which corresponds to the Fourier transform of the PSF. Indicating with \mathcal{I}_f(u, v), \mathcal{I}_s(u, v), and \mathcal{H}(u, v) the Fourier transforms, respectively, of the image in focus, the blurred image, and the Gaussian PSF, the convolution expressed by (5.86) in the Fourier domain results in the following:

\mathcal{I}_s(u, v) = \mathcal{I}_f(u, v) \cdot \mathcal{H}(u, v)        (5.88)

where

\mathcal{H}(u, v) = e^{-\frac{(u^2 + v^2)\,\sigma_h^2}{2}}        (5.89)

From Eq. (5.89), which represents the optical transfer function of the blur process, its dependence on the dispersion parameter σh is explicitly observed, and indirectly depends on the intrinsic parameters of the optical and camera system considering that σh ∝ k ·r is dependent on r unless a proportionality factor k is, in turn, dependent on the characteristics of the camera and can be determined from a previous calibration of the same camera. Considering the circular symmetry of the OTF, expressed by (5.89), with still Gaussian form, the blurring is due only to the passage of low frequencies and the cutting of high frequencies in relation to the increase of σh , in turn conditioned to


the increase of δ and consequently of r according to (5.85). Therefore, the blurring of the image is attributable to a low-pass filtering operator, where the bandwidth decreases as the blur increases (the standard deviation of the Gaussian that model the PSF and OTF are inversely proportional to each other, for details see Sect. 9.13.3 Vol. I). So far we have examined the physical–optical aspects of the process of forming an image (in focus or out of focus), together with the characteristic parameters of the acquisition system. Now let’s see how, from one or more images, it is possible to determine the depth map of the scene. From the image formation scheme of Fig. 5.26, it emerges that a defocused image can be obtained in different ways: 1. Translating the sensor plane with respect to the image plane where the scene is in perfect focus. 2. Translating the optical system. 3. By translating the objects of the scene relative to the object plane, against which, the optical system focuses on the image plane the scene. Normally, of a 3D object only the points belonging to the object plane are perfectly in focus, all the other points, before and after the object plane, are acceptably or less in focus, in relation to the depth of field of the system optical. The mutual translation between the optical system and the sensor plane (modes 1 and 2) introduces a scale factor (of apparent reduction or enlargement) of the scene by varying the coordinates in the image plane of the points of the scene and a variation of intensity in the acquired image, caused by the different distribution of irradiance in the sensor plane. These drawbacks are avoided by acquiring images translating only the scene (mode 3) with respect to a predetermined setting of the optical-sensor system, thus keeping the scale factor of the acquired scene constant. Figure 5.27 shows the functional scheme of an approach shape from focus proposed in [26]. We observe the profile of the surface S of the unknown scene whose depth is to be calculated and in particular a surface element ( patch) s is highlighted. We distinguish a reference base with respect to which the distance d f of the focused object plane is defined and the distance d from the object-carrying translation basis is defined simultaneously. These distances d f and d can be measured with controlled resolution. Now consider the patch s and the situation where the base moves toward the focused object plane (i.e., toward the camera). It will have that in the images in acquisition the patch s will tend to be more and more in focus reaching the maximum when the base reaches the distance d = dm , and then begin the process of defocusing as soon as it exceeds the focused object plane. If for each step d of translation, the distance d of the base is registered and the blur level of the patch s, we can evaluate the estimate of the height (depth) ds = d f − dm at the value d = dm where the patch has the highest level of focus. This procedure is applied for any patch on the surface S. Once the system has been calibrated, from the height ds , the depth of the surface can also be calculated with respect to the sensor plane or other reference plane.

Fig. 5.27 Functional scheme of the Shape from Focus approach

Once the mode of acquisition of image sequences has been defined, to determine the depth map it is necessary to define a measurement strategy of the level of blurring of the points of the 3D objects, not known, placed on a mobile base. In the literature various metrics have been proposed to evaluate in the case of Shape from Focus (SfF) the progression of focusing of the sequence of images until the points of interest of the scene are in sharp focus, while in the case of Shape from Defocus (SfD) the depth map is reconstructed from the blurring information of several images. Most of the proposed SfF metrics [26,29,30] measure the level of focus by considering local windows (which include a surface element) instead of the single pixel. The goal is to automatically extract the patches of interest with the dominant presence of strong local intensity variation through ad hoc operators that evaluate from the presence of high frequencies the level of focus of the patches. In fact, patches with high texture, perfectly in focus, give high responses to high-frequency components. Such patches, with maximum responses to high frequencies can be detected by analyzing the sequence of images in the Fourier domain or the spatial domain. In Chap. 9 Vol. I, several local operations have been described for both domains characterized by different high-pass filters. In this context, the linear operator of Laplace (see Sect. 1.12 Vol. II) is used, based on the differentiation of the second order, which accentuates the variations in intensity and is found to be isotropic. Applied for the image I (x, y), the Laplacian ∇ 2 is given by ∇ 2 I (x, y) =

\frac{\partial^2 I(x, y)}{\partial x^2} + \frac{\partial^2 I(x, y)}{\partial y^2} = I(x, y) \ast h_{\nabla}(x, y)        (5.90)

calculable in each pixel (x, y) of the image. Equation (5.90), in the last expression, also expresses the Laplacian operator in terms of convolution, considering the function PSF Laplacian h ∇ (x, y) (described in detail in Sect. 1.21.3 Vol. II). In the frequency domain, indicating with F the Fourier transform operator, the Laplacian of image I (x, y) is given by F {∇ 2 I (x, y)} = L∇ (u, v) = H ∇ (u, v) · I(u, v) = −4π 2 (u 2 + v2 )I(u, v) (5.91)


which is equivalent to multiplying the spectrum \mathcal{I}(u, v) by a factor proportional to the frequencies (u² + v²). This leads to the accentuation of the high spatial frequencies present in the image. Applying the Laplacian operator to the blurred image I_s(x, y) of (5.86) in the spatial domain, by (5.90), we have

\nabla^2 I_s(x, y) = h_{\nabla}(x, y) \ast I_s(x, y) = h_{\nabla}(x, y) \ast [h(x, y) \ast I_f(x, y)]        (5.92)

where we recall that h(x, y) is the Gaussian PSF. For the associative property of convolution, the previous equation can be rewritten as follows:

\nabla^2 I_s(x, y) = h(x, y) \ast [h_{\nabla}(x, y) \ast I_f(x, y)]        (5.93)

Equation (5.93) tells us that, instead of directly applying the Laplacian operator to the blurred image I_s as in (5.92), it is also possible to apply it first to the focused image I_f and then blur the result with the Gaussian PSF. In this way, the Laplacian extracts only the high spatial frequencies of I_f, which are subsequently attenuated by the Gaussian blurring, useful for attenuating the noise normally present in the high-frequency components. In the Fourier domain, the application of the Laplacian operator to the blurred image, considering also Eqs. (5.89) and (5.91), results in the following:

\mathcal{I}_s(u, v) = \mathcal{H}(u, v) \cdot \mathcal{H}_{\nabla}(u, v) \cdot \mathcal{I}_f(u, v) = -4\pi^2 (u^2 + v^2)\, e^{-\frac{(u^2 + v^2)\,\sigma_h^2}{2}}\, \mathcal{I}_f(u, v)        (5.94)

We highlight how in the Fourier domain, for each frequency (u, v), the transfer function H(u, v) · H ∇ (u, v) (produced between the Laplacian operator and Gaussian blurring filter) has a Gaussian distribution controlled by the blurring parameter σh . Therefore, a sufficiently textured image of the scene will present a richness of high frequencies emphasized by the Laplacian filter H ∇ (u, v) and attenuated by the contribution of the Gaussian filter according to the value of σh . The attenuation of high frequencies is almost nil (ignoring any blurring due to the optical system) when the image of the scene is in focus with σh = 0. If the image is not well and uniformly textured, the Laplacian operator does not guarantee a good measure of image focusing as the operator would hardly select dominant high frequencies. Any noise present in the image (due to the camera sensor) would introduce high spurious frequencies altering the focusing measures regardless of the type of operator used. Normally, noise would tend to violate the spatial invariance property of the convolution operator (i.e., the PSF would vary spatially in each pixel h σ (x, y)). To mitigate the problems caused by the noise by working with real images, the focusing measurements obtained with the Laplacian operator are calculated locally in each pixel (x, y) by adding the significant ones included in a support window x,y of n × n size and centered in the pixel in elaboration (x, y). The focus measures


The focus measure MF(x, y) is then given by

MF(x,y) = \sum_{(i,j)\in\Omega_{x,y}} \nabla^2 I(i,j) \qquad \text{for} \quad \nabla^2 I(i,j) \ge T \qquad (5.95)

where T indicates a threshold value beyond which the Laplacian at a pixel is considered significant within the support window Ω of the Laplacian operator. The size of Ω (normally a square window equal to or greater than 3 × 3) is chosen in relation to the size of the texture of the image. It is evident that with the Laplacian the partial second derivatives in the horizontal and vertical directions can have equal and opposite values, i.e., I_{xx} = −I_{yy} ⟹ ∇²I = 0, thus canceling each other out. In this case, the operator would produce incorrect responses, even in the presence of texture, since the contributions of the high frequencies associated with this texture would cancel. To prevent the cancelation of such high frequencies, Nayar and Nakagawa [26] proposed a modified version of the Laplacian, known as the Modified Laplacian (ML), given by

\nabla^2_M I = \left|\frac{\partial^2 I}{\partial x^2}\right| + \left|\frac{\partial^2 I}{\partial y^2}\right| \qquad (5.96)

Compared to the original Laplacian, the modified one is always greater than or equal to it. To adapt to the possible dimensions of the texture, it is also proposed to calculate the partial derivatives using a variable step s ≥ 1 between the pixels belonging to the window Ω_{x,y}. The discrete approximation of the modified Laplacian is given by

\nabla^2_{MD} I(x,y) = |{-I(x+s,y) + 2I(x,y) - I(x-s,y)}| + |{-I(x,y+s) + 2I(x,y) - I(x,y-s)}| \qquad (5.97)

The final focus measure SML(x, y), known as the Sum-Modified-Laplacian, is calculated as the sum of the values of the modified Laplacian \nabla^2_{MD} I:

SML(x,y) = \sum_{(i,j)\in\Omega_{x,y}} \nabla^2_{MD} I(i,j) \qquad \text{for} \quad \nabla^2_{MD} I(i,j) \ge T \qquad (5.98)
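By way of illustration, a minimal Python sketch of the modified Laplacian (5.97) and of the SML measure (5.98) is given below; the function name, the default values of the step s, the window size n, and the threshold T, as well as the border handling, are arbitrary choices of this sketch and not prescriptions of the method.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sum_modified_laplacian(img, s=1, n=3, T=0.0):
    """SML focus measure, cf. (5.97)-(5.98), for a grayscale image."""
    I = np.asarray(img, dtype=np.float64)
    # Modified Laplacian (5.97) with step s; np.roll wraps around at the borders,
    # an approximation acceptable away from the image edges.
    ml = np.abs(np.roll(I, -s, axis=1) - 2.0 * I + np.roll(I, s, axis=1)) \
       + np.abs(np.roll(I, -s, axis=0) - 2.0 * I + np.roll(I, s, axis=0))
    ml[ml < T] = 0.0                                 # keep only significant responses
    # Sum over the n x n support window: windowed mean times n*n equals the sum in (5.98)
    return uniform_filter(ml, size=n, mode='nearest') * (n * n)
```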

Several other focusing operators are reported in the literature: operators based on the gradient (i.e., on the first derivative of the image), which, in analogy with the Laplacian operator, evaluate the edges present in the image; operators based on the coefficients of the discrete wavelet transform, which analyze the content of the image in the frequency and spatial domains and use these coefficients to measure the level of focus; and operators based on the discrete cosine transform (DCT), on the median filter, and on statistical methods (local variance, texture, etc.). A comparative evaluation of different focus measurement operators is reported in [31]. Once the focus measurement operator has been defined, the depth estimate of each point (x, y) of the surface is obtained from the set of focus measurements related to the sequence of m images acquired according to the scheme of Fig. 5.27. For each image of the sequence, the focus measure is calculated with (5.98) (or with other measurement methods) for each pixel, using a support window Ω_{x,y} of size n × n ≥ 3 × 3 centered at the pixel (x, y) being processed.


We now denote by dm(x, y) the depth of a point (x, y) of the surface, i.e., the index at which the focus measure attains its maximum among all the measures SML_i(x, y) computed at corresponding pixels of the m images of the sequence. The depth map obtained for all points of the surface is given by

dm(x,y) = \arg\max_i \{\, SML_i(x,y) \,\}, \qquad i = 1,\dots,m \qquad (5.99)
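A minimal sketch of the depth-map construction of (5.99) follows: the focus measure is evaluated on each of the m images and the per-pixel arg max along the stack gives the depth index. The helper `sum_modified_laplacian` is the hypothetical function sketched above, and the mapping from index to metric depth is left to the known acquisition settings.

```python
import numpy as np

def depth_from_focus(image_stack, focus_measure):
    """image_stack: sequence of m grayscale images acquired at known focus settings.
    Returns the per-pixel index i that maximizes the focus measure, cf. (5.99)."""
    sml = np.stack([focus_measure(I) for I in image_stack], axis=0)   # shape (m, H, W)
    return np.argmax(sml, axis=0)     # depth index map, to be mapped to metric depth
```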

If the sequence of images is instead obtained by continuously varying the parameters of the optics–sensor system, for example by varying the distance between the optics and the sensor, the depth values for each point of the surface are calculated by selecting from the sequence SML_i(x, y) the image that corresponds to the maximum focus measure. This determines the optics–sensor distance and, by applying the thin lens formula (5.84) with the known focal length of the optics, it is possible to calculate the depth. SfF cannot be used for all applications. The SML measure assumes that the focused image of an object is entirely flat, as happens in microscope applications; for more complex objects the depth map is not accurate. Subbarao and Choi [30] proposed a different approach to SfF, trying to overcome the limitations of SML by introducing the concept of Focused Image Surface (FIS), which approximates the surface of a 3D object. In essence, the object is represented by the FIS surface, that is, by the set of points of the object that in this case are planar surface elements (patches). Starting from the initial estimate of the FIS, the focus measure is recalculated for all possible planar surface patches in the small cubic volume formed by the sequence of images. The patches with the highest focus measure are selected to extract the FIS surface. The search process is based on a brute-force approach and requires considerable computation. For the reconstruction of complex surfaces, the traditional SfF approaches do not give good results and have the drawback of being normally slow and computationally demanding.

5.7.2 Shape from Defocus (SfD)

In the previous section, we have seen the focusing method based on setting the parameters of the optics–sensor–object system according to the thin lens formula (5.84), in order to obtain an image of the scene in focus. We have also seen which parameters are involved and how to model the process of image defocusing. We will take up this last aspect in order to formulate the SfD method, which relates the object–lens distance (depth), the lens–sensor parameters, and the parameters that control the level of blur, in order to derive the depth map. Pentland [27] derived from (5.84) an equation that relates the radius r of the blurred circular spot to the depth p of a scene point. We analyze this relationship to extract a dense depth map with the SfD approach. Returning to Fig. 5.26, if the sensor plane does not coincide with the focal image plane, a blurred image is obtained on the sensor plane I_s.


Each bright point of the scene is spread into a blurred spot covering a circular window of pixels, known precisely as the circle of confusion, of radius r. We have seen in the previous section the relation (5.85), which links the radius r of this circle to the circular aperture of the lens of radius R, to the displacement δ of the sensor plane with respect to the focal image plane, and to the distance i between the sensor plane PS and the center of the lens. The figure shows the two situations in which the object is displaced by Δp beyond the object plane PO (on this plane it would be perfectly in focus on the focal plane PF) and the opposite situation with the object displaced by Δp but closer to the lens. In the two situations, according to Fig. 5.26, the displacement δ is, respectively,

\delta = i - q \qquad \text{or} \qquad \delta = q - i \qquad (5.100)

where i indicates the distance between the lens and the sensor plane and q the distance of the focal image plane from the lens. A characteristic of the optical system is the so-called f-number, here indicated with f_{\#} = f/(2R), which expresses the ratio between the focal length f and the diameter 2R of the lens (described in Sect. 4.5.1 Vol. I). If we express the radius R of the lens in terms of f_{\#}, we have R = f/(2 f_{\#}), which, substituted together with the first equation of (5.100) into (5.85), gives the following relation for the radius r of the blurred circle:

r = \frac{f\cdot i - f\cdot q}{2 f_{\#}\, q} \qquad (5.101)

In addition, solving (5.101) with respect to q and substituting into the thin lens formula (5.84), q is eliminated and we obtain

r = \frac{p\,(i - f) - f\cdot i}{2 f_{\#}\, p} \qquad (5.102)

Solving (5.102) for the depth p, we finally get

p = \frac{f\cdot i}{\,i - f - 2 f_{\#}\, r\,} \;\;\text{if } \delta = i - q, \qquad\qquad p = \frac{f\cdot i}{\,i - f + 2 f_{\#}\, r\,} \;\;\text{if } \delta = q - i \qquad (5.103)
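As a direct transcription of (5.103), a small helper is sketched below; all quantities are assumed to be expressed in consistent units (e.g., millimetres), and the boolean flag selecting the branch is an assumption of this sketch.

```python
def depth_from_blur_radius(r, f, i, f_number, delta_is_i_minus_q=True):
    """Depth p from the blur-circle radius r, Eq. (5.103).
    f: focal length, i: lens-sensor distance, f_number: f/# of the lens."""
    if delta_is_i_minus_q:                            # case delta = i - q
        return f * i / (i - f - 2.0 * f_number * r)
    return f * i / (i - f + 2.0 * f_number * r)       # case delta = q - i
```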

It is pointed out that Eqs. (5.103), valid in the context of geometric optics, relate the depth p of a point of the scene to the corresponding radius r of the blurring circle. Furthermore, Pentland proposed to consider the spread σ_h of the Gaussian PSF h_σ proportional to the radius r of the blurring circle up to a factor k:

\sigma_h = k \cdot r \qquad (5.104)

with k to be determined experimentally in relation to the acquisition system used (according to the characteristics of the optics and the resolution of the sensor). Figure 5.28 shows the graph that relates the theoretical radius r of the blurring circle to the depth p of an object, according to (5.103), considering an optic with focal length f = 50 mm set to f_{\#} = 4 (recall that f_{\#} is dimensionless), with the object well focused at 1 m and the distance i of the sensor plane kept constant while the depth p varies.

Fig. 5.28 Graph showing the dependence of the radius r of the blurring circle (mm) on the depth p (×100 mm) for an optic with focal f = 25 mm and f/number 4. The distance i of the sensor plane remains constant and the image is well focused at the distance of 1 m

With (5.103) and (5.104), the problem of calculating the depth p is reduced to the estimation of the blurring parameter σ_h and of the radius r of the blurring circle, once the intrinsic parameters (f, f_{\#}, i) and the extrinsic parameter (k) of the acquisition system are known (through calibration). In fact, Pentland [27] proposed a method that requires the acquisition of at least two images with different settings of the system parameters, in order to detect the different levels of blurring and derive a depth estimate with (5.103). To calibrate the system, an image of the perfectly focused scene (zero blurring) is initially acquired by adequately setting the acquisition parameters (large f_{\#}), against which the system is calibrated. A perspective projection of the scene (pinhole model) is assumed to derive the relation between the blurring function and the setting parameters of the optics–sensor system. The objective is to estimate the depth by evaluating the difference of the PSF between a pair of blurred images; hence the name SfD. The idea of Pentland is to emulate human vision, which is capable of evaluating the depth of the scene based on similar principles, since the focal length of the human visual system varies sinusoidally around a frequency of 2 Hz. The blurring model considered is the one modeled by the convolution (5.86) between the image of the perfectly focused scene and the Gaussian blurring function (5.87), which in this case is indicated with h_{σ(p,e)}(x, y), to express that the defocusing (blurring) depends on the distance p of the scene from the optics and on the setting parameters e = (i, f, f_{\#}) of the optics–sensor system. We rewrite the blurring model, given by the convolution equation and the Gaussian PSF, as follows:

I_s(x,y) = I_f(x,y) \ast h_{\sigma(p,e)}(x,y) \qquad (5.105)

with

h_{\sigma(p,e)}(x,y) = \frac{1}{2\pi\sigma_h^2}\, e^{-\frac{x^2+y^2}{2\sigma_h^2}} \qquad (5.106)
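The blurring model (5.105)–(5.106) can be emulated numerically, for instance with a Gaussian filter, as in the following sketch; the test image and the value of σ_h are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def defocus(image_in_focus, sigma_h):
    """Synthesize a defocused image I_s = I_f * h_sigma, cf. (5.105)-(5.106)."""
    return gaussian_filter(np.asarray(image_in_focus, dtype=np.float64), sigma=sigma_h)

# e.g., Is1 = defocus(If, 1.5) and Is2 = defocus(If, 3.0) emulate two different settings
```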


Analyzing these equations, it is observed that the defocused image I_s(x, y) is known from the acquisition, while the parameters e are known from the calibration of the system. The depth p and the focused image I_f(x, y) are instead unknown. The idea is to acquire at least two defocused images with different settings e_1 and e_2 of the system parameters, to get at least a theoretical estimate of the depth p. Equation (5.105) is not linear with respect to the unknown p, and therefore cannot be used to solve the problem directly, also because of numerical stability issues if a minimization functional is used. Pentland proposed to solve for the unknown p by operating in the Fourier domain. In fact, if we consider the two acquired defocused images, modeled by the two spatial convolutions

I_{s1}(x,y) = I_f(x,y) \ast h_{\sigma_1}(x,y) \qquad\qquad I_{s2}(x,y) = I_f(x,y) \ast h_{\sigma_2}(x,y) \qquad (5.107)

and take the ratio of the corresponding Fourier transforms (recalling (5.106)), we obtain

\frac{\mathcal{I}_{s1}(u,v)}{\mathcal{I}_{s2}(u,v)} = \frac{\mathcal{I}_f(u,v)\, H_{\sigma_1}(u,v)}{\mathcal{I}_f(u,v)\, H_{\sigma_2}(u,v)} = \frac{H_{\sigma_1}(u,v)}{H_{\sigma_2}(u,v)} = \exp\!\left[-\tfrac{1}{2}(u^2+v^2)(\sigma_1^2-\sigma_2^2)\right] \qquad (5.108)

where σ_1 = σ(p, e_1) and σ_2 = σ(p, e_2). Applying now the natural logarithm to the first and last members of (5.108), we get

\ln\frac{\mathcal{I}_{s1}(u,v)}{\mathcal{I}_{s2}(u,v)} = \frac{1}{2}(u^2+v^2)\left[\sigma^2(p,e_2) - \sigma^2(p,e_1)\right] \qquad (5.109)

where the ideal perfectly focused image I_f cancels out. Knowing the transforms \mathcal{I}_{s1} and \mathcal{I}_{s2}, and calibrating the functions σ(p, e_1) and σ(p, e_2), it is possible to derive from (5.109) the term (σ_1^2 − σ_2^2), given by

\sigma_1^2 - \sigma_2^2 = \left\langle \frac{-2}{u^2+v^2}\, \ln\frac{\mathcal{I}_{s1}(u,v)}{\mathcal{I}_{s2}(u,v)} \right\rangle_W \qquad (5.110)

where the term ⟨·⟩_W denotes an average calculated over an extended area W of the spectral domain, instead of considering a single frequency (u, v) of that domain. If one of the images is perfectly in focus, we have σ_1 = 0, σ_2 is estimated by (5.110), and the depth p is calculated with (5.103). If, on the other hand, the two images are defocused due to the different settings of the system parameters, we will have σ_1 > 0 and σ_2 > 0 and two different values i_1 and i_2 of the distance between the image plane and the lens. Substituting these values in (5.103), we have

p = \frac{f\, i_1}{\,i_1 - f - 2 r_1 f_{\#}\,} = \frac{f\, i_2}{\,i_2 - f - 2 r_2 f_{\#}\,} \qquad (5.111)

and, considering the proportionality relation σ_h = k·r, we can derive a linear relationship between σ_1 and σ_2, given by

\sigma_1 = \alpha\,\sigma_2 + \beta \qquad (5.112)


where

\alpha = \frac{i_1}{i_2} \qquad \text{and} \qquad \beta = \frac{f\, i_1\, k}{2 f_{\#}}\left(\frac{1}{i_2} - \frac{1}{i_1}\right) \qquad (5.113)

In essence, we now have two equations that establish a relationship between σ_1 and σ_2: (5.112) in terms of the known parameters of the optics–sensor system, and (5.110) in terms of the level of blur between the two defocused images derived from the convolutions. Both are useful for determining depth. In fact, from (5.110) we have the constraint σ_1^2 − σ_2^2 = C; replacing in it the value of σ_1 given by (5.112), we get an equation in the single unknown σ_2 [32]:

(\alpha^2 - 1)\,\sigma_2^2 + 2\alpha\beta\,\sigma_2 + \beta^2 = C \qquad (5.114)

where

C = \frac{1}{A}\iint_W \frac{-2}{u^2+v^2}\, \ln\frac{\mathcal{I}_{s1}(u,v)}{\mathcal{I}_{s2}(u,v)}\; du\, dv \qquad (5.115)

The measurement of the defocusing difference C = σ_1^2 − σ_2^2 in the Fourier domain is calculated by considering the average of the values over a window W of frequencies centered at the point (u, v) being processed in the images, and A is the area of the window W. With (5.114), we have a quadratic equation to estimate σ_2. If the principal distances are i_1 = i_2, we have α = 1, obtaining a single value of σ_2. Once the parameters of the optics–sensor system are known, the depth p can be calculated with one of the two Eqs. (5.103). The procedure is repeated for each pixel of the image, thus obtaining a dense depth map from the acquisition of only two defocused images with different settings of the acquisition system. The SfD approaches described are essentially based on the measurement of the defocusing level of multiple images acquired with different settings of the parameters of the acquisition system. This measurement is estimated for each pixel, often considering also the neighboring pixels included in a square window of adequate size, assuming that the projected points of the scene have constant depth. The use of this local window also tends to average out the noise and minimize artifacts. In the literature, SfD methods have been proposed based on global algorithms that operate simultaneously on the whole image, under the hypothesis that image intensity and shape are spatially correlated, although the image formation process tends to lose intensity–shape information. This leads to a typically ill-posed problem, for which solutions have been proposed based on regularization [33], which introduces minimization functionals, thus transforming an ill-posed problem into a problem of numerical approximation or energy minimization, or based on formulations with Markov random fields (MRF) [34], or on a diffusion process governed by differential equations [35].
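To summarize the two-image procedure in operational form, the following Python sketch estimates C as in (5.110)/(5.115) from the spectra of two defocused patches and then solves (5.114) for σ_2; the use of normalized FFT frequencies, of spectrum magnitudes, and the choice of the quadratic root are simplifying assumptions of this sketch, not part of the original formulation. The depth p would then follow from σ_2 through σ_h = k·r and (5.103).

```python
import numpy as np

def sigma2_from_two_images(Is1, Is2, alpha, beta, eps=1e-6):
    """Pentland-style SfD sketch: estimate sigma_2 from two differently defocused
    patches of the same (locally constant-depth) region, cf. (5.110), (5.112)-(5.115)."""
    F1, F2 = np.fft.fft2(Is1), np.fft.fft2(Is2)
    u = np.fft.fftfreq(Is1.shape[0])[:, None]         # normalized frequencies (cycles/sample)
    v = np.fft.fftfreq(Is1.shape[1])[None, :]
    w2 = u**2 + v**2
    mask = w2 > eps                                   # exclude the DC component
    ratio = (np.abs(F1) + eps) / (np.abs(F2) + eps)   # |I_s1| / |I_s2|
    C = np.mean(-2.0 / w2[mask] * np.log(ratio[mask]))   # average over the window W
    # Solve (alpha^2 - 1) s^2 + 2 alpha beta s + beta^2 - C = 0 for s = sigma_2
    if abs(alpha - 1.0) < 1e-9:                       # degenerate case: the equation is linear
        return (C - beta**2) / (2.0 * beta)
    a, b, c = alpha**2 - 1.0, 2.0 * alpha * beta, beta**2 - C
    disc = max(b * b - 4.0 * a * c, 0.0)
    return (-b + np.sqrt(disc)) / (2.0 * a)           # root of interest (assumption)
```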

References


1. B.K.P. Horn, Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque Object from One View. Ph.D. thesis (MIT, Boston-USA, 1970)
2. E. Mingolla, J.T. Todd, Perception of solid shape from shading. Biol. Cybern. 53, 137–151 (1986)
3. V.S. Ramachandran, Perceiving shape from shading. Sci. Am. 159, 76–83 (1988)
4. K. Ikeuchi, B.K.P. Horn, Numerical shape from shading and occluding boundaries. Artif. Intell. 17, 141–184 (1981)
5. A.P. Pentland, Local shading analysis. IEEE Trans. Pattern Anal. Mach. Intell. 6, 170–184 (1984)
6. R.J. Woodham, Photometric method for determining surface orientation from multiple images. Opt. Eng. 19, 139–144 (1980)
7. H. Hayakawa, Photometric stereo under a light source with arbitrary motion. J. Opt. Soc. Am.-Part A: Opt., Image Sci., Vis. 11(11), 3079–3089 (1994)
8. P.N. Belhumeur, D.J. Kriegman, A.L. Yuille, The bas-relief ambiguity. J. Comput. Vis. 35(1), 33–44 (1999)
9. B. Horn, M.J. Brooks, The variational approach to shape from shading. Comput. Vis., Graph. Image Process. 33, 174–208 (1986)
10. E.N. Coleman, R. Jain, Obtaining 3-dimensional shape of textured and specular surfaces using four-source photometry. Comput. Graph. Image Process. 18(4), 1309–1328 (1982)
11. K. Ikeuchi, Determining the surface orientations of specular surfaces by using the photometric stereo method. IEEE Trans. Pattern Anal. Mach. Intell. 3(6), 661–669 (1981)
12. R.T. Frankot, R. Chellappa, A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 439–451 (1988)
13. R. Basri, D.W. Jacobs, I. Kemelmacher, Photometric stereo with general, unknown lighting. Int. J. Comput. Vis. 72(3), 239–257 (2007)
14. T. Wei, R. Klette, On depth recovery from gradient vector fields, in Algorithms, Architectures and Information Systems Security, ed. by B.B. Bhattacharya (World Scientific Publishing, London, 2009), pp. 75–96
15. K. Reinhard, Concise Computer Vision, 1st edn. (Springer, London, 2014)
16. J.L. Posdamer, M.D. Altschuler, Surface measurement by space-encoded projected beam systems. Comput. Graph. Image Process. 18(1), 1–17 (1982)
17. E. Horn, N. Kiryati, Toward optimal structured light patterns. Int. J. Comput. Vis. 17(2), 87–97 (1999)
18. D. Caspi, N. Kiryati, J. Shamir, Range imaging with adaptive color structured light. IEEE Trans. PAMI 20(5), 470–480 (1998)
19. P.S. Huang, S. Zhang, A fast three-step phase shifting algorithm. Appl. Opt. 45(21), 5086–5091 (2006)
20. J. Gühring, Dense 3-D surface acquisition by structured light using off-the-shelf components. Methods 3D Shape Meas. 4309, 220–231 (2001)
21. C. Brenner, J. Böhm, J. Gühring, Photogrammetric calibration and accuracy evaluation of a cross-pattern stripe projector, in Videometrics VI 3641 (SPIE, 1999), pp. 164–172
22. Z.J. Geng, Rainbow three-dimensional camera: new concept of high-speed three-dimensional vision systems. Opt. Eng. 35(2), 376–383 (1996)
23. K.L. Boyer, A.C. Kak, Color-encoded structured light for rapid active ranging. IEEE Trans. PAMI 9(1), 14–28 (1987)
24. M. Maruyama, S. Abe, Range sensing by projecting multiple slits with random cuts. IEEE Trans. PAMI 15(6), 647–651 (1993)
25. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
26. S.K. Nayar, Y. Nakagawa, Shape from focus. IEEE Trans. PAMI 16(8), 824–831 (1994)


27. A.P. Pentland, A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4), 523–531 (1987)
28. M. Subbarao, Efficient depth recovery through inverse optics, in Machine Vision for Inspection and Measurement (Academic Press, 1989), pp. 101–126
29. E. Krotkov, Focusing. J. Comput. Vis. 1, 223–237 (1987)
30. M. Subbarao, T.S. Choi, Accurate recovery of three dimensional shape from image focus. IEEE Trans. PAMI 17(3), 266–274 (1995)
31. S. Pertuza, D. Puiga, M.A. Garcia, Analysis of focus measure operators for shape-from-focus. Pattern Recognit. 46, 1415–1432 (2013)
32. C. Rajagopalan, Depth recovery from defocused images, in Depth From Defocus: A Real Aperture Imaging Approach (Springer, New York, 1999), pp. 14–27
33. V.P. Namboodiri, C. Subhasis, S. Hadap, Regularized depth from defocus, in ICIP (2008), pp. 1520–1523
34. A.N. Rajagopalan, S. Chaudhuri, An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images. IEEE Trans. PAMI 21(7), 577–589 (1999)
35. P. Favaro, S. Soatto, M. Burger, S. Osher, Shape from defocus via diffusion. IEEE Trans. PAMI 30(3), 518–531 (2008)

6 Motion Analysis

6.1 Introduction

So far we have considered the objects of the world and the observer both stationary, that is, not in motion. We are now interested in studying a vision system capable of perceiving the dynamics of the scene, in analogy to what happens in the vision systems of various living beings. We are aware that these biological vision systems require remarkable computing capabilities, instant by instant, to realize visual perception through a symbolic description of the scene, deriving various depth and shape information with respect to the objects of the scene. For example, in the human visual system, the dynamics of the scene is captured by binocular stereo images, slightly different in time, acquired simultaneously by the two eyes and adequately combined to produce a single 3D perception of the objects of the scene. Furthermore, by observing the scene over time, it is able to reconstruct the scene completely, differentiating moving 3D objects from stationary ones. In essence, it realizes the visual tracking of moving objects, deriving useful qualitative and quantitative information on the dynamics of the scene. This is possible given the capacity of biological systems to manage spatial and temporal information through different elementary processes of visual perception, adequate and fundamental for the interaction with the environment. The temporal dimension in visual processing plays a role of primary importance for two reasons:

1. the apparent motion of the objects in the image plane is an indication to understand the structure and the 3D motion;
2. biological visual systems use the information extracted from time-varying image sequences to derive properties of the 3D world with little a priori knowledge of it.

Motion analysis has long been treated as a specialized field of research that had nothing to do with image processing in general, for two reasons:


1. the techniques used to analyze movement in image sequences were quite different;
2. the large amount of memory and computing power required to process image sequences made this analysis available only to specialized research laboratories that could afford the necessary resources.

These two reasons no longer exist, because the methods used in motion analysis do not differ from those used for image processing, and the image sequence analysis algorithms can also be executed by ordinary personal computers. The perception of movement, in analogy to other visual processes (color, texture, contour extraction, etc.), is an inductive visual process. Visual photoreceptors derive motion information by evaluating the variations in light intensity of the 2D (retinal) image formed from the observed 3D world. The human visual system adequately interprets these changes in brightness in time-varying image sequences to realize the perception of moving 3D objects in the scene. In this chapter, we will describe how it is possible to derive 3D motion, almost in real time, from the analysis of time-varying 2D image sequences. Some studies on the analysis of movement have shown that the perception of movement derives from information about the objects obtained by evaluating the presence of occlusions, texture, contours, etc. Psychological studies have shown that the visual perception of movement is based on the activation of neural structures. In some animals, it has been shown that the lesion of some parts of the brain leads to the inability to perceive movement. These losses of visual perception of movement were not associated with the loss of visual perception of color and of the sensitivity to perceive different patterns. This suggests that some parts of the brain are specialized for movement perception (see Sect. 4.6.4 describing the functional structure of the visual cortex). We are interested in studying the perception of movement that occurs in physical reality and not the apparent movement. A typical example of apparent motion is observed in advertising light panels, in which sequences of luminous zones light up and switch off at different times with respect to others that always remain on. Other examples of apparent motion can be produced by varying the color or luminous intensity of some objects. Figure 6.1 shows two typical examples of movement: in photo (a) the object of interest moves sideways while the person taking the picture remains still, while in photo (b) it is the observer who moves toward the house. The images (a) and (b) of Fig. 6.2 show two images (normally in focus) of a video sequence acquired at a frequency of 50 Hz. Some differences between the images are evident at a first direct comparison. If we subtract the images from each other, the differences become immediately visible, as seen in Fig. 6.2c. In fact, the dynamics of the scene indicates the movement of the players near the goal. From the difference image (c), it can be observed that all the parts not in motion are black (in the two images their intensity remained constant), while the moving parts are well highlighted, and the different speeds of the moving parts can even be appreciated (for example, the goalkeeper moves less than the other players). Even from this qualitative description, it is obvious that motion analysis helps us considerably in understanding the dynamics of a scene.


Fig. 6.1 Qualitative motion captured from a single image. Example of lateral movement captured in the photo (a) by a stationary observer, while in photo (b) the perceived movement is of the observer moving toward the house

Fig. 6.2 Pair of images of a sequence acquired at a frequency of 50 Hz. The dynamics of the scene indicates the approach of the players toward the ball. The difference of images (a) and (b) is shown in (c)

All the stationary parts of the scene appear dark, while the difference values enhance the moving parts (the areas with brighter pixels). We can summarize what has been said by affirming that the movement (of the objects in the scene or of the observer) can be detected from the temporal variation of the gray levels; unfortunately, the inverse implication is not valid, that is, not all changes in gray levels are due to motion. This last aspect depends on a possible simultaneous change of the lighting conditions while the images are acquired. In fact, in Fig. 6.2 the scene is well illuminated by the sun (the shadows projected on the playing field are clearly visible), but if a cloud suddenly changed the lighting conditions, it would not be possible to derive motion information from the time-varying image sequence, because the gray levels of the images would also change due to the change in lighting. A possible analysis of the dynamics of the scene, based on the difference of space–time-varying images, would thus produce artifacts. In other words, we can say that from a sequence of space–time-varying images f(x, y, t) it is possible to derive motion information by analyzing


the gray-level variations between pairs of images in the sequence. Conversely, it cannot be said that any change in the levels of gray over time can be attributed to the motion of the objects in the scene.
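To make the image-difference test concrete, a minimal sketch is given below; the threshold value is an arbitrary assumption used only to suppress sensor noise.

```python
import numpy as np

def motion_mask(frame_t, frame_t1, threshold=15):
    """Candidate moving regions as in Fig. 6.2c: absolute difference of two
    consecutive gray-level frames, thresholded to discard small noise variations."""
    diff = np.abs(frame_t1.astype(np.int32) - frame_t.astype(np.int32))
    return diff > threshold        # True where the intensity changed significantly
```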

6.2 Analogy Between Motion Perception and Depth Evaluated with Stereo Vision

Experiments have shown the analogy between motion perception and the perception of the depth (distance) of objects derived from stereo vision (see Sect. 4.6.4). It can easily be observed that, for an observer in motion with respect to an object, in some areas of the retina there are local variations of luminous intensity, deriving from the motion, which contain visual information that depends on the distance of the observer from the various points of the scene. The variations in brightness on the retina change in a predictable manner in relation to the direction of motion of the observer and the distance between objects and observer. In particular, objects that are farther away generally appear to move more slowly than objects closer to the observer. Similarly, points along the direction of motion of the observer move slowly with respect to points that lie in other directions. From the variations of luminous intensity perceived on the retina, the position information of the objects in the scene with respect to the observer is derived. The motion field image is calculated from the motion information derived from a sequence of time-varying images. Gibson [1] defined an important algorithm that correlates the perception of movement and the perception of distance. The algorithm calculates the flow field of the movement, and procedures are proposed to extract the ego-motion information, that is, to derive the motion of the observer and the depth of the objects from the observer by analyzing the information of the flow field of the movement. Motion and depth estimation have the same purpose and use similar types of perceptive stimuli. The stereo vision algorithms use the information of the two retinas to recover the depth information, based on the diversity of the images obtained by observing the scene from slightly different points of view. Motion detection algorithms use instead coarser information deriving from image sequences that show slight differences between them, due to the motion of the observer or to the relative motion of the objects with respect to the observer. As an alternative to evaluating the variations of gray levels for motion estimation, in analogy with the correspondence problem of stereo vision, some characteristic elements (features) corresponding to the same objects of the observed scene can be identified in the sequence images, and the spatial difference of these features caused by the movement of the object can be evaluated in the image. While in stereo vision this spatial difference (called disparity) of a feature is due to the different observation positions of the scene, in the case of a sequence of time-varying images acquired by the same stationary observer, the disparity is determined by consecutive images acquired at different times.


The temporal frequency of image acquisition, normally realized with standard cameras, is 50 or 60 images per second (often indicated in fps—frames per second), i.e., the acquisition of an image every 1/50 or 1/60 of a second. There are nonstandard cameras that reach over 1000 images per second, with different spatial resolutions. The values of spatial disparity, in the case of time-varying image sequences, are relatively smaller than those of stereo vision. The correspondence problem is caused by the inability to unambiguously find corresponding points in two consecutive images of a sequence. A first example occurs in the case of deformable objects or in the case of small particles moving in a flow field. Given their intrinsic relative motion, it is not possible to make any estimation of the displacements, because there are no visible features that we can easily determine together with their relative motion, which changes continuously. We could assume that the correspondence problem does not exist for rigid objects that show variations of the gray levels not due to a change in lighting. In general, this assumption is not robust, since even for rigid objects there is ambiguity when one wants to analyze the motion of periodic and nonperiodic structures, with a local operator unable to find uniquely the correspondence of these structures because of its partial view of them. In essence, since it is not possible to fully observe these structures, the detection of the displacement would be ambiguous. This subproblem of correspondence is known in the literature as the aperture problem, which we will analyze in the following sections. At a higher level of abstraction, we can say that physical correspondence, i.e., the actual correspondence of real objects, may not be identical to the visual correspondence in the image. This problem has two aspects:

1. we can find a visual correspondence without the existence of a physical correspondence, as in the case of indistinguishable objects;
2. a physical correspondence does not generally imply a visual correspondence, as in the case in which we are not able to recognize a visual correspondence due to variations in lighting.

In addition to the correspondence problem, motion presents a further subproblem, known as reconstruction, defined as follows: given a number of corresponding elements and, possibly, knowledge of the intrinsic parameters of the camera, what can be said about the 3D motion and the observed world structure? The methods used to solve the correspondence and reconstruction problems are based on the following assumption: there is only one, rigid, relative motion between the camera and the observed scene, and moreover the lighting conditions do not change.


This assumption implies that the observed 3D objects cannot move according to different motions. If the dynamics of the scene consists of multiple objects with movements different from that of the observer, another problem has to be considered: the flow image of the movement must be segmented to select the individual regions that correspond to the different objects with different motions (the segmentation problem).

6.3 Toward Motion Estimation

In Fig. 6.3, we observe a sequence of images acquired with a very small time interval Δt, such that the difference between each consecutive pair of images of the sequence is minimal. This difference between the images depends on the variation of the geometric relationships between the observer (for example, a camera), the objects of the scene, and the light source. It is these variations, determined in each pair of images of the sequence, that are at the base of the motion estimation and stereo vision algorithms. In the example shown in the figure, the dynamics of the ball is the main objective of the analysis of the sequence of images: once detected in the spatial domain of an image, the ball is tracked in the successive images of the sequence (time domain) while it approaches the goal, in order to detect the Goal–NoGoal event. Let P(X, Y, Z) be a 3D point of the scene projected (with the pinhole model) at time t·Δt onto a point p(x, y) of the image I(x, y, t·Δt) (see Fig. 6.4). If the 3D motion of P is linear with velocity V = (V_X, V_Y, V_Z), in the time interval Δt it will move by V·Δt to Q = (X + V_X·Δt, Y + V_Y·Δt, Z + V_Z·Δt). The motion of P with velocity V induces the motion of p in the image plane with velocity v = (v_x, v_y), moving to the point q(x + v_x·Δt, y + v_y·Δt) of the image I(x, y, (t+1)·Δt). The apparent motion of the intensity variation of the pixels is called optical flow v = (v_x, v_y).


Fig. 6.3 Goal–NoGoal detection. Sequence of space–time-variant images and motion field calculated on the last images of the sequence. The main object of the scene dynamics is only the motion of the ball



Fig. 6.4 Graphical representation of the formation of the velocity flow field produced on the retina (by perspective projection) generated by the motion of an object of the scene considering the observer stationary

Fig. 6.5 Different types of ideal motion fields induced by the relative motion between observer and scene: moving toward the object, moving away from the object, rotation, and translation from right to left

Figure 6.5 shows ideal examples of optical flow with different vector fields generated by various types of motion of the observer (with uniform speed) which, with respect to the scene, approaches or moves away, or moves laterally from right to left, or rotates the head. The flow vectors represent an estimate of the variations of points in the image that occur in a limited space–time interval. The direction and length of each flow vector corresponds to the local motion entity that is induced when the observer moves with respect to the scene or vice versa, or both move. The projection of velocity vectors in the image plane, associated with each 3D point of the scene, defines the motion field. Ideally, motion field and optical flow should coincide. In reality, this is not true since the motion field associated with an optical flow can be caused by an apparent motion induced by the change of the lighting conditions and not by a real motion or by the aperture problem mentioned above. An effective example is given by the barber pole1 illustrated in Fig. 6.6, where the real motion of the cylinder is circular and the perceived optical flow is a vertical

1 Panel used in the Middle Ages by barbers. On a white cylinder, a red ribbon is wrapped helically. The

cylinder continuously rotates around its vertical axis and all the points of the cylindrical surface move horizontally. It is observed instead that this rotation produces the illusion that the red ribbon moves vertically upwards. The motion is ambiguous because it is not possible to find corresponding points in motion in the temporal analysis, as shown in the figure. Hans Wallach, a psychologist, discovered in 1935 that the illusion is weaker if the cylinder is shorter and wider, and the perceived motion is correctly lateral. The illusion is also resolved if texture is present on the ribbon.



Fig. 6.6 Illusion with the barber pole. The white cylinder, with the red ribbon wrapped helically, rotates clockwise but the stripes are perceived to move vertically upwards. The perceived optical flow does not correspond to the real motion field, which is horizontal from right to left. This illusion is caused by the aperture problem or by the ambiguity of finding the correct correspondence of points on the edge of the tape (in the central area when observed at different times) since the direction of motion of these points is not determined uniquely by the brain

motion field when the real one is horizontal. The same happens when, sitting in a stationary train and observing another adjacent train, we have the feeling that we are moving when instead it is the other train that is moving. This happens because we have a limited opening from the window and we do not have precise references to decide which of the two trains is really in motion. Returning to the sequence of space–time-varying images I(x, y, t), the information content captured about the dynamics of the scene can effectively be analyzed in a space–time graph. For example, with reference to Fig. 6.3, we could analyze in the space–time diagram (t, x) the dynamics of the ball structure (see Fig. 6.7). In the sequence of images, the dominant motion of the ball is horizontal (along the x axis), moving toward the goalposts. If the ball is stationary, its position in the time-varying images does not change, and in the diagram this state is represented by a horizontal line. When the ball moves at a constant speed, the trace described by its center of mass is an oblique straight line and its inclination with respect to the time axis depends on the speed of the ball, given by

\nu = \frac{\Delta x}{\Delta t} = \tan(\theta) \qquad (6.1)

where θ is the angle between the time axis t and the direction of movement of the ball given by the mass centers of the ball located in the time-varying images of the sequence or from the trajectory described by the displacement of the ball in the images of the sequence. As shown in Fig. 6.7, a moving ball is described in the plane (t, x) with an inclined trajectory while for a stationary ball, in the sequence




Fig. 6.7 Space–time diagram of motion information. In the diagram on the left the horizontal line indicates stationary motion while on the right the inclined line indicates motion with uniform speed along the x axis

of images, the gray levels associated with the ball do not vary, and therefore in the plane (t, x) we will see the trace of the motion of the ball which remains horizontal (constant gray levels). In other words, we can say that in the space–time (x, y, t) the dynamics of the scene is estimated directly from the orientation in continuous space–time (t, x) and not as discrete shifts by directly analyzing two consecutive images in the space (x, y). Therefore, the motion analysis algorithms should be formulated in the continuous space–time (x, y, t) for which the level of discretization to adequately describe motion becomes important. In this space, observing the trace of the direction of motion and how it is oriented with respect to the time axis, an estimate of the speed is obtained. On the other hand, by observing only the motion of some points of the contour of the object, the orientation of the object itself would not be univocally obtained.

6.3.1 Discretization of Motion

In physical reality, the motion of an object is described by continuous trajectories and characterized by its relative velocity with respect to the observer. To adequately capture the dynamics of the object, a vision machine will have to acquire a sequence of images with a temporal resolution correlated to the speed of the object itself. In particular, the acquisition system must operate with a suitable sampling frequency, in order to optimally approximate the continuous motion described by the object. For example, to capture the motion of a car in a high-speed race (300 km/h) it is necessary to use vision systems with a high temporal acquisition frequency (for example, cameras with acquisition speeds even higher than 1000 images per second; in the literature this rate is also known as the frame rate, indicating the number of images acquired in a second). Another reason that affects the choice of the temporal sampling frequency concerns the need to display stable images. In the television and film industry, the most advanced technologies are used to reproduce dynamic scenes, obtaining excellent results in the stability of television images, with the dynamics of the scene reproduced with the appearance of continuous movement of the objects.



Fig. 6.8 Goal–NoGoal event detection. a The dynamics of the scene is taken for each goal by a pair of cameras with high temporal resolution arranged as shown in the figure on the opposite sides with the optical axes (aligned with the Z -axis of the central reference system (X, Y, Z )) coplanar with the vertical plane α of the goal. The significant motion of the ball approaching toward the goal box is in the domain time–space (t, x), detected with the acquisition of image sequences. The 3D localization of the ball with respect to the central reference system (X, Y, Z ) is calculated by the triangulation process carried out by the relative pair of opposite and synchronized cameras, suitably calibrated, with respect to the known positions of the vertical plane α and the horizontal goal area. b Local reference system (x, y) of the sequence images

For example, the European television standard uses a temporal frequency of 25 Hz, displaying 25 static images per second. The display technology, combined with the persistence of the image on the retina, performs two video scans to present a complete image: in the first scan the even horizontal lines are drawn (first field) and in the second scan the odd horizontal lines (second field). Altogether there is a temporal frequency of 50 fields per second to reproduce 25 static images, corresponding to a spatial resolution of 625 horizontal lines. To improve stability (i.e., to attenuate the flickering of the image on the monitor), the reproduced video images are available with digital-technology monitors that reach a temporal frequency even higher than 1000 Hz. Considering this, let us now examine the criterion with which to choose the temporal frequency of image acquisition. If the motion between observer and scene is null, a single image is sufficient to reproduce the scene. If instead we want to reproduce the dynamics of a sporting event, as in the game of football, to determine whether the ball has completely crossed the vertical plane of the goalposts (in which case we have the goal event), it is necessary to acquire a sequence of images with a temporal frequency adequate to the speed of the ball. Figure 6.8 shows the arrangement of the cameras used to acquire the sequence of images of the Goal–NoGoal event [2–5]. In particular, the cameras are positioned at the corners of the playing field with the optical axis coplanar with the vertical plane passing through the inner edges of the goalposts (posts and crossbar). Each pair of opposite cameras is synchronized and simultaneously observes the dynamics of the scene. As soon as the ball enters the field of view of the cameras, it is located and its tracking begins in the two image sequences, to monitor the approach of the ball toward the goal box.


The goal event occurs only when the ball (about 22 cm in diameter) completely crosses the goal box (plane α), that is, when it completely passes the posts, the crossbar, and the goal line inside the goal, as shown in the figure. The ball can reach a speed of 120 km/h. To capture the dynamics of the goal event, it is necessary to acquire a sequence of images observing the scene as shown in the figure, from which it emerges that the significant and dominant motion is the lateral one: the trajectory of the ball moving toward the goal is almost always orthogonal to the optical axes of the cameras. Figure 6.8 shows the dynamics of the scene being acquired by discretely sampling the motion of the ball over time. As soon as the ball appears in the scene, it is detected in an image I(x, y, t1) of the time-varying sequence at time t1, and its lateral motion is tracked in the spatial domain (x, y) of the images of the sequence acquired in real time, with the frame rate defined by the camera characterizing the level of temporal discretization Δt of the dynamics of the event. In the figure, we can observe the 3D and 2D discretized trajectory of the ball for some consecutive images of the two sequences captured by the two opposite and synchronized cameras. It is also observed that the ball positions are spatially spaced (in consecutive images of the sequence) by a value inversely proportional to the frame rate. We will now analyze the impact of temporal sampling on the dynamics of the goal event that we want to detect. In Fig. 6.9a, the whole sequence of images (related to one camera) is represented, where it is observed that the ball moves toward the goal with a uniform speed v, leaving a cylindrical track of diameter equal to the real dimensions of the ball. In essence, in the 3D space–time of the sequence I(x, y, t), the dynamics of the scene is graphically represented by a parallelepiped where the images of the sequence I(x, y, t), varying over time t, are stacked. Figure 6.9c, which is a section (t, x) of the parallelepiped, shows the dominant and significant signal of the scene, that is, the trajectory of the ball, useful to detect the goal event if the ball crosses the goal (i.e., the vertical plane α). In this context, the space–time diagram (t, y) is used to


Fig. 6.9 Parallelepiped formed by the sequence of images that capture the Goal–NoGoal event. a The motion of the ball moving at uniform speed is represented in this 3D space–time by a slanted cylindrical track. b A cross section of the parallelepiped, i.e., the image plane (x, y) of the sequence at the t-th instant, shows the current position of the moving entities. c A section (t, x) of the parallelepiped at a given height y represents the space–time diagram that contains the significant information of the motion structure


indicate the position of the cylindrical track of the ball with respect to the goal box (see Fig. 6.9b). Therefore, according to Fig. 6.8b, we can affirm that from the time–space diagram (t, x), we can detect the image tgoal of the sequence in which the ball crossed the vertical plane α of the goal box with the x-axis indicating the horizontal position of the ball, useful for calculating its distance from the plane α with respect to the central reference system (X, Y, Z ). From the time–space diagram (t, y) (see Fig. 6.9b), we obtain instead the vertical position of the ball, useful to calculate the coordinate Y with respect to the central reference system (X, Y, Z ). Determined the centers of mass of the ball in the synchronized images IC1 (tgoal , xgoal , ygoal ) and IC2 (tgoal , xgoal , ygoal ) relative to the opposite cameras C1 and C2 we have the information that the ball has crossed the plane α useful to calculate the horizontal coordinate X of the central reference system but, to detect the goal event, it is now necessary to determine if the ball is in the goal box evaluating its position (Y, Z ) in the central reference system (see Fig. 6.8a). This is possible through the triangulation between the two cameras having previously calibrated them with respect to the known positions of the vertical plane α and the horizontal goal area [2]. It should be noted that in Fig. 6.9 in the 3D space–time representation, for simplicity the dynamics of the scene is indicated assuming a continuous motion even if the images of the sequence are acquired with a high frame rate, and in the plane (t, x) the resulting trace is inclined by an angle θ with a value directly proportional to the speed of the object. Now let’s see how to adequately sample the motion of an object to avoid the phenomenon known as time aliasing, which introduces distortions in the signal due to an undersampling. With the technologies currently available, once the spatial resolution2 is defined, the continuous motion represented by Fig. 6.9c can be discretized with a sampling frequency that can vary from a few images to thousands of images per second. In Fig. 6.10 is shown the relation between the speed of the ball and the displacement of the ball in the time interval of acquisition between two consecutive images that remains constant during the acquisition of the entire sequence. It can be observed that for low values of the frame rate, the displacement of the object varies from meters to a few millimeters, i.e., the displacement decreases with the increase of the sampling time frequency. For example, for a ball with the speed of 120 km/h the acquisition of

2 We recall that we also have the phenomenon of the spatial aliasing already described in Sect. 5.10 Vol. I. According to the Shannon–Nyquist theorem, a sampled continuous function (in the time or space domain) can be completely reconstructed if (a) the sampling frequency is equal to or greater than twice the frequency of the maximum spectral component of the input signal (also called Nyquist frequency) and (b) the spectrum replicas are removed in the Fourier domain, remaining only the original spectrum. The latter process of removal is the anti-aliasing process of signal correction by eliminating spurious space–time components.


Fig. 6.10 Relationship between the speed and the displacement of an object as the temporal sampling frequency of the sequence images I(x, y, t) changes. The plot shows the object displacement (mm) versus the object velocity (km/h) for frame rates of 25, 50, 120, 260, 400, and 1000 fps

the sequence with a frame rate of 400 fps, the displacement of the ball in the direction of the x-axis is 5 mm. In this case, temporal aliasing would occur, with the ball appearing in the sequence images as an elongated ellipsoid, and one could hardly estimate the motion. Temporal aliasing also generates the propeller effect observed on television video, where the propeller of an airplane seems to rotate in the direction opposite to the real one. Figure 6.11 shows how the continuous motion represented in Fig. 6.9c is well approximated by sampling the dynamics of the scene with a very high temporal sampling frequency, while, as the level of discretization of the trajectory of the ball decreases, the sampled trajectory departs considerably from the continuous motion.
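The relation of Fig. 6.10 between speed, frame rate, and inter-frame displacement amounts to a simple unit conversion, sketched below for reference; the example values are illustrative.

```python
def displacement_per_frame_mm(speed_kmh, frame_rate_fps):
    """Displacement of an object between two consecutive frames, in mm,
    i.e., the quantity plotted in Fig. 6.10: (speed in mm/s) / (frames per second)."""
    speed_mm_per_s = speed_kmh * 1e6 / 3600.0     # km/h -> mm/s
    return speed_mm_per_s / frame_rate_fps

# e.g., displacement_per_frame_mm(120, 1000) is about 33 mm between consecutive frames
```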


Fig. 6.11 Analysis in the Fourier domain of the effects of time sampling on the 1D motion of an object. a Continuous motion of the object in the time–space domain (t, x) and in the corresponding Fourier domain (u t , vx ); in b, c and d the analogous representations are shown with a time sampling which is decreased by a factor 4


The speed of the object, in addition to producing an increase in its displacement between frames, also produces an increase in the inclination of the motion trajectory in the time–space domain (t, x); a horizontal trajectory would mean that the object is stationary relative to the observer. Let us now analyze the possibility of finding a method [6] that can quantitatively evaluate a good compromise between spatial and temporal resolution, in order to choose a temporal sampling frequency for which the sampled dynamics of a scene can be considered an acceptable approximation of the continuous motion, thus avoiding the aliasing problem. A possible method is based on Fourier analysis. The spatial and temporal frequencies described in the (t, x) domain can be projected into the Fourier frequency domain (u_t, v_x). In this domain, the input signal I(t, x) is represented in terms of spatial and temporal frequencies (u_t, v_x), which best highlights the effects of sampling at different frequencies. In the graphical representation (t, x), the dynamics of the scene is represented by the trace described by the ball and by its inclination with respect to the horizontal time axis, which depends on the speed of the ball. In the Fourier domain (u_t, v_x), the motion of the ball is still a rectilinear strip (see the second row of Fig. 6.11), whose inclination is evaluated with respect to the temporal frequency axis u_t. From the figure, it can be observed that if the ball is stationary, the energy in the Fourier space is concentrated at zero temporal frequency (no temporal modulation is generated). As soon as the ball moves with speed ν, each spatial frequency component v_x (expressed in cycles/degree) associated with the ball is translated at that speed and generates a temporal modulation at the frequency:

(6.2)

with u_t expressed in cycles/s. The graphical representation in the Fourier domain (u_t, v_x) is shown in the second row of Fig. 6.11, where it can be seen how the spectrum changes from the continuous-motion case as the sampling frequency decreases, and how its inclination with respect to the u_t axis depends on the speed of the object. The motion of the ball at different time sampling frequencies is shown in Fig. 6.11b, c, and d. It is observed that the replicas of the spectrum are spaced apart in proportion to the temporal sampling frequency (i.e., inversely proportional to the sampling interval). At high temporal sampling frequencies there is a wide spacing of the spectrum replicas; vice versa, at low temporal sampling frequencies the distance between the replicas decreases. In essence, the sampled motion generates replicas of the spectrum of the original continuous motion along the frequency axis u_t, and these replicas are easily distinguishable. It is possible to define a temporal frequency U_t above which the sampling of the scene dynamics can be considered acceptable, resulting in a good approximation of the continuous motion of the observed scene. When the replicas fall outside this threshold value (as in Fig. 6.11b) and are not detectable, the continuous motion and the sampled one are indistinguishable. The circular area of the spectrum (known as the visibility area) indicated in the figure encloses the spatiotemporal frequencies to which the human visual system is sensitive: it can detect temporal frequencies below 60 Hz (which explains the refresh frequency of television devices, 50 Hz in Europe and 60 Hz in America) and spatial frequencies below 60 cycles/degree. The visibility area is very useful for evaluating the correct value of the temporal sampling frequency of image sequences. If the replicas are inside the visibility window, they are the cause of a very coarse motion sampling that produces visible distortions, such as flickering, when the sequence of images is observed on a monitor. Conversely, if the replicas fall outside the visibility area, the corresponding sampling rate is adequate to produce a distortion-free motion. Finally, it should be noted that it is possible to estimate from the signal (t, x), in the Fourier domain, the speed of the object by calculating the slope of the line on which the spectrum of the sequence lies. The slope of the spectrum cannot be uniquely calculated if the spectrum is not well distributed linearly; this occurs when the gray-level structures are oriented in a nonregular way in the spatial domain.

For a vision machine in general, the spatial and temporal resolution is characterized by the resolution of the sensor, expressed in pixels each with a width W (normally expressed in microns, μm), and by the time interval T = 1/f_tp between images of the sequence, defined by the frame rate f_tp of the camera. According to the Shannon–Nyquist sampling theorem, it is possible to define the maximum admissible temporal and spatial frequencies as follows:

u_t ≤ π/T        v_x ≤ π/W                    (6.3)

Combining Eqs. (6.2) and (6.3), we can estimate the maximum value of the horizontal component of the optical flow ν:

ν = u_t / v_x ≤ (π/T) / (π/W) = W/T                    (6.4)

defined as a function of both the spatial and the temporal resolution of the acquisition system. Even if the horizontal optical flow is determinable at pixel resolution, in practice it depends above all on the spatial resolution with which the object is located in each image of the sequence (i.e., on the object detection algorithms, which are in turn influenced by noise) and on the accuracy with which the spatial and temporal gradients are estimated.
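As a rough numerical illustration of Eq. (6.4), a minimal sketch in Python; the sensor values below are assumed for the example only, not taken from the text:

# Hypothetical acquisition parameters, only to illustrate the bound of Eq. (6.4)
W = 10e-6            # assumed pixel width: 10 microns
frame_rate = 25.0    # assumed frame rate of a standard camera (images/s)
T = 1.0 / frame_rate            # time interval between images (40 ms)
v_max = W / T                   # maximum image-plane speed without temporal aliasing
print(f"v_max = {v_max*1e3:.2f} mm/s in the image plane, i.e., one pixel width per frame")

In other words, under these assumed values, an image-plane displacement larger than one pixel per frame can no longer be represented without temporal aliasing at the highest spatial frequency.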

6.3.2 Motion Estimation—Continuous Approach

In the continuous approach, the motion of each pixel of the image is estimated, obtaining a dense map of speed estimates by evaluating the local variations of intensity, in terms of spatial and temporal variations, between consecutive images. These speed measurements represent the apparent motion, in the two-dimensional image plane, of the 3D points of the moving scene projected onto the image plane. In this context it is assumed that the objects of the scene are rigid, that is, their points all move at the same speed, and that, during observation, the lighting conditions do not change.

Fig. 6.12 2D motion field generated by the perspective projection (pin-hole model) of 3D motion points of a rigid body


With this assumption, we analyze the two terms motion field and optical flow. The motion field represents the 2D velocity induced in the image plane (observer) by the 3D motion of the observed object. In other words, the motion field is the apparent 2D velocity, in the image plane, of the real 3D motion of a point of the scene projected onto the image plane. Motion analysis algorithms aim to estimate, from a pair of images of the sequence and for each corresponding point of the image, the value of this 2D velocity (see Fig. 6.12). The velocity vector estimated at each point of the image plane indicates the direction of motion and the speed, which also depends on the distance between the observer and the observed objects. It should be noted immediately that the 2D projections of the 3D velocities of the scene points cannot be measured (acquired) directly by the usual acquisition systems, for example, a camera. Instead, information is acquired that approximates the motion field: the optical flow, computed by evaluating the variation of the gray levels in pairs of time-varying images of the sequence. The optical flow and the motion field can be considered coincident only if the following conditions are satisfied:

(a) the time interval between the acquisition of two consecutive images of the sequence is minimal;
(b) the gray-level function is continuous;
(c) the Lambertian conditions are maintained;
(d) the scene lighting conditions do not change during the acquisition of the sequence.

In reality, these conditions are not always maintained. Horn [7] has highlighted some remarkable cases in which the motion field and the optical flow are not equal. Two such cases are represented in Fig. 6.13.

First case: observing a stationary sphere with a homogeneous surface (of any material), an optical flow is induced when a light source moves in the scene. In this case, since the lighting conditions vary, condition (d) is violated: an optical flow is detected by analyzing the image sequence, because there is a variation in the gray levels, while the motion field is null, the sphere having been assumed stationary.

Second case: the sphere rotates around its own axis while the illumination remains constant, i.e., the conditions indicated above are maintained. From the analysis of the sequence, the induced optical flow is zero (no changes in the gray levels between consecutive images are observed), while the motion field is different from zero, since the sphere is actually in motion.

Fig. 6.13 Special cases of noncoincidence between optical flow and motion field. a A Lambertian stationary sphere induces optical flow when a light source moves, producing a change in intensity, while the motion field is zero, as it should be. b The sphere rotates while the light source is stationary. In this case, the optical flow is zero (no motion is perceived) while the motion field is produced, as it should be

6.3.3 Motion Estimation—Discrete Approach

In the discrete approach, the speed estimation is calculated only for some points of the image, thus obtaining a sparse map of velocity estimates. The correspondence in pairs of consecutive images is computed only at significant points of interest (SPI) (closed contours of zero crossings, windows with high variance, lines, texture, etc.). The discrete approach is used for both small and large displacements of the moving objects in the scene and when the constraints of the continuous approach cannot be maintained. In fact, in reality, not all the abovementioned constraints are satisfied (small shifts between consecutive images do not always occur and the Lambertian conditions are not always valid). The optical flow has the advantage of producing dense speed maps and is calculated independently of the geometry of the objects of the scene, unlike the other (discrete) approaches, which produce sparse maps and depend on the points of interest present in the scene. If the analysis of the movement is also based on a priori knowledge of some information about the moving objects of the scene, some assumptions are considered to better locate the objects:


Maximum speed. The position of the object in the next image, after a time Δt, can be predicted.

Homogeneous movement. All the points of the scene are subject to the same motion.

Mutual correspondence. Except for problems of occlusion and object rotation, each point of an object corresponds to a point in the next image and vice versa (non-deformable objects).

6.3.4 Motion Analysis from Image Difference

Let I_1 and I_2 be two consecutive images of the sequence; an estimate of the movement is given by the binary image d(i, j) obtained as the difference between the two consecutive images:

d(i, j) = 0   if |I_1(i, j) − I_2(i, j)| ≤ S
d(i, j) = 1   otherwise                    (6.5)

where S is a positive number indicating the threshold value above which the presence of movement in the observed scene is assumed. In the difference image d(i, j), the presence of motion is estimated in the pixels with value one. It is assumed that the images are perfectly registered and that the dominant variations of the gray levels are attributable to the motion of the objects in the scene (see Fig. 6.14). The difference image d(i, j), which contains the qualitative information on the motion, is strongly influenced by noise and cannot correctly determine the motion of very slow objects. The motion information in each point of the difference image d(i, j) is associated with the difference in gray levels between:

– adjacent pixels that correspond to pixels of moving objects and pixels that belong to the background;
– adjacent pixels that belong to different objects with different motions;
– pixels that belong to parts of the same object but at a different distance from the observer;
– pixels with gray levels affected by nonnegligible noise.

The value of the threshold S on the gray-level difference must be chosen experimentally after several attempts and possibly limited to very small regions of the scene.
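A minimal sketch of Eq. (6.5) in Python (NumPy), assuming the two frames are already grayscale arrays of equal size and already registered; the threshold value S used here is an arbitrary choice for the example:

import numpy as np

def difference_image(I1, I2, S=20):
    """Binary motion mask d(i, j) of Eq. (6.5): 1 where |I1 - I2| > S, 0 elsewhere."""
    diff = np.abs(I1.astype(np.int32) - I2.astype(np.int32))
    return (diff > S).astype(np.uint8)

As noted above, the result is only a qualitative indication of motion: the threshold S must be tuned experimentally and the two frames must be spatially registered.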

6.3.5 Motion Analysis from the Cumulative Difference of Images

The difference image d(i, j), obtained in the previous paragraph, qualitatively highlights moving objects (in the pixels with value 1) without indicating the direction of motion. This can be overcome by calculating the cumulative difference image d_cum(i, j), which contains the direction-of-motion information in cases where the objects are small and their movements limited. The cumulative difference d_cum(i, j) is evaluated considering a sequence of n images, whose initial image becomes the reference from which all the other images of the sequence are subtracted. The cumulative difference image is constructed as follows:

d_cum(i, j) = Σ_{k=1}^{n} a_k |I_1(i, j) − I_k(i, j)|                    (6.6)

where I_1 is the first image of the sequence, against which all the other images I_k are compared, and a_k is a coefficient with increasingly higher values to give more weight to the most recently accumulated image of the sequence; consequently, it highlights the location of the pixels associated with the current position of the moving object (see Fig. 6.14).

Fig. 6.14 Motion detected with the difference of two consecutive images of the sequence and result of the accumulated differences with Eq. (6.6)

The cumulative difference can be calculated if the reference image I_1 is acquired when the objects in the scene are stationary, but this is not always possible. In the latter case, we try to learn experimentally the motion of the objects or, based on a motion prediction model, we build the reference image. In reality, an image with motion information in every pixel is not always of interest. Often, instead, it is interesting to know the trajectory, in the image plane, of the center of mass of the objects moving with respect to the observer. This means that in many applications it may be useful to first segment the initial image of the sequence, identify the regions associated with the moving objects, and then calculate the trajectories described by the centers of mass of the objects (i.e., of the identified regions). In other applications, it may be sufficient to identify in the first image of the sequence some characteristic points or areas (features) and then search for such features in each image of the sequence through a process of matching homologous features. The matching process can be simplified by knowing or learning the dynamics of the movement of the objects. In the latter case, tracking algorithms can be used to make the matching process more robust and to reduce the level of uncertainty in the evaluation of the motion and location of the objects. The Kalman filter [8,9] is often used as a solution to the tracking problem. In Chap. 6 Vol. II, the algorithms and the problems related to the identification of features and their search have been described, considering the aspects of noise present in the images, including the aspects of computational complexity that influence the segmentation and matching algorithms.
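Under the same assumptions as the previous sketch (grayscale NumPy frames), Eq. (6.6) can be accumulated over a list of frames; the linearly increasing coefficients a_k below are an arbitrary choice made only for this example:

import numpy as np

def cumulative_difference(frames):
    """Accumulate the weighted differences of Eq. (6.6) against the first frame."""
    ref = frames[0].astype(np.float64)
    d_cum = np.zeros_like(ref)
    for k, frame in enumerate(frames[1:], start=2):
        a_k = float(k)  # assumed increasing weight: more recent frames count more
        d_cum += a_k * np.abs(ref - frame.astype(np.float64))
    return d_cum

The growing weights make the most recent position of the moving object dominate the accumulated map, which is what allows the direction of motion to be read from d_cum.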

6.3.6 Ambiguity in Motion Analysis

The difference image, previously calculated, qualitatively estimates the presence of moving objects through a spatial and temporal analysis of the variation of the gray levels of the pixels over limited areas of the images (local spatiotemporal operators). For example, imagine observing the motion of the edge of an object through a circular window, as shown in Fig. 6.15a. The local operator, which calculates the gray-level variations near the edge visible through the circular window of limited size, cannot completely determine the motion of a point of the visible edge; it can only qualitatively assess that the visible edge has shifted over the time Δt, but it is not possible to have exact information on the plausible directions of motion indicated by the arrows. Each direction of movement indicated by the arrows produces the same effect of displacement of the visible edge into the final dashed position. This problem is often called the aperture problem. In the example shown, the only motion that can be evaluated correctly is the one in the direction perpendicular to the visible edge of the object. In Fig. 6.15b, instead, there is no ambiguity in the determination of the motion because, in the region of the local operator, the corner of the object is visible and the direction of motion is uniquely determined by the direction of the arrows. The aperture problem can be considered a special case of the correspondence problem. In Fig. 6.15a, the ambiguity induced by the aperture problem derives from the impossibility of finding the correspondence of homologous points of the visible edge in the two images acquired at times t and t + Δt. The same correspondence problem would also occur in two dimensions in the case of deformable objects. The aperture problem is also the cause of the wrong vertical motion perceived by humans with the barber's pole, described in Sect. 6.3.


Fig. 6.15 Aperture problem. a The figure shows the position of a line (edge of an object) observed through a small aperture at time t_1. At time t_2 = t_1 + Δt, the line has moved to a new position. The arrows indicate the possible line movements, which cannot be determined through a small aperture because only the component perpendicular to the line can be determined with the gradient. b In this example, again through a small aperture, we can see in two consecutive images the displacement of the corner of an object, with the determination of the direction of motion without ambiguity


6.4 Optical Flow Estimation

In Sect. 6.3, we introduced the concepts of motion field and optical flow. Now let us calculate the dense map of optical flow from a sequence of images to derive useful information on the dynamics of the objects observed in the scene. Recall that the variations of gray levels in the images of the sequence are not necessarily induced by the motion of the objects, which, instead, is always described by the motion field. We are interested in calculating the optical flow in the conditions in which it can be considered a good approximation of the motion field. The motion estimation can be carried out assuming that, in small regions of the images of the sequence, the presence of moving objects causes a variation of the luminous intensity at some points without the light intensity of those points varying appreciably: the constraint of continuity of the light intensity of moving points. In reality, we know that this constraint is violated as soon as the position of the observer changes with respect to the objects, or vice versa, and as soon as the lighting conditions change. In practice, this constraint can be considered acceptable by acquiring sequences of images with an adequate temporal resolution (a normal camera acquires sequences of images with a time resolution of 1/25 of a second) and by evaluating the brightness variations in the images through the spatial and temporal gradient constraint, used to extract useful information on the motion.

The generic point P of a rigid body (see Fig. 6.12) that moves with velocity V with respect to a reference system (X, Y, Z) is projected, through the optical system, onto the image plane at the position p, with respect to the coordinate system (x, y) attached to the image plane, and moves in this plane with an apparent velocity v = (v_x, v_y), which is the projection of the velocity vector V. The motion field is the set of velocity vectors v = (v_x, v_y) projected by the optical system onto the image plane, generated by all the points of the visible surface of the moving rigid object. An example of motion field is shown in Fig. 6.16. In reality, the acquisition systems (for example, a camera) do not determine the 2D measurement of the apparent velocity in the image plane (i.e., they do not directly measure the motion field), but record, in a sequence of images, the brightness variations of the scene, in the hypothesis that they are due to the dynamics of the scene. Therefore, it is necessary to find the physical-mathematical model that links the perceived gray-level variations with the motion field. We indicate with:

(a) I(x, y, t) the acquired sequence of images, representing the gray-level information in the image plane (x, y) at time t;
(b) (I_x, I_y) and I_t, respectively, the spatial variations (with respect to the axes x and y) and the temporal variation of the gray levels.

Suppose further that the space–time-variant image I(x, y, t) is continuous and differentiable, both spatially and temporally. Under the Lambertian continuity hypotheses, that is, that each point P of the object appears equally luminous from any direction of observation, and in the hypothesis of small movements, we can consider the brightness constant at every point of the scene.


Fig. 6.16 Motion field coinciding with the optical flow idealized in 1950 by Gibson. Each arrow represents the direction and speed (indicated by the length of the arrow) of surface elements visible in motion with respect to the observer or vice versa. Nearby elements move faster than those farther away. The 3D motion of the observer with respect to the scene can be estimated through the optical flow. This is the motion field perceived on the retina of an observer who moves toward the house in the motion situation of Fig. 6.1b

In these conditions, the brightness (irradiance) in the image plane I(x, y, t) remains constant in time and, consequently, the total derivative of the time-variant image with respect to time becomes null [7]:

I[x(t), y(t), t] = constant   ⟹   dI/dt = 0                    (6.7)

which expresses the constant irradiance constraint.



The dynamics of the scene is represented by the function I(x, y, t), which depends on the spatial variables (x, y) and on the time t. This implies that the value of the function I[x(t), y(t), t] varies in time at each position (x, y) of the image plane and, consequently, the partial derivative ∂I/∂t is distinct from the total derivative dI/dt. Applying the definition of total derivative to the function I[x(t), y(t), t], the expression (6.7) of the total derivative becomes

∂I/∂x · dx/dt + ∂I/∂y · dy/dt + ∂I/∂t = 0                    (6.8)

that is,

I_x v_x + I_y v_y + I_t = 0                    (6.9)

where the time derivatives dx/dt and dy/dt are the components of the motion field vector:

v(v_x, v_y) = v(dx/dt, dy/dt)




while the spatial derivatives of the image, I_x = ∂I/∂x and I_y = ∂I/∂y, are the components of the spatial gradient of the image ∇I(I_x, I_y). Equation (6.8), written in vector terms, becomes

∇I(I_x, I_y) · v(v_x, v_y) + I_t = 0                    (6.10)

which is the sought brightness continuity equation of the image, linking the information of brightness variation, i.e., the spatial gradient of the image ∇I(I_x, I_y) determined from the sequence of multi-temporal images I(x, y, t), and the motion field v(v_x, v_y), which must be estimated once the components I_x, I_y, I_t are evaluated. In such conditions, the motion field v(v_x, v_y) calculated in the direction of the spatial gradient of the image (I_x, I_y) adequately approximates the optical flow. Therefore, in these conditions, the motion field coincides with the optical flow.

The same Eq. (6.9) is reached by the following reasoning. Consider the generic point p(x, y) in an image of the sequence, which at time t has a luminous intensity I(x, y, t). The apparent motion of this point is described by the velocity components (v_x, v_y) with which it moves; in the next image, at time t + Δt, it will have moved to the position (x + v_x Δt, y + v_y Δt), and for the constraint of continuity of luminous intensity the following relation will hold (irradiance constancy constraint equation):

I(x, y, t) = I(x + v_x Δt, y + v_y Δt, t + Δt)                    (6.11)

Taylor's series expansion of Eq. (6.11) generates the following:

I(x, y, t) + Δx ∂I/∂x + Δy ∂I/∂y + Δt ∂I/∂t + · · · = I(x, y, t)                    (6.12)

Dividing by Δt, ignoring the terms above the first order and taking the limit for Δt → 0, the previous equation becomes Eq. (6.8), which is the equation of the total derivative dI/dt. For the brightness continuity constraint of the scene (Eq. 6.11) over a very small time, and for the constraint of spatial coherence of the scene (the points included in the neighborhood of the point under examination (x, y) move with the same speed during the unit time interval), we can consider equality (6.11) valid, which, substituted in Eq. (6.12), generates

I_x v_x + I_y v_y + I_t = ∇I · v + I_t ≅ 0                    (6.13)

which represents the gradient constraint equation already derived above (see Eqs. (6.9) and (6.10)). Equation (6.13) constitutes a linear relation between the spatial and temporal gradient of the image intensity and the apparent motion components in the image plane. We now summarize the conditions to which the gradient constraint Eq. (6.13) is subjected for the calculation of the optical flow:

1. It is subject to the constraint of preservation of the intensity of the gray levels during the time Δt for the acquisition of at least two images of the sequence. In real applications, we know that this constraint is not always satisfied, for example, in some regions of the image, in areas where edges are present, and when the lighting conditions change.

2. It is also subject to the constraint of spatial coherence, i.e., it is assumed that in the areas where the spatial and temporal gradient is evaluated, the visible surface belongs to the same object and all points move at the same speed, or vary slightly, in the image plane. This constraint too is violated in real applications in the regions of the image plane where there are strong depth discontinuities, due to the discontinuity between pixels belonging to the object and to the background, or in the presence of occlusions.

3. Considering, from Eq. (6.13), that

−I_t = I_x v_x + I_y v_y = ∇I · v                    (6.14)

it is observed that the variation of brightness I_t, at the same location of the image plane over the time Δt of acquisition of consecutive images, is given by the scalar product of the spatial gradient vector of the image ∇I and of the components of the optical flow (v_x, v_y) in the direction of the gradient ∇I. It is not possible to determine the component orthogonal to the direction of the gradient, i.e., in the direction normal to the direction of variation of the gray levels (due to the aperture problem). In other words, Eq. (6.13) shows that, once the spatial and temporal gradient of the image I_x, I_y, I_t is estimated from the two consecutive images, it is possible to calculate the motion field only in the direction of the spatial gradient of the image, that is, we can determine only the component v_n of the optical flow in the direction normal to the edge. From Eq. (6.14) it follows:

∇I · v = ‖∇I‖ · v_n = −I_t                    (6.15)

from which

v_n = (∇I · v)/‖∇I‖ = −I_t/‖∇I‖                    (6.16)

where v_n is the measure of the optical flow component that can be calculated in the direction of the spatial gradient, normalized with respect to the norm ‖∇I‖ = √(I_x² + I_y²) of the gradient vector of the image. If the spatial gradient ∇I is null (that is, there is no change in brightness such as occurs along a contour), it follows from (6.15) that I_t = 0 (that is, there is no temporal variation of irradiance), and therefore no motion information is available at the point under examination. If instead the spatial gradient is zero and the time gradient I_t ≠ 0, at this point the constraint of the optical flow is violated. This impossibility of observing the velocity components at the point under examination is known as the aperture problem, already discussed in Sect. 6.3.6 Ambiguity in motion analysis (see Fig. 6.13). From the brightness continuity Eqs. (6.13) and (6.16), also called brightness preservation, we highlight that at every pixel of the image it is not possible to determine


the optical flow (v_x, v_y) starting from the spatial and temporal gradient of the image (I_x, I_y, I_t), since there are two unknowns v_x and v_y and a single linear equation. It follows that Eq. (6.13) has multiple solutions and the gradient constraint alone cannot uniquely estimate the optical flow. Equation (6.16), instead, can only calculate the component of the optical flow in the direction of the variation of intensity, that is, of maximum variation of the spatial gradient of the image. Equation (6.13), of brightness continuity, can be represented graphically, in velocity space, as a motion constraint line, as shown in Fig. 6.17a, from which it is observed that all the possible solutions of (6.13) fall on this velocity constraint line. Once the spatial and temporal gradient is calculated at a pixel of the image, in the plane of the flow components (v_x, v_y) the velocity constraint line intersects the axes v_x and v_y, respectively, at the points (−I_t/I_x, 0) and (0, −I_t/I_y). It is also observed that only the optical flow component v_n can be determined. If the real 2D motion is the diagonal one (v̄_x, v̄_y), indicated by the red dot in Fig. 6.17a and by the dashed vector in Fig. 6.17b, the estimable motion is only the one given by its projection on the gradient vector ∇I. In geometric terms, the calculation of v_n with Eq. (6.16) is equivalent to calculating the distance d of the motion constraint line (I_x · v_x + I_y · v_y + I_t = 0) from the origin of the optical flow plane (v_x, v_y) (see Fig. 6.17a). This constraint means that the optical flow can be calculated only in areas of the image where edges are present. In Fig. 6.17c there are two velocity constraint lines obtained at two points of the image close to each other, for each of which the spatial and temporal gradient is calculated, generating the lines (1) and (2), respectively. In this way, we can reasonably hypothesize that at these two close points the local motion is identical (according to the constraint of spatial coherence) and can be determined geometrically as the intersection of the constraint lines, producing a good local estimate of the optical flow components (v_x, v_y). In general, to calculate both optical flow components (v_x, v_y) at each point of the image, for example in the presence of edges, the knowledge of the spatial and temporal derivatives I_x, I_y, I_t (estimated from the pair of consecutive images) is not sufficient using Eq. (6.13) alone. The optical flow is equivalent to the motion field only in the particular conditions defined above. To solve the problem in a general way, we can impose on Eq. (6.13) the constraint of spatial coherence, that is, locally, in the vicinity of the point (x, y) being processed, the velocity of the motion field does not change abruptly. The differential approach has limits in the estimation of the spatial gradient in the image areas where there are no appreciable variations of gray levels. This suggests calculating the velocity of the motion field considering windows of adequate size, centered on the point being processed, to satisfy the constraint of spatial coherence of validity of Eq. (6.15). This is also useful for mitigating errors in the estimated optical flow in the presence of noise in the image sequence.
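Before moving to methods that recover both components, a minimal sketch of the normal flow of Eq. (6.16) in Python (NumPy); the derivative estimates and the small constant eps guarding against division by zero are assumptions of this example, not prescriptions of the text:

import numpy as np

def normal_flow(I1, I2, eps=1e-6):
    """Normal component v_n = -It/||grad I|| of Eq. (6.16) from two consecutive frames."""
    I1 = I1.astype(np.float64); I2 = I2.astype(np.float64)
    Ix = np.gradient(I1, axis=1)      # spatial derivative along x (columns)
    Iy = np.gradient(I1, axis=0)      # spatial derivative along y (rows)
    It = I2 - I1                      # temporal derivative between the two frames
    norm = np.sqrt(Ix**2 + Iy**2)
    vn = -It / (norm + eps)           # flow magnitude along the gradient direction
    return vn, Ix, Iy, It

Only where the gradient norm is significant (edges) is v_n meaningful, which is exactly the limitation discussed above.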




Fig. 6.17 Graphical representation of the constraint of the optical flow based on the gradient. a Equation (6.9) is represented by the straight line locus of the points (vx , v y ) which are possible multiple solutions of the optical flow of a pixel of the image according to the values of the spatial gradient ∇ I and time gradient It . b Of the real 2D motion (v x , v y ) (red point indicated in figure a), it is possible to estimate the speed component of the optical flow in the direction of the gradient (Eq. 6.16) perpendicular to the variation of gray levels (contour) while the component p parallel to the edge cannot be estimated. c In order to solve the aperture problem and more generally to obtain a reliable estimate of the optical flow, it is possible to calculate the velocities even in the pixels close to the one being processed (spatial coherence of the flow) hypothesizing pixels belonging to the same surface that they have the same motion in a small time interval. Each pixel in the speed diagram generates a constraint straight line of the optical flow that tends to intersect in a small area whose center of gravity represents the components of the speed of the estimated optical flow for the set of locally processed pixels

6.4.1 Horn–Schunck Method

Horn and Schunck [7] have proposed a method to model the situations in which the continuity Eq. (6.14) is violated (due to small variations of the gray levels and to noise), imposing the constraint of spatial coherence in the form of a regularization term E_2 in the following function E(v_x, v_y) to be minimized:

E(v_x, v_y) = E_1(v_x, v_y) + λ E_2(v_x, v_y)
            = Σ_{(x,y)∈Ω} (I_x v_x + I_y v_y + I_t)² + λ Σ_{(x,y)∈Ω} [ (∂v_x/∂x)² + (∂v_y/∂x)² + (∂v_x/∂y)² + (∂v_y/∂y)² ]                    (6.17)


where the minimization process involves all the pixels p(x, y) of an image I(:, :, t) of the sequence, whose domain we indicate for simplicity with Ω. The first term E_1 represents the error of the measurements (also known as data energy), based on Eq. (6.13); the second term E_2 represents the constraint of spatial coherence (also known as smoothness energy or smoothness error); and λ is the regularization parameter that controls the relative importance of the constraint of continuity of intensity (Eq. 6.11) and of the validity of spatial coherence. The introduction of the spatial coherence constraint E_2, as an error term expressed by the squared partial derivatives of the velocity components, restricts the class of possible solutions for the calculation of the flow velocity (v_x, v_y), transforming an ill-conditioned problem into a well-posed one. It should be noted that in this context the optical flow components v_x and v_y are functions of the spatial coordinates x and y. To avoid confusion in the symbols,



let us indicate for now the horizontal and vertical components of the optical flow with u = v_x and v = v_y. Using the variational approach, Horn and Schunck derived the differential Eq. (6.22) as follows. The objective is the estimation of the optical flow components (u, v) which minimize the energy function

E(u, v) = (u I_x + v I_y + I_t)² + λ (u_x² + v_x² + u_y² + v_y²)                    (6.18)

reformulated with the new symbolism, where u_x = ∂v_x/∂x, v_x = ∂v_y/∂x, u_y = ∂v_x/∂y and v_y = ∂v_y/∂y represent the first derivatives of the optical flow components (now denoted by u and v) with respect to x and y. Differentiating the function (6.18) with respect to the unknown variables u and v, we obtain

∂E(u, v)/∂u = 2 I_x (u I_x + v I_y + I_t) + 2λ (u_xx + u_yy)
∂E(u, v)/∂v = 2 I_y (u I_x + v I_y + I_t) + 2λ (v_xx + v_yy)                    (6.19)

where (u_xx + u_yy) and (v_xx + v_yy) are, respectively, the Laplacians of u(x, y) and v(x, y), as shown in the footnote.³ In essence, the expression corresponding to the Laplacian controls the contribution of the smoothness term of (6.18) to the optical flow; rewritten, (6.19) becomes

∂E(u, v)/∂u = 2 I_x (u I_x + v I_y + I_t) + 2λ ∇²u
∂E(u, v)/∂v = 2 I_y (u I_x + v I_y + I_t) + 2λ ∇²v                    (6.20)

A possible solution for the minimization of the function (6.18) is to set to zero the partial derivatives given by (6.20), approximating the Laplacians with the differences of the flow components u and v from their local averages ū and v̄, obtained on a local window W centered on the pixel being processed (x, y):

Δu = u − ū ≈ ∇²u        Δv = v − v̄ ≈ ∇²v                    (6.21)

³ In fact, considering the smoothness term of (6.18) and deriving with respect to u we have

∂(u_x² + v_x² + u_y² + v_y²)/∂u = 2 ∂/∂u (∂u(x, y)/∂x) + 2 ∂/∂u (∂u(x, y)/∂y) = 2 (∂²u/∂x² + ∂²u/∂y²) = 2 (u_xx + u_yy) = 2 ∇²u

which corresponds to the second-order differential operator defined as the divergence of the gradient of the function u(x, y) in a Euclidean space. This operator is known as the Laplace operator or simply Laplacian. Similarly, the Laplacian of the function v(x, y) is derived.

506

6 Motion Analysis

Replacing (6.21) in Eq. (6.20), setting these last equations to zero and reorganizing them, we obtain the following equations as a possible solution to minimize the function E(u, v), given by   λ + I x2 u + v I x I y = λu¯ − I x It (6.22)   λ + I y2 v + u I x I y = λ · v¯ − I y It Thus, we have 2 equations in the 2 unknowns u and v. Expressing v in terms of u and putting it in the other equation, we get the following: I x · u¯ + I y · v¯ + It λ + I x2 + I y2 I x · u¯ + I y · v¯ + It v = v¯ − I y · λ + I x2 + I y2

u = u¯ − I x ·

(6.23)

The calculation of the optical flow is performed by applying the iterative method of Gauss–Seidel, using a pair of consecutive images of the sequence. The goal is to explore the space of possible solutions of (u, v) such that, for a given value found at the kth iteration, the function E(u, v) is minimized within a minimum acceptable

error for the type of dynamic images of the sequence in question. The iterative procedure applied to two images would be the following: 1. From the sequence of images choose, an adjacent pair of images I1 and I2 for each of which a two-dimensional spatial Gaussian filter is applied, with an appropriate standard deviation σ , to attenuate the noise. Apply the Gaussian filter to the time component by considering the other images adjacent to the pair in relation to the size of the standard deviation σt of the time filter. The initial values of the velocity components u and v are assumed zero for each pixel of the image. 2. Kth iterative process. Calculate the velocities u (k) and v(k) for all the pixels (i, j) of the image by applying Eq. (6.23): u (k) (i, j) = u¯ (k−1) (i, j) − I x (i, j) · v

(k)

(i, j) = v¯

(k−1)

I x · u¯ (k−1) + I y · v¯ (k−1) + It λ + I x2 + I y2

I x · u¯ (k−1) + I y · v¯ (k−1) + It (i, j) − I y (i, j) · λ + I x2 + I y2

(6.24)

where I x , I y , It are initially calculated from the pair of consecutive images. 3. Compute the global error e at the kth iteration over the entire image by applying Eq. (6.17):  E 2 (i, j) e= i

j

If the value of the error e is less than a certain threshold es , proceed with the next iteration, that is, return to step 2 of the procedure, otherwise, the iterative

6.4 Optical Flow Estimation

507

process ends and the last values of u and v are assumed as the definitive estimates of the optical flow map of the same dimensions as the images. The regularization parameter λ is experimentally set at the beginning with a value between 0 and 1, choosing by trial and error the optimal value in relation to the type of dynamic images considered. The described algorithm can be modified to use all the images in the sequence. In essence, in the iterative process, instead of always considering the same pair of images, the following image of the sequence is considered at the kth iteration. The algorithm is thus modified: 1. Similar to the previous one, applying the Gaussian spatial and time filter to all the images in the sequence. The initial values of u and v instead of being set to zero are initialized by applying Eq. (6.24) to the first two images of the sequence. The iteration begins with k = 1, which represents the initial estimate of optical flow. 2. Calculation of the (k+1)th estimation of the speed of the optical flow based on the current values of the kth iteration and of the next image of the sequence. Equations are applied to all pixels in the image: u (k+1) (i, j) = u¯ (k) (i, j) − I x (i, j) · v

(k+1)

(i, j) = v¯

(k)

I x · u¯ (k) + I y · v¯ (k) + It λ + I x2 + I y2

I x · u¯ (k) + I y · v¯ (k) + It (i, j) − I y (i, j) · λ + I x2 + I y2

(6.25)

3. Repeat step 2 and finish when the last image in the sequence has been processed. The iterative process requires thousands of iterations, and only experimentally, one can verify which are the optimal values of the regularization parameter λ and of the threshold es that adequately minimizes the error function e. The limits of the Horn and Schunck approach are related to the fact that in real images the constraints of continuity of intensity and spatial coherence are violated. In essence, the calculation of the gradient leads to two contrasting situations: on the one hand, for the calculation of the gradient it is necessary that the intensity varies locally in a linear way and this is generally invalid in the vicinity of the edges; on the other hand, again in the areas of the edges that delimit an object, the smoothness constraint is violated, since normally the surface of the object can have different depths. A similar problem occurs in areas, where different objects move with different motions. In the border areas, the conditions of notable variations in intensity occur, generating very variable values of flow velocity. However, the smoothness component tends to propagate the flow velocity even in areas where the image does not show significant speed changes. For example, this occurs when a single object moves with respect to a uniform background where it becomes difficult to distinguish the velocity vectors associated with the object from the background.

508

6 Motion Analysis

v (Ix,Iy) vn d

(u (u,v) (u,v) Ixu+Iyv+It=0 u

Fig. 6.18 Graphical interpretation of Horn–Schunck’s iterative process for optical flow estimation. During an iteration, the new velocity (u, v) of a generic pixel (x, y) is updated by subtracting the value of the local average velocity (u, ¯ v¯ ) the update value according to Eq. (6.24) moving on the line perpendicular to the line of the motion constraint and in the direction of the spatial gradient

Fig. 6.19 Results of the optical flow calculated with the Horn–Schunck method. The first line shows the results obtained on synthetic images [10], while in the second line the flow is calculated on real images

Figure 6.18 shows a graphical interpretation of Horn–Schunck’s iterative process for optical flow estimation. The figure shows in the diagram of the optical flow components how, for a pixel in process, during any iteration, a speed will be assigned for a point lying on a line perpendicular to the constraint-speed line and passing through the average speed (u, ¯ v¯ ) value of the pixels in its vicinity. In essence, the current pixel speed is updated by subtracting an update value from the current average value according to (6.25). This normally happens in the homogeneous areas of the image where the smoothness constraint is maintained. In the presence of edges, the situation changes, forming at least two homogeneous areas of speed that are distinct from one another. Figure 6.19 shows the results of the Horn–Schunck method applied on synthetic and real images with various types of simple and complex motion. The spatial and temporal gradient was calculated with windows of size from 5 × 5 up to 11 × 11.

6.4 Optical Flow Estimation

509

To mitigate the limitations highlighted above, Nagel [11] proposed an approach based always on conservation of intensity but reformulated the function of Horn and Schunck considering the terms of the second order to model the gray level variations caused by motion. For the speed calculation, at each point of the image, the following function is minimized, defined in a window W of the image centered on the pixel being processed: e=

  

I (x, y, t) − I (x0 , y0 , t0 ) − I x · [x − u] − I y · [y − v]

W



2 1 1 I x x · [x − u]2 − I x y · [x − u] · [y − v] − I yy · [y − v]2 dxdy 2 2

(6.26)

With the introduction of the second-order derivative in areas where there are small variations in intensity and where, normally, the gradient is not very significant, there are appreciable values of the second derivatives, thus eliminating the problem of attenuating the flow velocities at the edges of the object with respect to the background. Where contours with high gradient values are present, the smoothness constraint is forced only in the direction of the contours, while it is attenuated in the orthogonal direction to the contours, since the values of the second derivatives assume contained values. The constraints introduced by Nagel are called oriented smoothness constraints. Given the complexity of the function (6.26), Nagel applied it only to image areas with significant regions (contours, edges, etc.) and only in these points is the functional minimized. In the points (x0 , y0 ) where the contours are present only the main curvature is zero, i.e., I x y = 0, while the gradient is different from zero, as well as the second derivatives I x x and I yy . The maximum gradient value occurs at the contour at point (x0 , y0 ), where at least one of the second derivatives passes through zero. If, for example, I x x = 0, this implies that I x is maximum and the component I y = 0 (inflection points). With these assumptions the previous equation is simplified and we have    2 1 I (x, y, t) − I (x0 , y0 , t0 ) − I x · [x − u] − I yy · [y − v]2 dxdy e= 2 (y,y)∈W (6.27) from which we can derive the system of two differential equations in two unknowns, namely the velocity components u and v. Even the model set by Nagel, in practice, has limited applications due to the difficulty in correctly calculating the second-order derivatives for images that normally present a nonnegligible noise.

6.4.2 Discrete Least Squares Horn–Schunck Method A simpler approach to estimate the optical flow is based on the minimization of the function (6.17) with the least squares regression method (Least Square Error-LSE) and approximating the derivatives of the smoothness constraint E 2 with simple symmetrical or asymmetrical differences. With these assumptions, if I (i, j) is the

510

6 Motion Analysis

pixel of the image being processed, the smoothness error constraint E 2 (i, j) is defined as follows: E 2 (i, j) =

 1 (u i+1, j − u i, j )2 + (u i, j+1 − u i, j )2 + (vi+1, j − vi, j )2 + (vi, j+1 − vi, j )2 (6.28) 4

A better approximation would be obtained by calculating the symmetric differences (of the type u i+1, j − u i−1, j ). The term E 1 based on Eq. (6.13), constraint of the optical flow error, results in the following: E 1 (i, j) = [I x (i, j)u i, j + I y (i, j)vi, j + It (i, j)]2

(6.29)

The regression process involves finding the set of unknowns of the flow components {u i, j , vi, j }, which minimizes the following function: e(i, j) =



E 1 (i, j) + λE 2 (i, j)

(6.30)

(i, j)∈

According to the LSE method, differentiating the function e(i, j) with respect to the unknowns u i, j and vi, j for E 1 (i, j) we have ⎧ ∂ E (i, j) 1 ⎪ = 2[I x (i, j)u i, j + I y (i, j)vi, j + It (i, j)]I x (i, j) ⎪ ⎨ ∂u i, j ∂ E 1 (i, j) ⎪ ⎪ ⎩ = 2[I x (i, j)u i, j + I y (i, j)vi, j + It (i, j)]I y (i, j) ∂vi, j

(6.31)

and for E 2 (i, j) we have ∂ E 2 (i, j) = −2[(u i+1, j − u i, j ) + (u i, j+1 − u i, j )] + 2[(u i, j − u i−1, j ) + (u i, j − u i, j−1 )] ∂u i, j   = 2 (u i, j − u i+1, j ) + (u i, j − u i, j+1 ) + (u i, j − u i−1, j ) + (u i, j − u i, j−1 )

(6.32)

In (6.32), we have the only unknown term u i, j and we can simplify it by putting it in the following form:   1 1 ∂ E 2 (i, j) = 2 u i, j − (u i+1, j + u i, j+1 + u i−1, j + u i, j−1 ) = 2[u i, j − u i, j ] 4 ∂u i, j 4  

(6.33)

local average u i, j

Differentiating E 2 (i, j) (Eq. 6.28) with respect to vi, j we obtain the analogous expression: 1 ∂ E 2 (i, j) = 2[vi, j − vi, j ] (6.34) 4 ∂vi, j

6.4 Optical Flow Estimation

511

Combining together the results of the partial derivatives of E 1 (i, j) and E 2 (i, j) we have the partial derivatives of the function e(i, j) to be minimized: ⎧ ∂e(i, j) ⎪ = 2[u i, j − u i, j ] + 2λ[I x (i, j)u i, j + I y (i, j)vi, j + It (i, j)]I x (i, j) ⎪ ⎨ ∂u i, j ⎪ ∂e(i, j) ⎪ ⎩ = 2[vi, j − vi, j ] + 2λ[I x (i, j)u i, j + I y (i, j)vi, j + It (i, j)]I y (i, j) ∂vi, j

(6.35)

Setting to zero the partial derivatives of the error function (6.35) and solving with respect to the unknowns u i, j and vi, j , the following iterative equations are obtained: (k+1) u i, j (k+1) vi, j

=

(k) u¯ i, j

=

(k) v¯ i, j

(k)

− I x (i, j) ·

1 + λ[(I x2 (i, j) + I y2 (i, j)] (k)

− I y (i, j) ·

(k)

I x (i, j) · u¯ i, j + I y (i, j) · v¯ i, j + It (i, j) (k)

I x (i, j) · u¯ i, j + I y (i, j) · v¯ i, j + It (i, j)

(6.36)

1 + λ[(I x2 (i, j) + I y2 (i, j)]

6.4.3 Horn–Schunck Algorithm For the calculation of the optical flow at least two adjacent images of the temporal sequence are used at time t and t +1. In the discrete case, iterative Eq. (6.36) are used to calculate the value of the velocities u i,k j and vi,k j to the kth iteration for each pixel (i, j) of the image of size M × N . The spatial and temporal gradient in each pixel is calculated using one of the convolution masks (Sobel, Roberts, . . .) described in Chap. 1 Vol. II. The original implementation of Horn estimated the spatial and temporal derivatives (the data of the problem) I x , I y , It considering the mean of the differences (horizontal, vertical, and temporal) between the pixel being processed (i, j) and the 3 spatially and temporally adjacent, given by 1 I (i, j + 1, t) − I (i, j, t) + I (i + 1, j + 1, t) − I (i + 1, j, t) 4  + I (i, j + 1, t + 1) − I (i, j, t + 1) + I (i + 1, j + 1, t + 1) − I (i + 1, j, t + 1)

I x (i, j, t) ≈

1 I (i + 1, j, t) − I (i, j, t) + I (i + 1, j + 1, t) − I (i, j + 1, t) 4  + I (i + 1, j, t + 1) − I (i, j, t + 1) + I (i + 1, j + 1, t + 1) − I (i, j + 1, k + 1) 1 It (i, j, t) ≈ I (i, j, t + 1) − I (i, j, t) + I (i + 1, j, t + 1) − I (i + 1, j, t) 4  + I (i, j + 1, t + 1) − I (i, j + 1, t) + I (i + 1, j + 1, t + 1) − I (i + 1, j + 1, k)

I y (i, j, t) ≈

(6.37)

To make the calculation of the optical flow more efficient in each iteration, it is useful to formulate Eq. (6.36) as follows: u i,(k+1) = u¯ i,(k)j − α I x (i, j) j (k+1)

vi, j

(k)

= v¯ i, j − α I y (i, j)

(6.38)



where

(k)

α(i, j, k) =

(k)

I x (i, j) · u¯ i, j + I y (i, j) · v¯ i, j + It (i, j) 1 + λ[(I x2 (i, j) + I y2 (i, j)]

(6.39)

Recall that the averages u¯ i, j and v¯ i, j are calculated on the 4 adjacent pixels as indicated in (6.33). The pseudo code of the Horn–Schunck algorithm is reported in Algorithm 26. Algorithm 26 Pseudo code for the calculation of the optical flow based on the discrete Horn–Schunck method. 1: Input: Maximum number of iterations N iter = 10; λ = 0.1 (Adapt experimentally) 2: Output: The dense optical flow u(i, j), v(i, j), i = 1, M and j = 1, N 3: for i ← 1 to M do 4: 5: 6: 7: 8: 9:

for j ← 1 to N do Calculates I x (i, j, t), I y (i, j, t), and It (i, j, t) with Eq. (6.37) Handle the image areas with edges u(i, j) ← 0 v(i, j) ← 0 end for

10: end for 11: k ← 1 12: while k ≤ N iter do 13: 14: 15: 16: 17:

for i ← 1 to M do for j ← 1 to N do Calculate with edge handling u(i, ¯ j) and v¯ (i, j) Calculate α(i, j, k) with Eq. (6.39) Calculate u(i, j) and v(i, j) with Eq. (6.38)

18:

end for

19: 20:

end for k ←k+1

21: end while 22: return u and v
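A compact vectorized transcription of the scheme above in Python (NumPy/SciPy) may help; the derivative and averaging kernels follow Eqs. (6.33) and (6.37) in spirit, but the exact border handling and the parameter values are assumptions of this sketch, not the book's reference implementation:

import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=0.1, n_iter=100):
    """Dense optical flow (u, v) between two grayscale frames, Horn-Schunck style."""
    I1 = I1.astype(np.float64); I2 = I2.astype(np.float64)
    # Spatial and temporal derivatives: averaged differences over the frame pair
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(I1, kx) + convolve(I2, kx)
    Iy = convolve(I1, ky) + convolve(I2, ky)
    It = convolve(I2, kt) - convolve(I1, kt)
    # 4-neighbour averaging kernel for the local means of u and v (as in Eq. 6.33)
    avg = np.array([[0, 0.25, 0], [0.25, 0, 0.25], [0, 0.25, 0]])
    u = np.zeros_like(I1); v = np.zeros_like(I1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        alpha = (Ix * u_bar + Iy * v_bar + It) / (lam + Ix**2 + Iy**2)
        u = u_bar - Ix * alpha      # update of Eq. (6.24)
        v = v_bar - Iy * alpha
    return u, v

As the text stresses, the number of iterations and the regularization parameter λ must be tuned experimentally for the type of dynamic images considered.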



6.4.4 Lucas–Kanade Method

The preceding methods, used for the calculation of the optical flow, have the drawback, being iterative, of not guaranteeing convergence in the minimization of the error function. Furthermore, they require the calculation of derivatives of order higher than the first and, in areas with gray-level discontinuities, the optical flow velocities are estimated with a considerable error. An alternative approach is based on the assumption that, if locally, for windows of limited size, the optical flow velocity components (v_x, v_y) remain constant, from the brightness conservation equations we obtain a system of linear equations solvable with least squares approaches [12]. Figure 6.17c shows how the lines defined by each pixel of the window represent geometrically, in the domain (v_x, v_y), the optical flow Eq. (6.13). Assuming that the window pixels have the same speed, the lines intersect in a limited area whose center of gravity represents the real 2D motion. The size of the intersection area of the lines also depends on the error with which the spatial derivatives (I_x, I_y) and the temporal derivative I_t are estimated, caused by the noise of the image sequence. Therefore, the velocity v of the pixel being processed is estimated by the linear regression method (line fitting), setting up an overdetermined system of linear equations that defines an energy function. Applying the brightness continuity Eq. (6.14) to the N pixels p_i of a window W (centered on the pixel being processed) of the image, we have the following system of linear equations:

∇I(p_1) · v(p_1) = −I_t(p_1)
∇I(p_2) · v(p_2) = −I_t(p_2)
   ...
∇I(p_N) · v(p_N) = −I_t(p_N)                    (6.40)

that in extended matrix form becomes

⎡ I_x(p_1)  I_y(p_1) ⎤             ⎡ I_t(p_1) ⎤
⎢ I_x(p_2)  I_y(p_2) ⎥   ⎡v_x⎤     ⎢ I_t(p_2) ⎥
⎢    ...       ...   ⎥ · ⎣v_y⎦ = − ⎢    ...   ⎥
⎣ I_x(p_N)  I_y(p_N) ⎦             ⎣ I_t(p_N) ⎦

(6.41)

and indicating with A the matrix of the components I x ( pi ) and I y ( pi ) of the spatial gradient of the image, with b the matrix of the components It ( pi ) of the temporal gradient and with v the speed of the optical flow, we can express the previous relation in the compact matrix form: A ·  v = −  b  N ×2

2×1

N ×1

(6.42)



With N > 2, the linear equation system (6.42) is overdetermined and this means that it is not possible to find an exact solution, but only an approximate estimate v˜ , which minimizes the norm of the vector e derived with the least squares approach:  e = Av + b 

(6.43)

solvable as an overdetermined inverse problem, in the sense of finding the minimum norm of the error vector which satisfies the following linear system: (AT · A) · v˜ = AT b      2×2

2×1

(6.44)

2×1

from which the following estimate of the velocity vector is obtained: v˜ = (AT · A)−1 AT b

(6.45)

for the image pixel being processed, centered on the W window. The solution (6.45) exists if the matrix (AT A)4 is calculated as follows: N N I x ( pi )I x ( pi ) i=1 I x ( pi )I y ( pi ) Iαα Iαβ T i=1 (6.46) = (A A) = N N Iαβ Iββ i=1 I x ( pi )I y ( pi ) i=1 I y ( pi )I y ( pi ) with size 2 × 2 and is invertible (not singular), i.e., if 2 det(AT A) = Iαα Iββ − Iαβ = 0

(6.47)

The solvability of (6.44) is verifiable also by analyzing the eigenvalues λ1 and λ2 of the matrix (AT A) that must be greater than zero and not very small to avoid artifacts in the presence of noise. The eigenvector analysis of the (AT A) has already been considered in Sect. 6.3 Vol. II in determining points of interest. In fact, it has been verified that if λ1 /λ2 is too large we have seen that the pixel p being processed belongs to the contour, and therefore the method is conditioned by the aperture problem. Therefore, this method operates adequately if the two eigenvalues are sufficiently large and of the same order of magnitude (i.e., the flow is estimated only in areas of nonconstant intensity, see Fig. 6.20) which is the same constraint required for the automatic detection of points of interest (corner detector) in the image. In particular, when all the spatial gradients are parallel or both null, the AT A matrix becomes singular (rank 1). In the homogeneous zones, the gradients are very small and there are small eigenvalues (numerically unstable solution).

4 The matrix (A T

A) is known in the literature as the tensor structure of the image relative to a pixel p. A term that derives from the concept of tensor which generically indicates a linear algebraic structure able to mathematically describe an invariable physical phenomenon with respect to the adopted reference system. In this case, it concerns the analysis of the local motion associated with the pixels of the W window centered in p.






Fig. 6.20 Operating conditions of the Lucas–Kanade method. In the homogeneous zones, there are small values of the eigenvectors of the tensor matrix AT A while in the zones with texture there are high values. The eigenvalues indicate the robustness of the contours along the two main directions. On the contour area the matrix becomes singular (not invertible) if all the gradient vectors are oriented in the same direction along the contour (aperture problem, only the normal flow is calculable)

Returning to Eq. (6.46), it is also observed that in the matrix (AT A) does not appear the component of temporal gradient It and, consequently, the accuracy of the optical flow estimation is closely linked to the correct calculation of the spatial gradient components I x and I y . Now, appropriately replacing in Eq. (6.45) the inverse matrix (AT A)−1 given by (6.46), and the term AT b is obtained from AT b =

I x ( p1 ) I y ( p1 )

⎡ ⎤ It ( p1 ) ⎢ It ( p2 ) ⎥  ⎥ I x ( p2 ) . . . I x ( p N ) ⎢ ⎢ . ⎥= − ⎥ I y ( p2 ) . . . I y ( p N ) ⎢ − ⎣ .. ⎦ It ( p N )

N i=1 I x ( pi )It ( pi ) N i=1 I y ( pi )It ( pi )



=

−Iαγ −Iβγ



(6.48)

it follows that the velocity components of the optical flow are calculated from the following relation: ⎡ Iαγ Iββ −Iβγ Iαβ ⎤ Iαα Iββ −I 2 v˜ x = − ⎣ Iβγ Iαα −IαγαβIαβ ⎦ (6.49) v˜ y 2 Iαα Iββ −Iαβ

The process is repeated for all the points of the image, thus obtaining a dense map of optical flow. Window sizes are normally chosen at 3×3 and 5×5. Before calculating the optical flow, it is necessary to filter the noise of the images with a Gaussian filter with standard deviation σ according to σt , for filtering in the time direction. Several other methods described in [13] have been developed in the literature. Figure 6.21 shows the results of the Lucas–Kanade method applied on synthetic and real images with various types of simple and complex motion. The spatial and temporal gradient was calculated with windows of sizes from 3 × 3 up to 13 × 13. In the case of RGB color image, Eq. (6.42) (which takes constant motion locally) can still be used considering the gradient matrix of size 3 · N × 2 and the vector of



the time gradient of size 3 · N × 1 where in essence each pixel is represented by the triad of color components thus extending the dimensions of the spatial and temporal gradient matrices.
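A minimal dense Lucas–Kanade sketch in Python (NumPy/SciPy) along the lines of Eqs. (6.45)–(6.49); the window size, the derivative estimates and the eigenvalue threshold tau are assumptions of this example:

import numpy as np
from scipy.ndimage import uniform_filter

def lucas_kanade(I1, I2, win=5, tau=1e-2):
    """Per-pixel flow from the 2x2 structure-tensor system of Eq. (6.44)."""
    I1 = I1.astype(np.float64); I2 = I2.astype(np.float64)
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1
    # Windowed sums (here: means) of the products, i.e., the entries of A^T A and A^T b
    Sxx = uniform_filter(Ix * Ix, win); Sxy = uniform_filter(Ix * Iy, win)
    Syy = uniform_filter(Iy * Iy, win)
    Sxt = uniform_filter(Ix * It, win); Syt = uniform_filter(Iy * It, win)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    # Smaller eigenvalue of A^T A: reject ill-conditioned pixels (aperture problem)
    lam_min = trace / 2 - np.sqrt(np.maximum((trace / 2)**2 - det, 0))
    ok = lam_min > tau
    u = np.zeros_like(I1); v = np.zeros_like(I1)
    u[ok] = (-Syy[ok] * Sxt[ok] + Sxy[ok] * Syt[ok]) / det[ok]
    v[ok] = ( Sxy[ok] * Sxt[ok] - Sxx[ok] * Syt[ok]) / det[ok]
    return u, v

The closed form used for u and v is the explicit 2 × 2 inversion corresponding to Eq. (6.49), and the eigenvalue test implements the conditioning check discussed for the structure tensor.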

6.4.4.1 Variant of the Lucas–Kanade Method with the Introduction of Weights

The original Lucas–Kanade method assumes that each pixel of the window has constant optical flow. In reality, this assumption is violated, and it is reasonable to weigh with less importance the influence of the pixels farther away from the pixel being processed. In essence, the speed v is calculated by giving a greater weight to the pixel being processed and less weight to all the other pixels of the window, in relation to their distance from the pixel being processed. If W = diag{w_1, w_2, …, w_k} is the diagonal matrix of the weights, of size k × k, the equalities W^T W = W W = W² hold. Multiplying both members of (6.42) by the weight matrix W we get

W A v = W b

(6.50)

where both members are weighted in the same way. To find the solution, some matrix manipulations are needed. We multiply both members of the previous equation by (W A)^T and we get

(W A)^T W A v = (W A)^T W b
A^T W^T W A v = A^T W^T W b
A^T W W A v = A^T W W b

Fig. 6.21 Results of the optical flow calculated with the Lucas–Kanade method. The first line shows the results obtained on synthetic images [10], while in the second line the flow is calculated on real images

6.4 Optical Flow Estimation

517 T W2 A v = AT W2 b A 

(6.51)

2×2

If the determinant of (A^T W^2 A) is different from zero, its inverse (A^T W^2 A)^{-1} exists, and we can solve the last equation of (6.51) with respect to v, obtaining:

$$
\mathbf{v} = (A^T W^2 A)^{-1} A^T W^2 \mathbf{b} \tag{6.52}
$$

The original algorithm is modified by using, for each pixel p(x, y) of the image I(:, :, t), the system of Eq. (6.50), which includes the weight matrix; the least-squares solution is then given by (6.52), as illustrated in the sketch below.
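A minimal sketch of the weighted solution (6.52) for a single window; the Gaussian weighting profile and the function name are assumptions made for illustration.

```python
import numpy as np

def weighted_lk_pixel(Ix_win, Iy_win, It_win, sigma=1.0):
    """Solve v = (A^T W^2 A)^-1 A^T W^2 b  (Eq. 6.52) for one window.

    Ix_win, Iy_win, It_win: square windows of spatial/temporal derivatives
    centered on the pixel being processed.
    """
    n = Ix_win.shape[0]
    # Gaussian weights: larger at the window center, smaller toward the border
    r = np.arange(n) - n // 2
    g = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2 * sigma ** 2))
    w2 = g.ravel() ** 2                       # diagonal of W^2
    A = np.stack([Ix_win.ravel(), Iy_win.ravel()], axis=1)   # N x 2
    b = -It_win.ravel()
    AtW2 = A.T * w2                           # A^T W^2   (2 x N)
    M = AtW2 @ A                              # A^T W^2 A (2 x 2)
    if abs(np.linalg.det(M)) < 1e-9:          # nearly singular: aperture problem
        return np.zeros(2)
    return np.linalg.solve(M, AtW2 @ b)       # (vx, vy)
```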

6.4.5 BBPW Method

The BBPW (Brox–Bruhn–Papenberg–Weickert) method, described in [14], is an evolution of the Horn–Schunck method, in the sense that it tries to overcome the constraint of the constancy of gray levels (imposed by Eq. (6.11): I(x, y, t) = I(x + u, y + v, t + 1)) and the limitation to small displacements of objects between adjacent images of a sequence. These assumptions led to the linear equation I_x u + I_y v + I_t = 0 (linearization with Taylor expansion), which in reality is often violated, in particular in outdoor scenes (noticeable variation of gray levels) and for displacements that span several pixels (presence of very fast objects). The BBPW algorithm uses a more robust constraint based on the constancy of the spatial gradient, given by:

$$
\nabla_{x,y} I(x, y, t) = \nabla_{x,y} I(x + u, y + v, t + 1) \tag{6.53}
$$

where the gradient symbol \nabla_{x,y} is limited to the derivatives with respect to the x and y axes; it does not include the time derivative. Compared to the quadratic functions of the Horn–Schunck method, which do not allow flow discontinuities (at the edges and in the presence of gray-level noise), Eq. (6.53), based on the gradient, is a more robust constraint. With this new constraint it is possible to derive an energy function E_1(u, v) (data term) that penalizes deviations from these assumptions of constancy of the levels and of the spatial gradient, resulting in the following:

$$
E_1(u, v) = \sum_{(x,y)\in\Omega} \Big[ \big( I(x+u, y+v, t+1) - I(x, y, t) \big)^2 + \lambda_1 \big| \nabla_{x,y} I(x+u, y+v, t+1) - \nabla_{x,y} I(x, y, t) \big|^2 \Big] \tag{6.54}
$$

where λ_1 > 0 weighs one assumption relative to the other. Furthermore, the smoothness term E_2(u, v) must be considered, as done with the function (6.17). In this case the third component of the gradient is considered, which relates two temporally adjacent images from t to t + 1. Therefore, indicating with \nabla_3 = (\partial_x, \partial_y, \partial_t) the associated space–time gradient, the smoothness term E_2 is formulated as follows:

$$
E_2(u, v) = \sum_{(x,y)\in\Omega} \big( |\nabla_3 u|^2 + |\nabla_3 v|^2 \big) \tag{6.55}
$$

The total energy function E(u, v) to be minimized, which estimates the optical flow (u, v) for each pixel of the image sequence I(:, :, t), is given by the weighted sum of the data and smoothness terms:

$$
E(u, v) = E_1(u, v) + \lambda_2 E_2(u, v) \tag{6.56}
$$

where λ_2 > 0 appropriately weighs the smoothness term with respect to the data term (total variation with respect to the assumption of the constancy of the gray levels and of the gradient). Since the data term E_1 is set as a quadratic expression (Eq. 6.54), outliers (due to the variation of pixel intensities and the presence of noise) heavily influence the flow estimation. Therefore, instead of the least squares approach, the optimization problem is set with a more robust energy function, based on the increasing concave function Ψ(s), given by:

$$
\Psi(s^2) = \sqrt{s^2 + \epsilon^2} \approx |s| \qquad \epsilon = 0.001 \tag{6.57}
$$

This function is applied to the error functions E_1 and E_2, expressed, respectively, by Eqs. (6.54) and (6.55), thus setting the minimization problem in terms of a regularized energy function. The regularized functions result in the following:

$$
E_1(u, v) = \sum_{(x,y)\in\Omega} \Psi\Big( \big( I(x+u, y+v, t+1) - I(x, y, t) \big)^2 + \lambda_1 \big| \nabla_{x,y} I(x+u, y+v, t+1) - \nabla_{x,y} I(x, y, t) \big|^2 \Big) \tag{6.58}
$$

and

$$
E_2(u, v) = \sum_{(x,y)\in\Omega} \Psi\big( |\nabla_3 u|^2 + |\nabla_3 v|^2 \big) \tag{6.59}
$$

It is understood that the total energy, expressed by (6.56), is the weighted sum of the data term E_1 and the smoothness term E_2, controlled with the regularization parameter λ_2. It should be noted that with this approach the influence of outliers is attenuated, since the optimization problem is based on the l_1 norm of the gradient (known as total variation, TV-l_1) instead of the l_2 norm. The goal now is to find the functions u(x, y) and v(x, y) that minimize these energy functions, trying to reach a global minimum. From the theory of the calculus of variations, a general way to minimize an energy function is to require that it satisfy the Euler–Lagrange differential equations.
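As an illustration only (the function names are assumptions), the robust penalty (6.57) and its derivative, which appears as a robustness/diffusivity factor in the Euler–Lagrange equations below, can be written as:

```python
import numpy as np

EPS = 0.001  # the small constant of Eq. (6.57)

def psi(s2):
    """Robust (Charbonnier-like) penalty of Eq. (6.57), applied to s^2."""
    return np.sqrt(s2 + EPS ** 2)

def psi_prime(s2):
    """Derivative of psi with respect to its argument s^2; acts as a
    robustness factor in the data term and as a diffusivity in the
    smoothness term of the BBPW equations."""
    return 0.5 / np.sqrt(s2 + EPS ** 2)
```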


Without giving details on the derivation of the differential equations based on the calculus of variations [15], the resulting equations are:

$$
\Psi'\big( I_t^2 + \lambda_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_x I_t + \lambda_1 (I_{xx} I_{xt} + I_{xy} I_{yt}) \big) - \lambda_2 \, \mathrm{div}\Big( \Psi'\big( \|\nabla_3 u\|^2 + \|\nabla_3 v\|^2 \big)\, \nabla_3 u \Big) = 0 \tag{6.60}
$$

$$
\Psi'\big( I_t^2 + \lambda_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_y I_t + \lambda_1 (I_{yy} I_{yt} + I_{xy} I_{xt}) \big) - \lambda_2 \, \mathrm{div}\Big( \Psi'\big( \|\nabla_3 u\|^2 + \|\nabla_3 v\|^2 \big)\, \nabla_3 v \Big) = 0 \tag{6.61}
$$

where Ψ′ indicates the derivative of Ψ (with respect to u in (6.60) and with respect to v in (6.61)). The divergence div indicates the sum of the space–time gradients of smoothness \nabla_3 = (\partial_x, \partial_y, \partial_t) related to u and v. Recall that I_x, I_y, and I_t are the derivatives of I(:, :, t) with respect to the spatial coordinates x and y of the pixels and with respect to the time coordinate t; I_{xx}, I_{yy}, I_{xy}, I_{xt}, and I_{yt} are the corresponding second derivatives. All these derivatives, computed from two consecutive images of the sequence, constitute the data of the problem. The solution w = (u, v, 1), at each point p(x, y) ∈ Ω, of the nonlinear Eqs. (6.60) and (6.61) can be found with an iterative method of numerical approximation. The authors of BBPW used one based on fixed point iterations⁵ on w. The iterative formulation with index k of the previous nonlinear equations, starting from the initial value w^{(0)} = (0, 0, 1)^T, results in the following:

$$
\begin{cases}
\Psi'\Big( \big(I_t^{(k+1)}\big)^2 + \lambda_1 \big( (I_{xt}^{(k+1)})^2 + (I_{yt}^{(k+1)})^2 \big) \Big) \cdot \Big( I_x^{(k)} I_t^{(k+1)} + \lambda_1 \big( I_{xx}^{(k)} I_{xt}^{(k+1)} + I_{xy}^{(k)} I_{yt}^{(k+1)} \big) \Big) \\
\qquad - \lambda_2 \, \mathrm{div}\Big( \Psi'\big( \|\nabla_3 u^{(k+1)}\|^2 + \|\nabla_3 v^{(k+1)}\|^2 \big)\, \nabla_3 u^{(k+1)} \Big) = 0 \\[2mm]
\Psi'\Big( \big(I_t^{(k+1)}\big)^2 + \lambda_1 \big( (I_{xt}^{(k+1)})^2 + (I_{yt}^{(k+1)})^2 \big) \Big) \cdot \Big( I_y^{(k)} I_t^{(k+1)} + \lambda_1 \big( I_{yy}^{(k)} I_{yt}^{(k+1)} + I_{xy}^{(k)} I_{xt}^{(k+1)} \big) \Big) \\
\qquad - \lambda_2 \, \mathrm{div}\Big( \Psi'\big( \|\nabla_3 u^{(k+1)}\|^2 + \|\nabla_3 v^{(k+1)}\|^2 \big)\, \nabla_3 v^{(k+1)} \Big) = 0
\end{cases} \tag{6.62}
$$

This new system is still nonlinear due to the non-linearity of the function Ψ′ and of the derivatives I_*^{(k+1)}. The non-linearity of the derivatives I_*^{(k+1)} is removed with their expansion in Taylor series up to the first order:

$$
\begin{aligned}
I_t^{(k+1)} &\approx I_t^{(k)} + I_x^{(k)}\, du^{(k)} + I_y^{(k)}\, dv^{(k)} \\
I_{xt}^{(k+1)} &\approx I_{xt}^{(k)} + I_{xx}^{(k)}\, du^{(k)} + I_{xy}^{(k)}\, dv^{(k)} \\
I_{yt}^{(k+1)} &\approx I_{yt}^{(k)} + I_{xy}^{(k)}\, du^{(k)} + I_{yy}^{(k)}\, dv^{(k)}
\end{aligned} \tag{6.63}
$$

5 This represents a generalization of iterative methods. In general, we want to solve a nonlinear equation f(x) = 0 by leading it back to the problem of finding a fixed point of a function y = g(x); that is, we want to find a solution α such that f(α) = 0 ⟺ α = g(α). The iteration function has the form x^{(k+1)} = g(x^{(k)}), which iteratively produces a sequence of x for each k ≥ 0 and for an assigned initial value x^{(0)}. Not all iteration functions g(x) guarantee convergence to the fixed point. It can be shown that, if g(x) is continuous and the sequence x^{(k)} converges, then it converges to a fixed point α, that is, α = g(α), which is also a solution of f(x) = 0. A minimal numerical sketch is given below.
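A minimal numerical sketch of the fixed-point scheme described in the footnote (the example function g is purely illustrative):

```python
import math

def fixed_point(g, x0, tol=1e-8, max_iter=100):
    """Iterate x_{k+1} = g(x_k) until |x_{k+1} - x_k| < tol."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: solving f(x) = x - cos(x) = 0 via the fixed point of g(x) = cos(x)
alpha = fixed_point(math.cos, x0=1.0)   # alpha ~ 0.739085
```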


Therefore, we can separate the unknowns u^{(k+1)} and v^{(k+1)} into the solutions of the previous iteration, u^{(k)} and v^{(k)}, and the unknown increments du^{(k)} and dv^{(k)}, having

$$
u^{(k+1)} = u^{(k)} + du^{(k)} \qquad\qquad v^{(k+1)} = v^{(k)} + dv^{(k)} \tag{6.64}
$$

Substituting the expanded derivatives (6.63) in the first equation of the system (6.62), and separating for simplicity the data and smoothness terms, we define the following expressions:

$$
(\Psi')_{E_1}^{(k)} := \Psi'\Big( \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big)^2 + \lambda_1 \Big[ \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big)^2 + \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big)^2 \Big] \Big) \tag{6.65}
$$

$$
(\Psi')_{E_2}^{(k)} := \Psi'\Big( \big| \nabla_3 \big( u^{(k)} + du^{(k)} \big) \big|^2 + \big| \nabla_3 \big( v^{(k)} + dv^{(k)} \big) \big|^2 \Big) \tag{6.66}
$$

The term (\Psi')_{E_1}^{(k)}, defined with (6.65), is interpreted as a robustness factor in the data term, while the term (\Psi')_{E_2}^{(k)}, defined by (6.66), is considered to be the diffusivity in the smoothness term. With these definitions, the first equation of the system (6.62) is rewritten as follows:

$$
\begin{aligned}
&(\Psi')_{E_1}^{(k)} \cdot I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big) \\
&\quad + \lambda_1 (\Psi')_{E_1}^{(k)} \cdot \Big( I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big) \Big) \\
&\quad - \lambda_2 \, \mathrm{div}\Big( (\Psi')_{E_2}^{(k)} \, \nabla_3 \big( u^{(k)} + du^{(k)} \big) \Big) = 0
\end{aligned} \tag{6.67}
$$

Similarly, the second equation of the system (6.62) is redefined. For a fixed value of k, (6.67) is still non-linear, but now, having already estimated u^{(k)} and v^{(k)} with the first fixed-point approximation, the unknowns in Eq. (6.67) are the increments du^{(k)} and dv^{(k)}. Thus, there remains only the non-linearity due to the derivative Ψ′; but Ψ was chosen as a convex function,⁶ and the remaining optimization problem is a convex problem, that is, a minimal and unique solution can exist. The non-linearity in Ψ′ can be removed by applying a second iterative process based on the search for the fixed point of Eq. (6.67). Now consider the unknown variables to iterate, du^{(k,l)} and dv^{(k,l)}, where l indicates the lth iteration. We assume as initial values du^{(k,l)} = 0 and dv^{(k,l)} = 0, and we indicate with (\Psi')_{E_1}^{(k,l)} and (\Psi')_{E_2}^{(k,l)}, respectively, the robustness factor and the diffusivity expressed by the respective Eqs. (6.65) and (6.66) at the (k,l)th iteration. Finally, we can formulate the first

6 A function f(x) with real values, defined on an interval, is called convex if the segment joining any two points of its graph lies above the graph itself. Convexity simplifies the analysis and solution of an optimization problem. It can be shown that a convex function defined on a convex set either has no solution or has only global solutions; it cannot have exclusively local solutions.


linear system of equations in iterative form, with unknowns du^{(k,l+1)} and dv^{(k,l+1)}, given by

$$
\begin{aligned}
&(\Psi')_{E_1}^{(k,l)} \cdot I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k,l+1)} + I_y^{(k)} dv^{(k,l+1)} \big) \\
&\quad + \lambda_1 (\Psi')_{E_1}^{(k,l)} \cdot \Big( I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k,l+1)} + I_{xy}^{(k)} dv^{(k,l+1)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k,l+1)} + I_{yy}^{(k)} dv^{(k,l+1)} \big) \Big) \\
&\quad - \lambda_2 \, \mathrm{div}\Big( (\Psi')_{E_2}^{(k,l)} \, \nabla_3 \big( u^{(k)} + du^{(k,l+1)} \big) \Big) = 0
\end{aligned} \tag{6.68}
$$

This system can be solved using standard iterative numerical methods (Gauss–Seidel, Jacobi, Successive Over-Relaxation (SOR), also called the over-relaxation method), suitable for linear systems of large size with sparse matrices (presence of many null elements). A minimal Gauss–Seidel sketch for a generic linear system is given below.
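As a generic illustration of how such systems are typically solved iteratively (not the specific discretization of (6.68)), a minimal Gauss–Seidel iteration might look like this:

```python
import numpy as np

def gauss_seidel(A, b, x0=None, tol=1e-8, max_iter=500):
    """Solve A x = b iteratively; A is assumed square with nonzero diagonal."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # Use the most recently updated values of x (Gauss-Seidel sweep)
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, ord=np.inf) < tol:
            break
    return x
```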

6.4.6 Optical Flow Estimation for Affine Motion

The motion model considered until now is pure translation. If we consider that a small region R at time t is subject to an affine motion model, at time t + Δt the speed (or displacement) of its pixels is given by

$$
\mathbf{u}(\mathbf{x}; \mathbf{p}) = \begin{bmatrix} u(\mathbf{x}; \mathbf{p}) \\ v(\mathbf{x}; \mathbf{p}) \end{bmatrix} = \begin{bmatrix} p_1 + p_2 x + p_3 y \\ p_4 + p_5 x + p_6 y \end{bmatrix} \tag{6.69}
$$

remembering the affine transformation equations (described in Sect. 3.3 Vol. II) and assuming that the velocity is constant [12] for the pixels of the region R. Substituting (6.69) in the equation of the optical flow (6.13), it is possible to set up the following error function:

$$
e(\mathbf{x}; \mathbf{p}) = \sum_{\mathbf{x}\in R} \big( \nabla I^T \mathbf{u}(\mathbf{x}; \mathbf{p}) + I_t \big)^2 = \sum_{\mathbf{x}\in R} \big( I_x p_1 + I_x p_2 x + I_x p_3 y + I_y p_4 + I_y p_5 x + I_y p_6 y + I_t \big)^2 \tag{6.70}
$$

to be minimized with the least squares method. Given that there are 6 unknown motion parameters and each pixel provides only one linear equation, at least 6 pixels of the region are required to set up a system of linear equations. In fact, the minimization of the error function requires differentiating (6.70) with respect to the unknown vector p, setting the result of the differentiation equal to zero, and solving with respect to the motion parameters p_i, thus obtaining the following system of linear equations, where all sums are taken over the region R (see the sketch after the equation):

$$
\begin{bmatrix}
\sum I_x^2 & \sum x I_x^2 & \sum y I_x^2 & \sum I_x I_y & \sum x I_x I_y & \sum y I_x I_y \\
\sum x I_x^2 & \sum x^2 I_x^2 & \sum x y I_x^2 & \sum x I_x I_y & \sum x^2 I_x I_y & \sum x y I_x I_y \\
\sum y I_x^2 & \sum x y I_x^2 & \sum y^2 I_x^2 & \sum y I_x I_y & \sum x y I_x I_y & \sum y^2 I_x I_y \\
\sum I_x I_y & \sum x I_x I_y & \sum y I_x I_y & \sum I_y^2 & \sum x I_y^2 & \sum y I_y^2 \\
\sum x I_x I_y & \sum x^2 I_x I_y & \sum x y I_x I_y & \sum x I_y^2 & \sum x^2 I_y^2 & \sum x y I_y^2 \\
\sum y I_x I_y & \sum x y I_x I_y & \sum y^2 I_x I_y & \sum y I_y^2 & \sum x y I_y^2 & \sum y^2 I_y^2
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \end{bmatrix}
= -\begin{bmatrix}
\sum I_x I_t \\ \sum x I_x I_t \\ \sum y I_x I_t \\ \sum I_y I_t \\ \sum x I_y I_t \\ \sum y I_y I_t
\end{bmatrix} \tag{6.71}
$$
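A minimal sketch of setting up and solving (6.71) for one region; the function name and the pixel-coordinate convention (origin at the region corner) are illustrative assumptions.

```python
import numpy as np

def affine_flow_params(Ix, Iy, It):
    """Estimate the 6 affine motion parameters for a region R.

    Ix, Iy, It: arrays of spatial/temporal derivatives over the region.
    Returns p = (p1..p6) solving the normal equations (6.71).
    """
    h, w = Ix.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    # Each pixel contributes one row [Ix, x*Ix, y*Ix, Iy, x*Iy, y*Iy]
    A = np.stack([Ix, x * Ix, y * Ix, Iy, x * Iy, y * Iy],
                 axis=-1).reshape(-1, 6)
    b = -It.reshape(-1)
    # Least-squares solution of A p = b, i.e., the 6x6 system A^T A p = A^T b
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p
```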


As for pure translation, also for affine motion the Taylor series approximation constitutes a very rough estimate of the real motion. This imprecision can be mitigated with the iterative alignment approach proposed by Lucas–Kanade [12]. In the case of affine motion with large displacements, we can use the multi-resolution approach described in the previous paragraphs. With the same method, it is possible to manage other motion models (homography, quadratic, ...). A planar quadratic motion model is approximated by

$$
\begin{cases}
u = p_1 + p_2 x + p_3 y + p_7 x^2 + p_8 x y \\
v = p_4 + p_5 x + p_6 y + p_7 x y + p_8 y^2
\end{cases} \tag{6.72}
$$

while the homographic one results in:

$$
\begin{cases} u = x' - x \\ v = y' - y \end{cases}
\qquad \text{where} \qquad
x' = \frac{p_1 + p_2 x + p_3 y}{p_7 + p_8 x + p_9 y}, \qquad
y' = \frac{p_4 + p_5 x + p_6 y}{p_7 + p_8 x + p_9 y} \tag{6.73}
$$

6.4.6.1 Segmentation for Homogeneous Motion Regions

Once the optical flow is detected, assuming an affine motion model, it may be useful to partition the image into homogeneous regions (in the literature also known as layers), despite the difficulty encountered at the edges of objects due to the local aperture problem. In the literature [15,16], various methods are used, based on the gradient of the image and on statistical approaches, in particular to manage the discontinuity of the flow field. A simple segmentation method is to generate affine motion hypotheses by partitioning the image into elementary blocks. The affine motion parameters are estimated for each block by minimizing the error function (6.70), and then only the blocks with a low residual error are considered. To classify the various types of motion detected, it is possible to analyze how the flow associated with each pixel is distributed in the parametric space p. For example, the K-means algorithm can be applied to the significant affine motion parameters, joining clusters that are very close together and with low population, thus obtaining a quantitative description of the various types of motion present in the scene. The classification process can be iterated by assigning each pixel to the appropriate motion hypothesis. The essential steps of the segmentation procedure of the homogeneous motion areas based on K-means are as follows (a clustering sketch follows the list):

1. Partition the image into small blocks of n × n pixels (nonoverlapping blocks);
2. Link the affine flow model of the pair of blocks B(t) and B(t + 1) relative to the consecutive images;
3. Consider the 6 parameters of the affine motion model to characterize the motion of the block in this parametric space;
4. Classify with the K-means method in this parametric space R⁶;
5. Update the k clusters of affine motion models found;
6. Relabel the classes of affine motion (segmentation);
7. Refine the model of affine motion for each segmented area;
8. Repeat steps 6 and 7 until convergence to significant motion classes.
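A minimal sketch of step 4, clustering the per-block affine parameters; the use of scikit-learn and the choice of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_affine_motions(block_params, k=4):
    """Cluster per-block affine parameter vectors (shape: n_blocks x 6)
    into k motion hypotheses; returns labels and cluster centers."""
    params = np.asarray(block_params, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(params)
    return km.labels_, km.cluster_centers_
```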

6.4.7 Estimation of the Optical Flow for Large Displacements

The algorithms described in the previous paragraphs (Horn–Schunck, Lucas–Kanade, BBPW) are essentially based on the assumption of the constancy of gray levels (or of color) and on the constraint of spatial coherence (see Sect. 6.4), i.e., it is assumed that in the areas where the spatial and temporal gradients are evaluated, the visible surface belongs to the same object and all points move at the same speed, or vary slightly, in the image plane. We know that these constraints are violated in real applications, and therefore the optical flow equation I_x u + I_y v + I_t = 0 would not be valid. Essentially, the algorithms mentioned above, based on local optimization methods, would not work correctly for the detection of the optical flow (u, v) in the presence of objects with large motion. In fact, if the objects move very quickly, the pixel intensities change rapidly and the estimate of the space–time derivatives fails, because they are locally calculated on windows of inadequate size with respect to the motion of the objects. This last aspect could be addressed by using a larger window to estimate large movements, but this would strongly violate the assumption of spatial coherence of the motion. To make these algorithms feasible in these situations, it is useful first of all to improve the accuracy of the optical flow and subsequently to estimate the optical flow with a multi-resolution approach.

6.4.7.1 Iterative Refinement of the Optical Flow

To derive an iterative flow estimation approach, let us consider the one-dimensional case, indicating with I(x, t) and I(x, t + 1) the two signals (relative to the two images) at two instants of time (see Fig. 6.22). As shown in the figure, we can imagine that the signal at time (t + 1) is a translation of the one at time t. If at any point of the graph we knew exactly the speed u, we could predict its position at time t + 1, given by x = x_t + u. With the above algorithms, we can estimate only an approximate velocity ũ through the optical flow, given the brightness constancy constraint assumed by the optical flow equation, having ignored the terms higher than the first order in the Taylor expansion of I(x, y, t + 1). Rewritten in the 1D case, we have

$$
I_x u + I_t = 0 \quad \Longrightarrow \quad u = -\frac{I_t}{I_x} \approx -\frac{I(x, t+1) - I(x, t)}{I_x} \tag{6.74}
$$

where the time derivative I_t has been approximated by the difference of the pixels of the two temporal images. The iterative algorithm for the estimation of the 1D motion of each pixel p(x) is as follows:


Fig. 6.22 Iterative refinement of the optical flow. Representation of the 1D signals I(x, t) and I(x, t + 1) relative to the temporal images observed at two instants of time. Starting from an initial speed, the flow is updated by imagining to translate the signal I(x, t) to superimpose it over I(x, t + 1). Convergence occurs in a few iterations, calculating in each iteration the space–time gradient on the window centered on the pixel being processed, which gradually shifts less and less until the two signals overlap. The flow update takes place with (6.75); for the signal in the figure, the time derivative varies while the spatial derivative remains constant

1. Compute for the pixel p the spatial derivative I_x using the pixels close to p;
2. Set the initial speed of p. Normally we assume u ← 0;
3. Repeat until convergence:
   (a) Locate the pixel in the adjacent time image I(x′, t + 1) assuming the current speed u. Let I(x′, t + 1) = I(x + u, t + 1), obtained by interpolation, considering that the values of u are not always integers.
   (b) Compute the time derivative I_t = I(x′, t + 1) − I(x, t) as the difference of the interpolated pixel intensities.
   (c) Update the speed u according to Eq. (6.74).

It is observed that during the iterative process the 1D spatial derivative I_x remains constant, while the speed u is refined in each iteration starting from a very approximate (or zero) initial value. With Newton's method (also known as the Newton–Raphson method), it is possible to generate a sequence of values of u, starting from a plausible initial value, that after a certain number of iterations converges to an approximation of the root of Eq. (6.74), in the hypothesis that I(x, t) is differentiable. Therefore, given the signal I(x(t), t) and an initial value u^{(k)}(x), calculated, for example, with the Lucas–Kanade approach, it is possible to obtain the next value of the speed, solution of (6.74), with the general iterative formula:

$$
u^{(k+1)}(x) \leftarrow u^{(k)}(x) - \frac{I_t(x)}{I_x(x)} \tag{6.75}
$$

where k indicates the iteration number. The iterative approach for the optical flow (u, v), extended to the 2D case, is implemented considering Eq. (6.13), which we know has the two unknowns u and v in a single equation, but is solvable with the simple Lucas–Kanade method described in Sect. 6.4.4. The 2D iterative procedure is as follows (a sketch is given after the list):


1. Compute the speeds (u, v) at each pixel of the image considering the adjacent images I_1 and I_2 (of a temporal sequence) using the Lucas–Kanade method, Eq. (6.49).
2. Transform (warp) the image I_1 toward the image I_2 with bilinear interpolation (to calculate the intensities at the subpixel level) using the previously calculated optical flow.
3. Repeat the previous steps until convergence. Convergence is reached when, applying step 2, the translation of the image at time t leads to its overlap with the image at time t + 1; it follows that the ratio I_t/I_x is null and the speed value is unchanged.
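A minimal sketch of this warp-and-refine loop, reusing the `lucas_kanade` sketch shown earlier; the use of `scipy.ndimage.map_coordinates` for the interpolation is one possible choice.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def refine_flow(I1, I2, u=None, v=None, n_iter=5, win=5):
    """Iteratively refine the dense flow (u, v) between frames I1 and I2,
    starting from an optional initial estimate (zeros by default)."""
    u = np.zeros(I1.shape) if u is None else u.copy()
    v = np.zeros(I1.shape) if v is None else v.copy()
    yy, xx = np.mgrid[0:I1.shape[0], 0:I1.shape[1]].astype(float)
    for _ in range(n_iter):
        # Warp I1 toward I2 using the current flow (bilinear interpolation)
        I1w = map_coordinates(I1, [yy - v, xx - u], order=1, mode='nearest')
        # Residual flow still separating the warped frame from I2
        du, dv = lucas_kanade(I1w, I2, win=win)
        u += du
        v += dv
    return u, v
```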

6.4.7.2 Optical Flow Estimation with a Multi-resolution Approach

The compromise of having large windows (to manage large displacements of moving objects) without violating the assumption of space–time coherence is achieved by implementing the optical flow algorithms on the image data of the temporal sequence organized with a pyramid structure [15,17]. In essence, the images are first convolved with a Gaussian filter and then subsampled by a factor of two, a technique known as coarse to fine (see Fig. 6.23). Therefore, the Gaussian pyramid (see Sect. 7.6.3 Vol. I) is first computed for the images I_t and I_{t+1} of the sequence. Then, the calculation of the optical flow starts by processing the low-resolution images of the pyramids first (toward the top of the pyramid). The coarse flow result is then passed as an initial estimate to repeat the process on the next level, with higher resolution images, thus producing more accurate values of the optical flow. Although in the low-resolution stages the optical flow is estimated on subsampled images with respect to the original, this guarantees the capture of small movements and therefore the validity of the optical flow Eq. (6.9). The essential steps of Lucas–Kanade's coarse-to-fine algorithm are the following (a pyramid sketch is given after the list):

1. Generate the Gaussian pyramids associated with the images I(x, y, t) and I(x, y, t + 1) (normally 3-level pyramids are used).
2. Compute the optical flow (u_0, v_0) with the simple Lucas–Kanade (LK) method at the highest level of the pyramid (coarse);
3. Reapply LK iteratively on the current images to update (correct) the flow, normally converging within 5 iterations; this becomes the initial flow.
4. At the ith level, perform the following steps:
   (a) Consider the speeds (u_{i−1}, v_{i−1}) of level i − 1;
   (b) Expand the flow matrices: generate a matrix (u_i*, v_i*) with double resolution through an interpolation (cubic, bilinear, Gaussian, ...) for the ith level;
   (c) Reapply LK iteratively to update the flow (u_i*, v_i*) with the correction (Δu_i, Δv_i), considering the images of the pyramid at the ith level, thus obtaining the updated flow: u_i = u_i* + Δu_i; v_i = v_i* + Δv_i.
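A minimal coarse-to-fine sketch under the same assumptions; `refine_flow` is the iterative routine sketched above, the Gaussian pyramid is built with `scipy.ndimage`, and the ×2 scaling of the flow when moving down a level reflects the doubled resolution.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(I, levels=3, sigma=1.0):
    """Images from full resolution down to the coarsest level."""
    pyr = [I.astype(float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma)[::2, ::2])
    return pyr

def coarse_to_fine_flow(I1, I2, levels=3):
    """Coarse-to-fine flow, reusing refine_flow from the previous sketch."""
    p1 = gaussian_pyramid(I1, levels)
    p2 = gaussian_pyramid(I2, levels)
    u = np.zeros(p1[-1].shape)
    v = np.zeros(p1[-1].shape)
    for lev in range(levels - 1, -1, -1):       # top of the pyramid first
        J1, J2 = p1[lev], p2[lev]
        if u.shape != J1.shape:
            # Expand the flow to the finer level and double its magnitude
            u = 2.0 * zoom(u, 2, order=1)[:J1.shape[0], :J1.shape[1]]
            v = 2.0 * zoom(v, 2, order=1)[:J1.shape[0], :J1.shape[1]]
        u, v = refine_flow(J1, J2, u, v)        # iterative LK refinement at this level
    return u, v
```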


Fig. 6.23 Estimation of the optical flow with a multi-resolution approach by generating two Gaussian pyramids relative to the two temporal images. The initial estimate of the flow is made starting from the coarse images (at the top of the pyramid) and this flow is propagated to the subsequent levels until it reaches the original image. With the coarse-to-fine approach, it is possible to handle large object movements without violating the assumptions of the optical flow equation

Figure 6.24 shows instead the results of the Lucas–Kanade method for the management of large movements, calculated on real images. In this case, the multi-resolution approach based on the Gaussian pyramid was used, iteratively refining the flow calculation at each of the three levels of the pyramid.

6.4.8 Motion Estimation by Alignment

In different applications, it is required to compare images and evaluate their level of similarity (for example, to compare images of a scene shot at different times or from slightly different points of view), or to verify the presence in an image of given parts (blobs), as happens in stereo vision when finding homologous elementary parts (patches). For example, in the 1970s, large satellite images became available, and the geometric registration of several images of the territory, acquired at different times and from slightly different observation positions, was strategic.


Fig. 6.24 Results of the optical flow calculated on real images with Lucas–Kanade’s method for managing large movements that involves an organization with Gaussian pyramid at three levels of the adjacent images of the temporal sequence and an iterative refinement at every level of the pyramid

The motion of the satellite caused complex geometric deformations in the images, requiring a process of geometric transformation and resampling of the images based on the knowledge of reference points of the territory (control points, landmarks). Of these points, windows (image samples) were available, to be searched for later in the images, useful for registration. The search for such sample patches of size n × n in the images to be aligned is done using the classic gray-level comparison methods, which minimize an error function based on the sum of squared differences (SSD), described in Sect. 5.8 Vol. II, on the sum of absolute differences (SAD), or on the normalized cross-correlation (NCC). The latter functionals, SAD and NCC, have been introduced in Sect. 4.7.2. In the preceding paragraphs, the motion field was calculated by considering the simple translational motion of each pixel of the image, processing (with the differential method) consecutive images of a sequence. The differential method estimates the motion of each pixel by generating a dense motion field map (optical flow); it does not perform a local search around the pixel being processed (it only calculates the local space–time gradient) and it only estimates limited motion (it works on high frame rate image sequences). With a multi-resolution implementation, large movements can be estimated, but with a high computational cost. The search for a patch model in the sequence of images can be formulated to estimate motion with large displacements that are not necessarily translational. Lucas and Kanade [12] have proposed a general method that aligns a portion of an image known as a sample image (template image) T(x) with respect to an input


image I(x), where x = (x, y) indicates the coordinates of the pixel being processed, on which the template is centered. This method can be applied to motion field estimation by considering a generic patch of size n × n from the image at time t and searching for it in the image at time t + Δt. The goal is to find the position of the template T(x) in the successive images of the temporal sequence (template tracking process). In this formulation, the search for the alignment of the template in the sequence of space–time varying images takes place considering the variability of the intensity of the pixels (due to noise or to changing acquisition conditions), and a motion model described by the function W(x; p), parameterizable in terms of the parameters p = (p_1, p_2, ..., p_m). Essentially, the motion (u(x), v(x)) of the template T(x) is estimated through the geometric transformation (warping transformation) W(x; p) that aligns a portion of the deformed image I(W(x; p)) with respect to T(x):

$$
I(W(\mathbf{x}; \mathbf{p})) \approx T(\mathbf{x}) \tag{6.76}
$$

Basically, the alignment (registration) of T(x) with a patch of I(x) occurs by deforming I(x) to match it with T(x). For example, for a simple translational motion the transformation function W(x; p) is given by

$$
W(\mathbf{x}; \mathbf{p}) = \begin{bmatrix} x + p_1 \\ y + p_2 \end{bmatrix} = \begin{bmatrix} x + u \\ y + v \end{bmatrix} \tag{6.77}
$$

where, in this case, the parameter vector p = (p_1, p_2) = (u, v) represents the optical flow. More complex motion models are described with 2D geometrical transformations W(x; p), already reported in Chap. 3 Vol. II, such as rigid Euclidean, similarity, affine, projective (homography). In general, an affine transformation is considered, which in homogeneous coordinates is given by

$$
W(\mathbf{x}; \mathbf{p}) = \begin{bmatrix} p_1 & p_3 & p_5 \\ p_2 & p_4 & p_6 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} p_1 x + p_3 y + p_5 \\ p_2 x + p_4 y + p_6 \end{bmatrix} \tag{6.78}
$$

The Lucas–Kanade algorithm solves the alignment problem (6.76) by minimizing the error function e_SSD, based on the SSD (Sum of Squared Differences), which compares the pixel intensities of the template T(x) with those of the transformed image I(W(x; p)) (prediction of the warped image), given by:

$$
e_{SSD} = \sum_{\mathbf{x} \in T} \big[ I(W(\mathbf{x}; \mathbf{p})) - T(\mathbf{x}) \big]^2 \tag{6.79}
$$

The minimization of this functional occurs by looking for the unknown vector p, calculating the pixel residuals of the template T(x) in a search region D × D in I. In the case of an affine motion model (6.78), the unknown parameters are 6. The minimization process assumes that the entire template is visible in the input image I and that the pixel intensities do not vary significantly. The error function (6.79) to be minimized is non-linear even if the deformation function W(x; p) is linear with


respect to p. This is because the intensity of the pixels varies regardless of their spatial position. The Lucas–Kanade algorithm uses the Gauss–Newton approximation method to minimize the error function through an iterative process. It assumes that an initial estimate of p is known and, through the iterative process, p is incremented by Δp so as to minimize the error function expressed in the following form (parametric model):

$$
\sum_{\mathbf{x} \in T} \big[ I(W(\mathbf{x}; \mathbf{p} + \Delta\mathbf{p})) - T(\mathbf{x}) \big]^2 \tag{6.80}
$$

(6.81)

until they converge, checking that the norm of the p is below a certain threshold value ε. Since the algorithm is based on the nonlinear optimization of a quadratic function with the gradient descent method, (6.80) is linearized approximating I (W (x; p + p)) with his Taylor series expansion up to the first order, obtaining ∂W p (6.82) I (W (x; p + p)) ≈ I (W (x; p)) + ∇ I ∂p where ∇ I = (I x , I y ) is the spatial gradient of the input image I locally evaluated according to W (x; p), that is, ∇ I is calculated in the coordinate reference system of I and then transformed into the coordinates of the template image T using the W is the current estimate of the parameters p of the deformation W (x; p). The term ∂∂p Jacobian of the function W (x; p) given by: ⎛ ∂W ∂W ⎞ x x x · · · ∂∂W ∂ p ∂ p p m 1 2 ∂W ⎜ ⎟ =⎝ (6.83) ⎠ ∂p ∂ Wy ∂ Wy ∂ Wy ∂ p1 ∂ p2 · · · ∂ pm having considered W (x; p) = (Wx (x; p), W y (x; p)). In the case of affine deformation (see Eq. 6.78), the Jacobian results:  ∂W x 0 y010 = (6.84) 0x 0y01 ∂p If instead  deformation is of pure translation, the Jacobian corresponds to the  the matrix 01 01 . Replacing (6.82) in the error function (6.80) we get e SS D =

 x∈T

2

  ∂W p − T (x) I (W x; p) + ∇ I ∂p

(6.85)

530

6 Motion Analysis

The latter function is minimized as a least squares problem. Therefore, differentiating it with respect to the unknown vector p and equaling to zero we obtain  ∂ W T   ∂W ∂e SS D ∇I I (W x; p) + ∇ I =2 p − T (x) = 0 ∂p ∂p ∂p

(6.86)

x∈T

W where ∇ I ∂∂p means the Steepest Descent term. Solving with respect to the unknown p, the function (6.80) is minimized, with the following correction parameter:

p = H −1

 ∂ W T    ∇I T (x) − I (W x; p) ∂p x∈T   

(6.87)

Term of steepest descend with error

where H=

 x∈T

∂W ∇I ∂p

T ∂W ∇I ∂p

(6.88)

is the Hessian matrix (of second derivatives, of size m × m) of the deformed image I (W x; p). (6.88) is justified by remembering that the Hessian matrix actually represents the Jacobian of the gradient (concisely H = J · ∇). Equation (6.87) represents the expression that calculates the increment of p to update p and through the cycle of predict–correct to converge toward a minimum of the error function (6.80). Returning to the Lucas–Kanade algorithm, the iterative procedure expects to apply Eqs. (6.80) and (6.81) in each step. The essential steps of the Lucas–Kanade alignment algorithm are   1. Compute I W (x, p) transforming I (x)  with the matrix W (x, p); 2. Compute the similarity value (error): I W (x, p) − T (x); 3. Compute war ped gradients ∇ I = (I x , I y ) with the transformation W (x, p); W at (x, p); 4. Evaluate the Jacobian of the warping ∂∂p 5. 6. 7. 8. 9. 10.

W ; Compute steepest descent: ∇ I ∂∂p Compute the Hessian matrix with Eq. (6.88); Multi ply steepest descend with error, indicated in Eq. (6.87); Compute p using Eq. (6.87); U pdate the parameters of the motion model: p ← p + p; Repeat until p < ε

Figure 6.25 shows the functional scheme of the Lucas–Kanade algorithm. In summary, the described algorithm is based on the iterative approach prediction– correction. The pr ediction  consists of the calculation of the transformed input image (deformed) I W (x, p) starting from an initial estimate of the parameter vector p once the parametric motion model has been defined (translation, rigid, affine,

6.4 Optical Flow Estimation





Fig. 6.25 Functional scheme of the reported Lucas–Kanade alignment algorithm (reported by Baker [18]). With step 1 the input image I is deformed with the current estimate of W (x; p) (thus calculating the prediction) and the result obtained is subtracted from the template T (x) (step 2) thus obtaining the error function T (x) − I (W (x; p)) between prediction and template. The rest of the steps are described in the text

homography, ...). The correction vector Δp is calculated as a function of the error given by the difference between the template image T(x) and the deformed input image I(W(x, p)). The convergence of the algorithm also depends on the magnitude of the initial error, especially if one does not have a good initial estimate of p. The deformation function W(x, p) must be differentiable with respect to p, considering that the Jacobian matrix ∂W/∂p must be computed. The convergence of the algorithm is strongly influenced by the inverse of the Hessian matrix H^{-1}, which can be analyzed through its eigenvalues.



For example, in the case of pure translation, with the Jacobian given by the identity matrix, the Hessian matrix results in:

$$
H = \sum_{\mathbf{x} \in T} \Big[ \nabla I \,\frac{\partial W}{\partial \mathbf{p}} \Big]^T \Big[ \nabla I \,\frac{\partial W}{\partial \mathbf{p}} \Big]
  = \sum_{\mathbf{x} \in T} \begin{bmatrix} I_x \\ I_y \end{bmatrix} \begin{bmatrix} I_x & I_y \end{bmatrix}
  = \sum_{\mathbf{x} \in T} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
$$


the


Therefore, the Hessian matrix thus obtained for the pure translational motion corresponds to that of the Harris corner detector already analyzed in Sect. 6.5 Vol. II. It follows that, from the analysis of the eigenvalues (both high values) of H it can be verified if the template is a good patch to search for in the images of the sequence in the translation context. In applications where the sequence of images has large movements the algorithm can be implemented using a multi-resolution data structure, for example, using the coarse to fine approach with pyramidal structure of the images. It is also sensitive if the actual motion model is very different from the one predicted and if the lighting conditions are modified. Any occlusions become a problem for convergence. A possible mitigation of these problems can be achieved by updating the template image. From a computational point of view, the complexity is O(m 2 N + m 3 ) where m is the number of parameters and N is the number of pixels in the template image. In [18], the details of the computational load related to each of the first 9 steps above of the algorithm are reported. In [19], there are some tricks (processing of the input image in elementary blocks) to reduce the computational load due in particular to the calculation of the Hessian matrix and the accumulation of residuals. Later, as an alternative to the Lucas–Kanade algorithm, other equivalent methods have been developed to minimize the error function (6.79).

6.4.8.1 Compositional Image Alignment Algorithm

In [19] a different approximation method is proposed, called the compositional algorithm, which modifies the error function to be minimized (6.80) into the following:

$$
\sum_{\mathbf{x} \in T} \big[ I\big( W(W(\mathbf{x}; \Delta\mathbf{p}); \mathbf{p}) \big) - T(\mathbf{x}) \big]^2 \tag{6.89}
$$

First it is minimized with respect to p, in each iteration, and then the deformation estimate is updated as follows: W (x; p) ← W (x; p) ◦ W (x; p) ≡ W (W (x; p); p)

(6.90)

6.4 Optical Flow Estimation

533

In this expression, the “◦” symbol actually indicates a simple linear combination of the parameters of W (x; p) and W (x; p), and the final form is rewritten as the compositional deformation W (W (•)). The substantial difference of the compositional algorithm with respect to the Lucas–Kanade algorithm is represented by the iterative incremental deformation W (x; p) rather than the additive updating of the parameters p. In essence, Eqs. (6.80) and (6.81) of the original method are replaced with (6.89) and (6.90) of the compositional method. In other words, this variant in the iterative approximation involves updating W (x; p) through the two compositional deformations given with the (6.90). The compositional algorithm involves the following steps: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

  Compute I W (x, p) transforming I (x)  with the matrix W (x, p); Compute the similarity value (error): I W (x, p)  − T (x);  Compute the gradient ∇ I (W ) of the image I W (x, p) ; W at (x, 0). This step is only performed in the beginning Evaluate the Jacobian ∂∂p by pre-calculating the Jacobian at (x, 0) which remains constant. W ; Compute steepest descent: ∇ I ∂∂p Compute the Hessian matrix with Eq. (6.88); Multi ply steepest descend with error, indicated in Eq. (6.87); Compute p using Eq. (6.87); Update the parameters of the motion model with Eq. (6.90); Ripeti finché p < ε

Basically, the same procedure as in Lucas–Kanade does, except for the  steps shown  in bold, that is, step 3 where the gradient of the image is calculated I W (x, p) , step 4 which is executed at the beginning out of the iterative process, initially calculating the Jacobian at (x, 0), and step 9 where the deformation W (x, p) is updated with the new Eq. (6.90). This new approach is more suitable for more complex motion models such as the homography one, where the Jacobian calculation is simplified even if the computational load is equivalent to that of the Lucas–Kanade algorithm.

6.4.8.2 Inverse Compositional Image Alignment Algorithm

In [18], another variant of the Lucas–Kanade algorithm is described, known as the inverse compositional image alignment algorithm, which substantially reverses the roles of the input image and of the template. Rather than updating the additive estimate of the parameters p of the deformation W, this algorithm solves iteratively for the inverse incremental deformation W(x; Δp)^{-1}. Therefore, the alignment consists in moving (deforming) the template image to minimize the difference between the template and the image. The author demonstrates the equivalence of this algorithm with the compositional algorithm. With the inverse compositional

534

6 Motion Analysis

algorithm, the function to be minimized, with respect to p, results:     2 T W (x; p) − I W (x; p)

(6.91)

x∈T

where, as you can see, the role of the I and T images is reversed. In this case, as suggested by the name, the minimization problem of (6.91) is solved by updating the estimated current deformation W (x; p) with the inverted incremental deformation W (x; p)−1 , given by W (x; p) ← W (x; p) ◦ W (x; p)−1

(6.92)

To better highlight the variants of the algorithms, that of Lucas–Kanade is indicated as the forward additive algorithm. The compositional one expressed by (6.90) is referred to as the compositional forward algorithm which differs by virtue of (6.92) due to the fact that the incremental deformation W (x; p) is inverted before being composed with the current estimate of W (x; p). Therefore, updating the deformation W (x; p) instead of p makes the inverse compositional algorithm more suitable for any type of deformation. With the expansion of the function (6.91) in Taylor series until the first order is obtained:     2  ∂W p − I W (x; p) (6.93) T W (x; 0) + ∇T ∂p x∈T

If we assume W (x; 0) equals the identity deformation (without loss of generality), that is, W (x; 0) = x, the solution to the least squares problem is given by p = H

−1

T    ∂W   ∇T I W (x; p) − T (x) ∂p

(6.94)

x∈T

where H is the Hessian matrix, but with I replaced with T , given by H=

 x∈T

∇T

∂W ∂p

T ∂W ∇T ∂p

(6.95)

W with the Jacobian ∂∂p valued at (x; 0). Since the Hessian matrix is independent of the parameter vector p, it remains constant during the iterative process. Therefore, instead of calculating the Hessian matrix in each iteration, as happens with the Lucas–Kanade algorithm and the forward compositional, it is possible to pre-calculate the Hessian matrix before starting the iterative process with the advantage of considerably reducing the computational load. The inverse compositional algorithm is implemented in two phases:

6.4 Optical Flow Estimation

535

Initial pre-calculation steps: 3. 4. 5. 6.

Compute the gradient ∇T of the template T (x); W at (x; 0); Compute the Jacobian ∂∂p W ; Compute steepest descent term ∇T ∂∂p Compute the Hessian matrix using Eq. (6.95);

Steps in the iterative process: 1. 2. 7–8.

9. 10.

  Compute I W (x, p) transforming I (x) with thematrix W (x, p); Compute the value of similarity (errore image): I W (x, p) − T (x); Compute the incremental value of the motion model parameters Eq. (6.94);

p using

Update the current deformation W (x; p) ← W (x; p) ◦ W (x; p)−1 Repeat until p < ε

The substantial difference between the forward compositional algorithm and the inverse compositional algorithm concerns: the calculation of the similarity value (step 1) having exchanged the role between input image and template; steps 3, 5, and 6 calculating the gradient of T instead of the gradient of I , with the addition of being able to precompute it out of the iterative process; the calculation of p which is done with (6.94) instead of (6.87); and finally, step 9, where the incremental deformation W (x; p) is reversed before being composed with the current estimate. Regarding the computational load, the inverse computational algorithm requires a computational complexity of O(m 2 N ) for the initial pre-calculation steps (executed only once), where m is the number of parameters and N is the number of pixels in the template image T . For the steps of the iterative process, a computational complexity of O(m N + m 3 ) is required for each iteration. Essentially, compared to the Lucas–Kanade and compositional algorithms, we have a computational load saving of O(m N + m 2 ) for each iteration. In particular, the greatest computation times are required for the calculation of the Hessian matrix (step 6) although it is done only once while keeping in memory the data of the matrix H and of the images W . of the steepest descent ∇T ∂∂p

6.4.9 Motion Estimation with Techniques Based on Interest Points

In the previous sections, we estimated the optical flow considering sequences of images with a very small time interval between two consecutive images, for example, images taken with a standard 50 Hz camera, with a 1/25 s interval, or with a much higher frame rate, even of hundreds of images per second. In some applications, the dynamics of the scene involves longer intervals and, in this case, the calculation of

536

6 Motion Analysis

the motion field can be done using techniques based on the identification, in the images of the sequence in question, of some significant structures (Points Of Interest (POI)). In other words, the motion estimation is calculated by first identifying the homologous points of interest in the consecutive images (with the correspondence problem analogous to that of the stereo vision) and measuring the disparity value between the homologous points of interest. With this method, the resulting speed map is a scattered speed map, unlike previous methods that generated a dense speed map. To determine the dynamics of the scene from the sequence of time-varying images, the following steps must be performed: 1. De f ine a method of identifying points of interest (see Chap. 6 Vol. II) to be searched in consecutive images; 2. Calculate the Significant Points of Interest (SPI); 3. Find the homologous SPI points in the two consecutive images of the time sequence; 4. Calculate the disparity values (relative displacement in the image) for each point of interest and estimate the speed components; 5. T racking the points of interest identified in step 3 to determine the dynamics of the scene in a more robust way. For images with n pixel, the computational complexity to search for points of interest in the two consecutive images is O(n 2 ). To simplify the calculation of these points, we normally consider windows of minimum 3 × 3 with high brightness variance. In essence, the points of interest that emerge in the two images are those with high variance that normally are found in correspondence of corners, edges, and in general in areas with strong discontinuity of brightness. The search for points of interest (step 2) and the search for homologous (step 3) between consecutive images of the time sequence is carried out using the appropriate methods (Moravec, Harris, Tomasi, Lowe, . . .) described in Chap. 6 Vol. II. In particular, the methods for finding homologous points of interest have also been described in Sect. 4.7.2 in the context of stereo vision for the problem of the correspondence between stereo images. The best known algorithm in the literature for the tracking of points of interest is the KLT (Kanade-Lucas–Tomasi), which integrates the Lucas– Kanade method for the calculation of the optical flow, the Tomasi–Shi method for the detection of points of interest, and the Kanade–Tomasi method for the ability to tracking points of interest in a sequence of time-varying images. In Sect. 6.4.8, we have already described the method of aligning a patch of the image in the temporal sequence of images. The essential steps of the KLT algorithm are 1. Find the POIs in the first image of the sequence with one of the methods above that satisfy min(λ1 , λ2 ) > λ (default threshold value); 2. For each POI, apply a motion model (translation, affine, . . .) to calculate the displacement of these points in the next image of the sequence. For example, alignment algorithms based on the Lucas–Kanade method can be used;

6.4 Optical Flow Estimation

537

3. Keep track of the motion vectors of these POIs in the sequence images; 4. Optionally, it may be useful to activate the POI detector (step 1) to add more to follow. Step to execute every m processed images of the sequence (for example every 10–20 images); 5. Repeat steps 2.3 and optionally step 4; 6. KLT returns the vectors that track the points of interest found in the image sequence. The KLT algorithm would automatically track the points of interest in the images of the sequence compatibly with the robustness of the detection algorithms of the points and the reliability of the tracking influenced by the variability of the contrast of the images, by the noise, by the lighting conditions that must vary little and above all from the motion model. In fact, if the motion model (for example, of translation or affine) changes a lot with objects that move many pixels in the images or change of scale, there will be problems of tracking with points of interest that may appear partially and not more detectable. It may happen that in the phase of tracking a point of interest detected has identical characteristics but belongs to different objects.

6.4.9.1 Tracking Using the Similarity of Interest Points In application contexts, where the lighting conditions can change considerably together with the geometric variations, due not only to the rotation and change of scale, but also to the affine distortion, can be useful a tracking based on the similarity of points of interest with robust characteristics of geometric and radiometric invariance. In these cases, the KLT tracking algorithm, described above, is modified in the first step to detect invariant points of interest with respect to the expected distortions and lighting variations, and in the second phase it is evaluated the similarity of the points of interest found by the first image to successive images of the sequence. In Sect. 6.7.1 Vol. II, the characteristics of the points of interest SIFT (proposed by Lowe [20]) detected through an approach coarse to fine (based on Gaussian pyramid) for the invariance to change of scale, while the DoG (Difference of Gaussian) pyramid is used for the locating of the points of interest at the various scales. The orientation of the POIs is calculated through the locally evaluated gradient and finally for each POI there is a descriptor that captures the local radiometric information. Once the points of interest have been detected, it is necessary to find the correspondence of the points in the consecutive images of the sequence. This is achieved by evaluating a similarity measure between the characteristics of the corresponding points with the foresight that must be significant with respect to the invariance properties. This similarity measure is obtained by calculating the sum of the squared differences (SSD) of the characteristics of the points of interest found in the consecutive images and considering as the corresponding candidate points those with an SSD value lower than a predefined threshold. In Fig. 6.26, the significant POIs of a sequence of 5 images are highlighted with an asterisk, detected with the SIFT algorithm. The first line shows (with an asterisk) all the POIs found in 5 consecutive images. The significant rotation motion toward

538

6 Motion Analysis

Fig. 6.26 Points of interest detected with Lowe’s SIFT algorithm for a sequence of 5 images captured by a mobile vehicle

(a)

(b)

(c)

Fig. 6.27 Results of the correspondence of the points of interest in Fig. 6.26. a Correspondence of the points of interest relative to the first two images of the sequence calculated with Harris algorithm. b As in a but related to the correspondence of the points of interest SIFT. c Report the tracking of the SIFT points for the entire sequence. We observe the correct correspondence of the points (the trajectories do not intersect each other) invariant to rotation, scale, and brightness variation

the right side of the corridor is highlighted. In the second line, only the POIs found corresponding between two consecutive images are shown, useful for motion detection. Given the real navigation context, tracking points of interest must be invariant to the variation of lighting conditions, rotation, and, above all, scale change. In fact, the mobile vehicle during the tracking of the corresponding points captures images of the scene where the lighting conditions vary considerably (we observe reflections with specular areas) and between one image and another consecutive image of the sequence the points of interest can be rotated and with a different scale. The figure also shows that some points of interest present in the previous image are no longer visible in the next image, while the latter contains new points of interest not visible in the previous image due to the dynamics of the scene [21]. Figure 6.27 shows the results of the correspondence of the points of interest SIFT of Fig. 6.26 and the correspondence of the points of interest calculated with Har-

6.4 Optical Flow Estimation

539

ris algorithm. While for SIFT points, the similarity is calculated using the SIFT descri ptor s for Harris corners the similarity measurement is calculated with the SSD considering a square window centered on the position of the corresponding corners located in the two consecutive images. In figure (a), the correspondences found in the first two images of the sequence are reported, detected with the corners of Harris. We observe the correct correspondence (those relating to nonintersecting lines) for corners that are translated or slightly rotated (invariance to translation) while for scaled corners the correspondence is incorrect because they are non-invariant to the change of scale. Figure (b) is the analogous of (a) but the correspondences of the points of interest SIFT are reported, that being also invariant to the change of scale, they are all correct correspondences (zero intersections). Finally, in figure (c) the tracking of the SIFT points for the whole sequence is shown. We observe the correct correspondence of the points (the trajectories do not intersect each other) being invariant with respect to rotation, scale, and brightness variation. There are different methods for finding the optimal correspondence between a set of points of interest, considering also the possibility that for the dynamics of the scene some points of interest in the following image may not be present. To eliminate possible false correspondences and to reduce the times of the computational process of the correspondence, constraints can be considered in analogy to what happens for stereo vision, where the constraints of the epipolarity are imposed or knowing the kinematics of the objects in motion it is possible to predict the position of points of interest. For example, the correspondences of significant point pairs can be considered to make the correspondence process more robust, placing constraints on the basis of a priori knowledge that can be had on the dynamics of the scene.

6.4.9.2 Tracking Using the Similarity Between Graphs of Points of Interest Robust correspondence approaches are based on organizing with a graph the set of points of interest P O I1 of the first image and comparing all the possible graphs generated by the points of interest P O I2 of the consecutive image. In this case, the correspondence problem would be reduced to the problem of the comparison between graphs which is called isomor phism between graphs which is a difficult problem to solve, especially when applied with noisy data. A simplification is obtained by considering graphs consisting of only two nodes that represent two points of interest in an image where, in addition to their similarity characteristics, topological information (for example the distance) of points of interest is also considered. The set of potential correspondences forms a bipartite graph, and the correspondence problem consists in choosing a particular coverage of this graph. Initially, each node can be considered as the correspondent of every other node of the other partition (joining with a straight line these correspondences intersect). Using some similarity criteria, the goal of the correspondence problem is to remove all connections of the entire bipartite graph except one for each node (where there would be no intersections). Considering the dynamics of the scene for some nodes, no correspondence

540

6 Motion Analysis

would be found (also because they are no longer visible) and are excluded for motion analysis.

6.4.9.3 Tracking Based on the Probabilistic Correspondence of Points of Interest

An iterative approach to solve the correspondence problem is the one proposed by Barnard and Thompson [22], which is based on estimating the probability of the correspondence of points in two consecutive images, with the constraint that the homologous point in the other image must lie within a distance defined by the dynamics of the scene. Basically, assuming a maximum speed for the motion of the points of interest, the number of possible matches decreases, and a correspondence probability value is estimated for each potential pair of corresponding points. These probabilities are computed iteratively to obtain an optimal global probability measure calculated over all possible pairs of points whose correspondence is to be determined. The process ends when the correspondence of each point of interest in the first image with a point of interest in the next image of the sequence is found, with the constraint that the global probability of all the correspondences is significantly higher than that of any other possible set of correspondences, or is greater than a predefined threshold value. We now describe this algorithm.

In analogy with the stereo approach, we consider two images acquired at times $t$ and $t + \Delta t$, with $\Delta t$ very small. We assume that a set of points of interest $A_1 = \{x_1, x_2, \ldots, x_m\}$ has been determined in the first image at time $t$, and a set of points of interest $A_2 = \{y_1, y_2, \ldots, y_n\}$ in the second image at time $t + \Delta t$. If the generic point of interest $x_j$ is consistent with the corresponding point $y_k$, that is, they represent a potential correspondence with the best similarity, a velocity vector $c_{jk}$ is associated with this potential correspondence, with an initial probability value based on their similarity. With this definition, if the point of interest $x_j$ moves in the image plane with velocity $v$, it will correspond in the second image to the point $y_k$ given by

$y_k = x_j + v\,\Delta t = x_j + c_{jk}$   (6.96)

where the vector $c_{jk}$ can be seen as the vector connecting the points of interest $x_j$ and $y_k$. This pair of homologous points is a good correspondence if the following condition is satisfied:

$|x_j - y_k| \le c_{max}$   (6.97)

where $c_{max}$ indicates the maximum displacement (disparity) of $x_j$, in the time interval $\Delta t$, with respect to the homologous point $y_k$ found in the next image. Two pairs of homologous points $(x_j, y_k)$ and $(x_p, y_q)$ are declared consistent if they satisfy the following condition:

$|c_{jk} - c_{pq}| \le cost$

where $cost$ is an experimentally determined constant based on the dynamics of the scene, known a priori. Initially, an estimate of the correspondence of potential homologous points of interest can be calculated based on their similarity measure. Each point of interest $x_j$, in the image being processed, and $y_k$, in the next image, identifies a window $W$ of appropriate size ($3 \times 3$, $5 \times 5$, ...). A similarity measure between potential homologous points can be the sum of the squares of the gray-level differences over the considered window. Indicating with $S_{jk}$ this similarity estimate between the points of interest $x_j$ and $y_k$, we get

$S_{jk} = \sum_{(s,t) \in W} [I_1(s,t) - I_2(s,t)]^2$   (6.98)

where $I_1(s,t)$ and $I_2(s,t)$ indicate the pixels of the windows $W_j$ and $W_k$, centered, respectively, on the points of interest $x_j$ and $y_k$ in the corresponding images. An initial value of the estimate of the correspondence of a generic pair of points of interest $(x_j, y_k)$, expressed in terms of the probability $P_{jk}^{(0)}$ of the potential homologous points $x_j$ and $y_k$, is calculated, according to the similarity measure given by (6.98), as follows:

$P_{jk}^{(0)} = \frac{1}{1 + \alpha S_{jk}}$   (6.99)

where $\alpha$ is a positive constant. The probability $P_{jk}^{(0)}$ is determined by considering the fact that a certain number of POIs have a good similarity, excluding those that are inconsistent in terms of similarity, for which a probability value equal to $1 - \max(P_{jk}^{(0)})$ can be assumed. The probabilities of the various possible matches are given by

$P_{jk} = \frac{P_{jk}^{(0)}}{\sum_{t=1}^{n} P_{jt}^{(0)}}$   (6.100)

where $P_{jk}$ can be considered as the conditional probability that $x_j$ has the homologous point $y_k$, normalized over the sum of the probabilities of all the other potential points $\{y_1, y_2, \ldots, y_n\}$, excluding those found with inconsistent similarity. The essential steps of the complete algorithm for computing the optical flow between two consecutive images of the sequence are the following:

1. Calculate the sets of points of interest $A_1$ and $A_2$, respectively, in the two consecutive images $I_t$ and $I_{t+\Delta t}$.
2. Organize a data structure among the potential points of correspondence for each point $x_j \in A_1$ with points $y_k \in A_2$:

   $\{x_j \in A_1, (c_{j1}, P_{j1}), (c_{j2}, P_{j2}), \ldots, (\beta, \gamma)\} \qquad j = 1, 2, \ldots, m$

   where $P_{jk}$ is the probability of matching points $x_j$ and $y_k$, while $\beta$ and $\gamma$ are special symbols that indicate the absence of a potential correspondence.
3. Initialize the probabilities $P_{jk}^{(0)}$ with Eq. (6.99) based on the similarity of the POIs given by Eq. (6.98), once the appropriate size of the window $W$ has been chosen in relation to the dynamics of the scene.
4. Iteratively calculate the probability of matching of a point $x_j$ with all the potential points $(y_k,\ k = 1, \ldots, n)$ as the weighted sum of all the probabilities of correspondence for all consistent pairs $(x_p, y_q)$, where the points $x_p$ are in the vicinity of $x_j$ (while the points $y_q$ are in the vicinity of $y_k$) and the consistency of $(x_p, y_q)$ is evaluated with respect to the pair $(x_j, y_k)$. A quality measure $Q_{jk}$ of the matching pair is given by

   $Q_{jk}^{(s-1)} = \sum_{p} \sum_{q} P_{pq}^{(s-1)}$   (6.101)

   where $s$ is the iteration step, $p$ refers to all points $x_p$ that are in the vicinity of the point of interest $x_j$ being processed, and the index $q$ refers to all points $y_q \in A_2$ that form pairs $(x_p, y_q)$ consistent with the pair $(x_j, y_k)$ (points that are not consistent or with probability below a certain threshold are excluded).
5. Update the correspondence probabilities for each pair $(x_j, y_k)$ as follows:

   $\hat{P}_{jk}^{(s)} = P_{jk}^{(s-1)} \big( a + b\, Q_{jk}^{(s-1)} \big)$

   with $a$ and $b$ predefined constants. The probability $\hat{P}_{jk}^{(s)}$ is then normalized with the following:

   $P_{jk}^{(s)} = \frac{\hat{P}_{jk}^{(s)}}{\sum_{t=1}^{n} \hat{P}_{jt}^{(s)}}$   (6.102)

6. Iterate steps (4) and (5) until the best match $(x_j, y_k)$ is found for all the examined points $x_j \in A_1$.
7. The vectors $c_{jk}$ constitute the velocity field of the analyzed motion.

The selection of the constants $a$ and $b$ conditions the convergence speed of the algorithm, which normally converges after a few iterations. The algorithm can be sped up by eliminating correspondences with initial probability values $P_{jk}^{(0)}$ that are below a certain threshold.
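To make the iteration above concrete, the following sketch (an illustrative reimplementation, not code from the book) applies steps 3-6 to precomputed interest points and image patches; the neighborhood radius, consistency tolerance, and the constants a and b are assumed values.

```python
# Illustrative sketch of the Barnard-Thompson relaxation (assumed parameters).
import numpy as np

def barnard_thompson(pts1, pts2, patches1, patches2, c_max=20.0, alpha=1e-3,
                     a=0.3, b=3.0, neigh=15.0, cons=2.0, n_iter=10):
    m, n = len(pts1), len(pts2)
    disp = pts2[None, :, :] - pts1[:, None, :]                    # candidate vectors c_jk
    admissible = np.linalg.norm(disp, axis=2) <= c_max            # Eq. (6.97)
    S = ((patches1[:, None] - patches2[None, :])**2).sum(axis=(2, 3))   # Eq. (6.98)
    P = np.where(admissible, 1.0 / (1.0 + alpha * S), 0.0)        # Eq. (6.99)
    P /= P.sum(axis=1, keepdims=True) + 1e-12                     # Eq. (6.100)
    near = np.linalg.norm(pts1[:, None, :] - pts1[None, :, :], axis=2) <= neigh
    for _ in range(n_iter):
        Q = np.zeros_like(P)
        for j in range(m):
            for k in range(n):
                if not admissible[j, k]:
                    continue
                # neighbors x_p of x_j whose candidate y_q has a displacement
                # consistent with c_jk (|c_jk - c_pq| <= cons)
                p_idx = np.where(near[j])[0]
                consistent = np.linalg.norm(disp[p_idx] - disp[j, k], axis=2) <= cons
                Q[j, k] = (P[p_idx] * consistent).sum()           # Eq. (6.101)
        P = P * (a + b * Q)                                       # relaxation update (step 5)
        P /= P.sum(axis=1, keepdims=True) + 1e-12                 # Eq. (6.102)
    best = P.argmax(axis=1)
    return [(j, k, disp[j, k]) for j, k in enumerate(best) if admissible[j, k]]
```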

6.4.10 Tracking Based on the Object Dynamics—Kalman Filter

When we know the dynamics of a moving object, instead of independently determining (in each image of the sequence) the position of the points of interest that characterize it, it may be more effective to set up the tracking of the object considering the knowledge of its dynamics. In this case, the object tracking is simplified by calculating the position of its center of mass, independently in each image of the sequence, also based on a prediction estimate of the expected position of the object. In some real applications the dynamics of objects is known a priori. For example, pedestrians, or the entities of a sporting event (players, ball, ...), are normally filmed with cameras having a frame rate appropriate to the intrinsic dynamics of these entities. For example, tracking the ball (in football or tennis) would initially require locating it in an image of the sequence by analyzing the image entirely, but in the subsequent images of the sequence its localization can be simplified by the knowledge of the motion model, which predicts its current position. This tracking strategy reduces the search time and improves, with the prediction of the dynamics, the estimate of the position of the object, normally influenced by noise. In object tracking, the camera is stationary and the object must be visible (otherwise the tracking procedure must be reinitialized to search for the object in an image of the sequence, as we will see later), and in the sequence acquisition phase the object-camera geometric configuration must not change significantly (normally the object moves with lateral motion with respect to the optical axis of the stationary camera).

Fig. 6.28 Tracking of the ball by detecting its dynamics using the information of its displacement calculated in each image of the sequence knowing the camera's frame rate. In the last image on the right, the Goal event is displayed

Figure 6.28 shows the tracking context of the ball in the game of football to detect the Goal-NoGoal event.⁷ The camera continuously acquires sequences of images, and as soon as the ball appears in the scene a ball detection process locates it in an image of the sequence and a phase of ball tracking begins in the subsequent images. Using the model of the expected movement, it is possible to predict where the ball is located in the next image. In this context, the Kalman filter can be used, considering that the dynamics of the event carries uncertain information (the ball can be deviated), and it is possible to predict the next state of the ball. Although in reality external elements interfere with the predicted motion model, with the Kalman filter one is often able to understand what happened. The Kalman filter is ideal for continuously changing dynamics. A Kalman filter is an optimal estimator, i.e., it estimates parameters of interest from indirect, inaccurate, and uncertain observations. It operates recursively, evaluating the next state based on the previous state, and does not need to keep the historical data of the event dynamics. It is, therefore, suitable for real-time implementations, and therefore strategic for the tracking of high-speed objects. It is an optimal estimator in the sense that, if the noise of the data of the problem is Gaussian, the Kalman filter minimizes the mean square error of the estimated parameters. If the noise were not Gaussian (that is, for the data noise only the mean and standard deviation are known), the Kalman filter is still the best linear estimator, but nonlinear estimators could be better. The word filter must not be associated with the most common meaning of removing frequency components of a signal, but must be understood as the process that finds the best estimate from noisy data, that is, that filters (attenuates) the noise.

⁷ The Goal-NoGoal event, according to FIFA regulations, occurs when the ball entirely passes the ideal vertical plane parallel to the goal, passing through the inner edge of the horizontal white line separating the playing field from the inner area of the goal itself.

Now let us see how the Kalman filter is formulated [8,9]. We need to define the state of a deterministic discrete dynamic system, described by a vector with the smallest possible number of components, which completely synthesizes the past of the system. The knowledge of the state theoretically allows predicting the dynamics and the future (and previous) states of the deterministic system in the absence of noise. In the context of ball tracking, the state of the system could be described by the vector $x = (p, v)$, where $p = (x, y)$ and $v = (v_x, v_y)$ indicate the position of the center of mass of the ball in the images and the ball velocity, respectively. The dynamics of the ball is simplified by assuming constant velocity during the tracking and neglecting the effect of gravity on the motion of the ball. This velocity is initially estimated by knowing the camera's frame rate and evaluating the displacement of the ball in a few images of the sequence, as soon as the ball appears in the field of view of the camera (see Fig. 6.28). But nothing is known about unforeseeable external events (such as wind and player deviations) that can change the motion of the ball. Therefore, the next state is not determined with certainty, and the Kalman filter assumes that the state variables $p$ and $v$ may vary randomly with a Gaussian distribution characterized by the mean $\mu$ and the variance $\sigma^2$, which represents the uncertainty. If the prediction is maintained, knowing the previous state, we can estimate where the ball would be in the next image, given by:

$p_t = p_{t-1} + \Delta t\, v$

where $t$ indicates the current state associated with the $t$th image of the sequence, $t-1$ indicates the previous state, and $\Delta t$ indicates the time elapsed between two adjacent images, defined by the camera's frame rate. In this case, the two quantities, position and velocity of the ball, are correlated. Therefore, in every time interval, the state changes from $x_{t-1}$ to $x_t$ according to the prediction model and according to the new observed measures $z_t$, evaluated independently of the prediction model. In this context, the observed measure $z_t = (x, y)$ indicates the position of the center of mass of the ball in the $t$th image of the sequence. We can thus evaluate a new measure of the tracking state (i.e., estimate the new position and velocity of the ball by processing the $t$th image of the sequence directly). At this point, with the Kalman filter one is able to estimate the current state $\hat{x}_t$ by filtering out the uncertainties (generated by the measurements $z_t$ and/or by the prediction model $x_t$) optimally with the following equation:

$\hat{x}_t = K_t \cdot z_t + (1 - K_t)\,\hat{x}_{t-1}$   (6.103)

where $K_t$ is a coefficient called Kalman Gain and $t$ indicates the current state of the system. In this case, $t$ is used as an index of the images of the sequence, but in substance it has the meaning of discretizing time, such that $t = 1, 2, \ldots$ indicates $t \cdot \Delta t$ ms (milliseconds), where $\Delta t$ is the constant time interval between two successive images of the sequence (defined by the frame rate of the camera). From (6.103), we observe that with the Kalman filter the objective is to estimate optimally the state at time $t$, filtering through the coefficient $K_t$ (which is the unknown of the equation) the intrinsic uncertainties deriving from the prediction estimated in the state $\hat{x}_{t-1}$ and from the new observed measures $z_t$. In other words, the Kalman filter behaves like a data fusion process (prediction and observation), optimally filtering the noise of that data. The key to this optimal process reduces to the calculation of $K_t$ in each process state. Let us now analyze how the Kalman filter mechanism, useful for tracking, realizes this optimal fusion between the state prediction of the system, the observed measures, and the correction proposed by the Kalman Eq. (6.103). The state of the system at time $t$ is described by the random variable $x_t$, which evolves from the previous state $t-1$ according to the following linear equation:

$x_t = F_t \cdot x_{t-1} + G_t u_t + \varepsilon_t$   (6.104)

where

– $x_t$ is a state vector of size $n_x$ whose components are the variables that characterize the system (for example, position, velocity, ...). Each variable is normally assumed to have a Gaussian distribution $N(\mu, \Sigma)$ with mean $\mu$, which is the center of the random distribution, and covariance matrix $\Sigma$ of size $n_x \times n_x$. The correlation information between the state variables is captured by the (symmetric) covariance matrix $\Sigma$, of which each element $\Sigma_{ij}$ represents the level of correlation between the $i$th and $j$th variables. For the example of ball tracking, the two variables $p$ and $v$ would be correlated, considering that the new position of the ball depends on the velocity, assuming no external influence (gravity, wind, ...).
– $F_t$ is the transition matrix that models the system prediction, that is, the state transition matrix that applies the effect of each system state parameter at time $t-1$ to the system state at time $t$ (therefore, for the example considered, the position and velocity at time $t-1$ both influence the new ball position at time $t$). It should be noted that the components of $F_t$ are assumed to be constant across the changes of state of the system, even if in reality they can change (for example, the variables of the system may deviate from the hypothesized Gaussian distribution). This last situation is not a problem, since we will see that the Kalman filter converges toward a correct estimate even if the distribution of the system variables deviates from the Gaussian assumption.
– $u_t$ is an input vector of size $n_u$ whose components are the system control input parameters that influence the state vector $x_t$. For the ball tracking problem, if we wanted to consider the effect of the gravitational field, we should also include in the state vector $x_t$ the vertical component $y$ of the position $p = (x, y)$, which depends on the acceleration $-g$ according to the gravitational motion equation $y = \frac{1}{2} g t^2$.
– $G_t$ is the control matrix associated with the input parameters $u_t$.
– $\varepsilon_t$ is the vector (normally unknown; it represents the uncertainty of the system model) that includes the process noise terms for each parameter associated with the state vector $x_t$. The process noise is assumed to have a multivariate normal distribution with zero mean (white noise process) and covariance matrix $Q_t = E[\varepsilon_t \varepsilon_t^T]$.

Equation (6.104) defines a linear stochastic process, where each value of the state vector $x_t$ is a linear combination of its previous value $x_{t-1}$ plus the value of the control vector $u_t$ and the process noise $\varepsilon_t$. The equation associated with the observed measurements of the system, acquired in each state $t$, is given by

$z_t = H_t \cdot x_t + \eta_t$   (6.105)

where

– $z_t$ is an observation vector of size $n_z$ whose components are the measurements observed at time $t$, expressed in the sensor domain. Such measurements are obtained independently of the prediction model and in this example correspond to the coordinates of the position of the ball detected in the $t$th image of the sequence.
– $H_t$ is the transformation matrix (also known as the measurement matrix), of size $n_z \times n_x$, which maps the state vector $x_t$ into the measurement domain.
– $\eta_t$ is the vector (normally unknown; it models additional noise in the observation) of size $n_z$ whose components are the noise terms associated with the observed measures. As for the process noise $\varepsilon_t$, this noise is assumed to be white noise, that is, with a zero-mean Gaussian distribution and covariance matrix $R_t = E[\eta_t \eta_t^T]$. It should be noted that $\varepsilon_t$ and $\eta_t$ are independent variables, and therefore the uncertainty on the prediction model does not depend on the uncertainty on the observed measures and vice versa.

Returning to the example of ball tracking, under the hypothesis of constant velocity, the state vector $x_t$ is given by the velocity $v_t$ and by the horizontal position $x_t = x_{t-1} + \Delta t\, v_{t-1}$, as follows:

$x_t = \begin{bmatrix} x_t \\ v_t \end{bmatrix} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ v_{t-1} \end{bmatrix} + \varepsilon_t = F_{t-1} x_{t-1} + \varepsilon_t$   (6.106)

For simplicity, the dominant dynamics of lateral motion with respect to the optical axis of the camera was considered, thus considering only the position $p = (x, 0)$ along the horizontal $x$-axis of the image plane (the height of the ball is neglected, that is, the $y$-axis as shown in Fig. 6.28). If instead we also consider the influence of gravity on the motion of the ball, two additional variables must be added to the state vector $x$ to describe the vertical fall motion component. These additional variables are the vertical position $y$ of the ball and the velocity $v_y$ of free fall of the ball along the $y$-axis. The vertical position is given by

$y_t = y_{t-1} - \Delta t\, v_{y_{t-1}} - \frac{g (\Delta t)^2}{2}$

where $g$ is the gravitational acceleration. Now let us indicate with $v_x$ the horizontal velocity of the ball, which we assume constant taking into account the high acquisition frame rate ($\Delta t = 2$ ms for a frame rate of 500 fps). In this case, the state vector results in $x_t = (x_t, y_t, v_{x_t}, v_{y_t})^T$ and the linear Eq. (6.104) becomes

$x_t = \begin{bmatrix} x_t \\ y_t \\ v_{x_t} \\ v_{y_t} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & -\Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ v_{x_{t-1}} \\ v_{y_{t-1}} \end{bmatrix} + \begin{bmatrix} 0 \\ -\frac{(\Delta t)^2}{2} \\ 0 \\ \Delta t \end{bmatrix} g + \varepsilon_t = F_{t-1} x_{t-1} + G_t u_t + \varepsilon_t$   (6.107)

with the control variable $u_t = g$. For ball tracking, the observed measurements are the $(x, y)$ coordinates of the center of mass of the ball calculated in each image of the sequence through a ball detection algorithm [23,24]. Therefore, the vector of measures represents the coordinates of the ball, $z = [x\ \ y]^T$, and the equation of the observed measures, according to Eq. (6.105), is given by

$z_t = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} x_t + \eta_t = H_t x_t + \eta_t$   (6.108)

Having defined the problem of tracking an object, we are now able to adapt it to the Kalman filter model. This model involves two distinct processes (and therefore two sets of distinct equations): the update of the prediction and the update of the observed measures. The equations for updating the prediction are

$\hat{x}_{t|t-1} = F_t \cdot \hat{x}_{t-1|t-1} + G_t u_t$   (6.109)

$P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t$   (6.110)

Equation (6.109), derived from (6.104), computes an estimate $\hat{x}_{t|t-1}$ of the current state $t$ of the system on the basis of the previous state values at $t-1$ with the prediction matrices $F_t$ and $G_t$ (provided with the definition of the problem), assuming that the distributions of the state variable $\hat{x}_{t-1|t-1}$ and of the control $u_t$ are known and Gaussian. Equation (6.110) updates the covariance matrix of the system state prediction, knowing the covariance matrix $Q_t$ associated with the noise $\varepsilon$ of the input control variables. The variance associated with the prediction $\hat{x}_{t|t-1}$ of an unknown real value $x_t$ is given by $P_{t|t-1} = E[(x_t - \hat{x}_{t|t-1})(x_t - \hat{x}_{t|t-1})^T]$, where $E(\bullet)$ is the expectation value.⁸ The equations for updating the observed measurements are

$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (z_t - H_t \hat{x}_{t|t-1})$   (6.112)

$P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1} = (I - K_t H_t) P_{t|t-1}$   (6.113)

where the Kalman Gain is

$K_t = P_{t|t-1} H_t^T (H_t P_{t|t-1} H_t^T + R_t)^{-1}$   (6.114)
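As an illustration of Eqs. (6.109)-(6.110) and (6.112)-(6.114), the following minimal sketch (not from the book; matrix names follow the text, any numerical values would be placeholders) implements one prediction/correction cycle with NumPy.

```python
# Minimal linear Kalman predict/correct step (illustrative; see Eqs. 6.109-6.114).
import numpy as np

def kf_predict(x, P, F, Q, G=None, u=None):
    x_pred = F @ x if G is None else F @ x + G @ u       # Eq. (6.109)
    P_pred = F @ P @ F.T + Q                              # Eq. (6.110)
    return x_pred, P_pred

def kf_correct(x_pred, P_pred, z, H, R):
    S = H @ P_pred @ H.T + R                              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                   # Eq. (6.114)
    x_upd = x_pred + K @ (z - H @ x_pred)                 # Eq. (6.112)
    P_upd = (np.eye(len(x_pred)) - K @ H) @ P_pred        # Eq. (6.113)
    return x_upd, P_upd
```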

⁸ Let us better specify how the uncertainty of a stochastic variable is evaluated, which we know to be its variance, to motivate Eq. (6.110). In this case, we are initially interested in evaluating the uncertainty of the state vector prediction $\hat{x}_{t-1}$, which, being multidimensional, is given by its covariance matrix $P_{t-1} = Cov(\hat{x}_{t-1})$. Similarly, the uncertainty of the next value of the prediction vector $\hat{x}_t$ at time $t$, after the transformation $F_t$ obtained with (6.109), is given by

$P_t = Cov(\hat{x}_t) = Cov(F_t \hat{x}_{t-1}) = F_t Cov(\hat{x}_{t-1}) F_t^T = F_t P_{t-1} F_t^T$   (6.111)

Essentially, the linear transformation of the values $\hat{x}_{t-1}$ (which have a covariance matrix $P_{t-1}$) with the prediction matrix $F_t$ modifies the covariance matrix of the vectors of the next state $\hat{x}_t$ according to Eq. (6.111), where $P_t$ in fact represents the output covariance matrix of the linear prediction transformation, assuming a Gaussian state variable. In the literature, (6.111) represents the error propagation law for a linear transformation of a random variable whose mean and covariance are known, without necessarily knowing its exact probability distribution. Its proof is easily obtained by remembering the definition and properties of the expected value $\mu_x$ and covariance $\Sigma_x$ of a random variable $x$. In fact, considering a generic linear transformation $y = Ax + b$, the new expected value and covariance matrix are derived from the original distribution as follows:

$\mu_y = E[y] = E[Ax + b] = A E[x] + b = A \mu_x + b$

$\Sigma_y = E[(y - E[y])(y - E[y])^T] = E[(Ax + b - A E[x] - b)(Ax + b - A E[x] - b)^T] = E[(A(x - E[x]))(A(x - E[x]))^T] = A\, E[(x - E[x])(x - E[x])^T]\, A^T = A \Sigma_x A^T$

In the context of the Kalman filter, the error propagation also occurs in the propagation of the measurement prediction uncertainty (Eq. 6.105) from the previous state $x_{t|t-1}$, whose uncertainty is given by the covariance matrix $P_{t|t-1}$.

Fig. 6.29 Functional scheme of the iterative process of the Kalman filter. Prediction (state update): 1- predict the next state $\hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1} + G_t u_t$; 2- update the prediction covariance estimate $P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t$. Correction (measure update, using the observed measures $z_t$): 1- calculate the Kalman Gain $K_t = P_{t|t-1} H_t^T (H_t P_{t|t-1} H_t^T + R_t)^{-1}$; 2- update the estimate of $x_t$, $\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (z_t - H_t \hat{x}_{t|t-1})$; 3- update the covariance matrix $P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1}$; then $t = t + 1$ (the state is initialized at $t_0$)
with $R_t$, which is the covariance matrix associated with the noise of the observed measurements $z_t$ (normally known from the uncertainty of the measurements of the sensors used), and $H_t P_{t|t-1} H_t^T$, which is the covariance matrix of the measures that captures the uncertainty of the previous prediction state (characterized by the covariance matrix $P_{t|t-1}$) propagated onto the expected measures through the transformation matrix $H_t$, provided by the model of the measures, according to Eq. (6.105). At this point, the matrices $R$ and $Q$ remain to be determined, together with the initial values of $x_0$ and $P_0$, to start the iterative process of updating the state (prediction) and updating the observed measures (state correction), as shown in Fig. 6.29. We will now analyze the various phases of the iterative process of the Kalman filter reported in the diagram of this figure. In the prediction phase, step 1, an a priori estimate $\hat{x}_{t|t-1}$ is calculated, which is in fact a rough estimate made before the observation of the measures $z_t$, that is, before the correction phase (measures update). In step 2, the a priori covariance matrix of error propagation $P_{t|t-1}$ is computed with respect to the previous state $t-1$. These values are then used in the update equations of the observed measurements. In the correction phase, the system state vector $x_t$ is estimated by combining the a priori knowledge with the measurements observed at the current time $t$, thus obtaining a better updated and corrected estimate of $\hat{x}_t$ and of $P_t$. These values are necessary in the prediction/correction cycle for the future estimate at time $t+1$. Returning to the example of ball tracking, the Kalman filter is used to predict the region where the ball should be in each image of the sequence, acquired in real time with a high frame rate, considering that the speed of the ball can reach 120 km/h. The initial state $t_0$ starts as soon as the ball appears in an image of the sequence; only initially is the ball searched for over the entire image (normally HD type with a resolution of 1920 × 1080). The initial velocity $v_0$ of the ball is estimated by processing multiple adjacent images of the sequence before triggering the tracking process. The accuracy


Fig. 6.30 Diagram of the error filtering process between two successive states. a The position of the ball at time $t_1$ has an uncertain prediction (shown by the bell-shaped Gaussian pdf, whose width indicates the level of uncertainty given by the variance), since it is not known whether external factors influenced the prediction model. b The position of the ball given by the measure observed at time $t_1$, with a level of uncertainty due to the noise of the measurements, represented by the Gaussian pdf of the measurements. Combining the uncertainty of the prediction model with that of the measurement, that is, multiplying the two pdfs (prediction and measurement), a new filtered position estimate is obtained, more precise in the sense of the Kalman filter. The uncertainty of the filtered measurement is given by the third Gaussian pdf shown

of the position and initial speed of the ball is reasonably known and estimated in relation to the ball detection algorithm [24–26]. The next state of the ball (epoch $t = 1$) is estimated by the prediction update Eq. (6.109), which for ball tracking reduces to Eq. (6.106), excluding the influence of gravity on the motion of the ball. In essence, in the tracking process there is no control variable $u_t$ to consider in the prediction equation, and the position of the ball is based only on the knowledge of the state $x_0 = (x_0, v_0)$ at time $t_0$, and therefore with the uncertainty given by the Gaussian distribution $x_t \sim N(F_t x_{t-1}; \Sigma)$. This uncertainty is due only to the calculation of the position of the ball, which depends on the environmental conditions of acquisition of the images (for example, the lighting conditions vary between one state and the next). Furthermore, it is reasonable to assume less accuracy in predicting the position of the ball at time $t_1$ compared to time $t_0$, due to the noise that we propose to filter with the Kalman approach (see Fig. 6.30a). At time $t_1$ we have the measure observed on the position of the ball acquired from the current image of the sequence, which, for this example, according to Eq. (6.105) of the observed measures, results in:

$z_t = H_t \cdot x_t + \eta_t = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} x_t \\ v_t \end{bmatrix} + \eta_t$   (6.115)

where we assume the Gaussian noise $\eta_t \sim N(0; \Sigma_{\eta_t})$. With the observed measure $z_t$, we have a further measure of the position of the ball whose uncertainty is given by the distribution $z_t \sim N(\mu_z; \sigma_z^2)$. An optimal estimate of the position of the ball is obtained by combining that of the prediction $\hat{x}_{t|t-1}$ and that of the observed measure $z_t$. This is achieved by multiplying the two Gaussian distributions together. The product of two Gaussians is still a Gaussian (see Fig. 6.30b). This is fundamental because it allows us to multiply an unlimited number of Gaussian functions over the epochs, while the resulting function does not increase in complexity or number of terms. After each epoch, the new distribution is completely represented by a Gaussian function. This is the strategic solution behind the recursive property of the Kalman filter.

The situation represented in the figure can be described analytically to derive the Kalman filter updating equations shown in the functional diagram of Fig. 6.29. For this purpose, we indicate for simplicity with $N(\mu_x; \sigma_x^2)$ the Gaussian pdf related to the prediction and with $N(\mu_z; \sigma_z^2)$ the one related to the pdf of the observed measures. Remembering the one-dimensional Gaussian function in the form $\frac{1}{\sigma_x\sqrt{2\pi}} e^{-\frac{(x-\mu_x)^2}{2\sigma_x^2}}$, and similarly for $z$, executing the product of the two Gaussians we obtain a new Gaussian $N(\mu_c; \sigma_c^2)$, where the mean and variance of the product of the two Gaussians are given by

$\mu_c = \frac{\mu_x \sigma_z^2 + \mu_z \sigma_x^2}{\sigma_x^2 + \sigma_z^2} = \mu_x + \frac{\sigma_x^2 (\mu_z - \mu_x)}{\sigma_x^2 + \sigma_z^2}$   (6.116)

$\sigma_c^2 = \frac{\sigma_x^2 \sigma_z^2}{\sigma_x^2 + \sigma_z^2} = \sigma_x^2 - \frac{\sigma_x^4}{\sigma_x^2 + \sigma_z^2}$   (6.117)

These equations represent the updating equations at the base of the prediction/correction process of the Kalman filter; rewritten according to the symbolism of the iterative process, we have

$\mu_{\hat{x}_{t|t}} = \frac{\mu_{\hat{x}_{t|t-1}} \sigma_{z_t}^2 + \mu_{z_t} \sigma_{x_{t|t-1}}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2} = \mu_{\hat{x}_{t|t-1}} + (\mu_{z_t} - \mu_{\hat{x}_{t|t-1}}) \underbrace{\frac{\sigma_{x_{t|t-1}}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2}}_{Kalman\ Gain}$   (6.118)

$\sigma_{\hat{x}_{t|t}}^2 = \frac{\sigma_{x_{t|t-1}}^2 \sigma_{z_t}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2} = \sigma_{x_{t|t-1}}^2 - \frac{\sigma_{x_{t|t-1}}^2}{\sigma_{x_{t|t-1}}^2 + \sigma_{z_t}^2}\,\sigma_{x_{t|t-1}}^2$   (6.119)

By indicating with $k$ the Kalman Gain, the previous equations are thus simplified:

$\mu_{\hat{x}_{t|t}} = \mu_{\hat{x}_{t|t-1}} + k(\mu_{z_t} - \mu_{\hat{x}_{t|t-1}})$   (6.120)

$\sigma_{\hat{x}_{t|t}}^2 = \sigma_{x_{t|t-1}}^2 - k\,\sigma_{x_{t|t-1}}^2$   (6.121)

The Kalman Gain and Eqs. (6.120) and (6.121) can be rewritten in matrix form to handle the multidimensional Gaussian distributions $N(\mu; \Sigma)$:

$K = \frac{\Sigma_{x_{t|t-1}}}{\Sigma_{x_{t|t-1}} + \Sigma_{z_{t|t}}}$   (6.122)

$\mu_{\hat{x}_{t|t}} = \mu_{\hat{x}_{t|t-1}} + K(\mu_{z_t} - \mu_{\hat{x}_{t|t-1}})$   (6.123)

$\Sigma_{\hat{x}_{t|t}} = \Sigma_{x_{t|t-1}} - K\,\Sigma_{x_{t|t-1}}$   (6.124)

Finally, we can derive the general equations of prediction and correction in matrix form. This is possible considering the distribution of the prediction measures $\hat{x}_t$ given by $(\mu_{\hat{x}_{t|t-1}};\ \Sigma_{x_{t|t-1}}) = (H_t \hat{x}_{t|t-1};\ H_t P_{t|t-1} H_t^T)$ and the distribution of the observed measures $z_t$ given by $(\mu_{z_{t|t}};\ \Sigma_{z_{t|t}}) = (z_{t|t};\ R_t)$. Replacing these values of the prediction and correction distributions in (6.123), (6.124), and (6.122), we get

$H_t \hat{x}_{t|t} = H_t \hat{x}_{t|t-1} + K(z_t - H_t \hat{x}_{t|t-1})$   (6.125)

$H_t P_{t|t} H_t^T = H_t P_{t|t-1} H_t^T - K(H_t P_{t|t-1} H_t^T)$   (6.126)

$K = \frac{H_t P_{t|t-1} H_t^T}{H_t P_{t|t-1} H_t^T + R_t}$   (6.127)

We can now delete $H_t$ from the front of each term of the last three equations (remembering that one is hidden in the expression of $K$), and $H_t^T$ from Eq. (6.126); we finally get the following update equations:

$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \underbrace{(z_t - H_t \hat{x}_{t|t-1})}_{measurement\ residual}$   (6.128)

$P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1}$   (6.129)

$K_t = P_{t|t-1} H_t^T \big(\underbrace{H_t P_{t|t-1} H_t^T + R_t}_{residual\ covariance}\big)^{-1}$   (6.130)

Equation (6.128) calculates, for time $t$, the best new estimate of the state vector $\hat{x}_{t|t}$ of the system, combining the estimate of the prediction $\hat{x}_{t|t-1}$ (calculated with (6.109)) with the residual (also known as innovation), given by the difference between the observed measurements $z_t$ and the expected measurements $\hat{z}_{t|t} = H_t \hat{x}_{t|t-1}$. We highlight (for Eq. 6.128) that the measurement residual is weighted by the Kalman gain $K_t$, which establishes how much importance to give the residual with respect to the predicted estimate $\hat{x}_{t|t-1}$. We also sense the importance of $K$ in filtering the residual. In fact, from Eq. (6.127), we observe that the value of $K$ depends strictly on the values of the covariance matrices $R_t$ and $P_{t|t-1}$. If $R_t \to 0$ (i.e., the observed measurements are very accurate), (6.127) tells us that $K_t \to H_t^{-1}$ and the estimate of $\hat{x}_{t|t}$ depends mostly on the observed measurements. If instead the prediction covariance matrix $P_{t|t-1} \to 0$, this implies that $H_t P_{t|t-1} H_t^T \to 0$; the covariance matrix of the observed measurements $R_t$ then dominates, and consequently $K_t \to 0$, so the filter mostly ignores the measurements and relies instead on the prediction derived from the previous state (according to 6.128).

We can now complete the example of ball tracking based on the Kalman filter to optimally control the noise associated with the dynamics of the ball moving at high speed. The configuration of the acquisition system is the one shown in Fig. 6.28, where a high frame rate camera (400 fps) continuously acquires 1920 × 1080 pixel image sequences. As shown in the figure, it is interesting to estimate the horizontal trajectories of the ball over time, analyzing the images of the sequence in which the temporal distance between consecutive images is $\Delta t = \frac{1}{400}$ s. The acquisition system (camera-optics) has such a field angle that it observes an area of the scene that in the horizontal direction corresponds to an amplitude of about 12 m. Considering the horizontal resolution of the camera, the position of the ball is estimated with a resolution of 1 pixel, which corresponds in the spatial domain to 10 mm. The prediction equation for calculating the position $x_{t|t-1}$ is given by (6.106), with the matrix $F = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1/400 \\ 0 & 1 \end{bmatrix}$ that represents the dynamics of the system, while the state vector of the system is given by $x_t = [x_t\ \ v_t]^T$, where $v_t$ is the horizontal speed of the ball, assumed constant considering the high frame rate of the sequence of acquired images (this also justifies excluding the gravitational effect on the ball). The error $\varepsilon_t$ of the motion model is assumed to be constant in time $t$. The initial state of the system $x_{t_0} = [x_{t_0}\ \ v_{t_0}]^T$ is estimated as follows. A ball detection algorithm, as soon as the ball appears in the scene, detects it (analyzing the whole image) in a limited number of consecutive images and determines the initial state (position and velocity) of the system. The initial estimate of the velocity may not be very accurate compared to the real one. In this context, (6.106) defines the motion equation:

$\hat{x}_{t|t-1} = F \hat{x}_{t-1|t-1} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{x}_{t-1|t-1} \\ v_{t-1} \end{bmatrix} = \hat{x}_{t-1|t-1} + v_{t-1}\,\Delta t,$

where $v_{t-1}$ is not measured. The covariance matrix of the initial state, which represents the uncertainty of the state variables, results in the diagonal matrix $P_{t_0}$ containing the variance $\sigma_s^2$, where $\sigma_s$ is the standard deviation of the motion model associated only with the horizontal position of the ball, while zero is assumed for the velocity (measure not observed). The control variable $u$ and its control matrix $G$ are both null in this example, as is the covariance matrix $Q_t = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, which defines the uncertainty of the model. It is, therefore, assumed that the ball does not suffer significant slowdowns in the time interval of a few seconds during the entire tracking.

The equation of the measures (6.105) becomes

$z_t = H_t \hat{x}_{t|t-1} + \eta_{t-1} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} \hat{x}_{t|t-1} \\ v_{t-1} \end{bmatrix} + \eta_{t-1} = \hat{x}_{t|t-1} + \eta_{t-1},$

where the measurement matrix has only one nonzero element, since only the horizontal position is measured, while $H(2) = 0$ since the velocity is not measured. The measurement noise $\eta$ is controlled by the covariance matrix $R$, which in this case is a scalar $r$, associated only with the measure $x_t$. The uncertainty of the measurements is modeled as Gaussian noise controlled with $r = \sigma_m^2$, assuming constant variance in the update process. The simplified update equations are

$K_t = \frac{P_{t|t-1} H^T}{H P_{t|t-1} H^T + r}$

$P_{t|t} = P_{t|t-1} - K_t H P_{t|t-1}$

$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (z_t - H \hat{x}_{t|t-1})$

Figure 6.31 shows the results of the Kalman filter for ball tracking considering only the motion in the direction of the x-axis. The ball is assumed to have a speed of 80 km/h = $2.2222 \cdot 10^4$ mm/s. The uncertainty of the motion model (with constant speed) is assumed to be null (with covariance matrix $Q = 0$), and any slowing down or acceleration of the ball is treated as noise. The covariance matrix $P_t$ tracks the error due to the process at each time $t$ and indicates whether we should give more weight to the new measurement or to the estimate of the model, according to Eq. (6.129). In this example, assuming a zero-noise motion model, the state of the system is controlled by the variances of the state variables reported on the main diagonal of $P_t$. Previously, we indicated with $\sigma_s^2$ the variance (error) of the variable $x_t$, and the confidence matrix of the model $P_t$ is predicted, at every time, based on the previous value, through (6.110). The Kalman filter results shown in the figure are obtained with an initial value of $\sigma_s = 10$ mm. The measurement noise is characterized instead by a standard deviation $\sigma_m = 20$ mm (note that in this example the units of the state variables and of the observed measurements are homogeneous, expressed in mm). The Kalman filter was applied with an initial velocity error of 50% (graphs of the first row) and of 20% (graphs of the second row) with respect to the real one. In the figure, it is observed that the filter nevertheless converges toward the real values of the velocity, even if in different epochs in relation to the initial error. Convergence occurs through the action of error filtering (of the model and of the measurements), and it is significant to analyze the qualitative trend of the $P$ matrix and of the gain $K$ (which asymptotically tends to a minimum value) as the variances $\sigma_s^2$ and $\sigma_m^2$ vary. In general, if the variables are initialized with significant values, the filter converges faster. If the model corresponds well to the real situation, the state of the system is

[Figure 6.31 plots: left column, estimated ball position along the x-axis (mm) versus time (1/fps s); right column, estimated velocity (mm/s) versus time; legends: true value, observed measure/average estimated velocity, and estimate with the Kalman filter]

Fig. 6.31 Kalman filter results for the ball tracking considering only the dominant horizontal motion (x-axis) and neglecting the effect of gravity. The first column shows the graphs of the estimated position of the ball, while in the second column the estimated velocities are shown. The graphs of the two rows refer to two initial velocity starting conditions (in the first row the initial velocity error is very high, at 50%, while in the second row it is 20%), where it is observed that after a short time the initial velocity error is quickly filtered by the Kalman filter

well updated despite the presence of observed measures with considerable error (for example, 20-30% error). If instead the model does not reproduce the real situation well, even with measurements that are not very noisy, the state of the system presents a drift with respect to the true measures. If the model is poorly defined, there will be no good estimate. In this case, it is worth trying to make the model weigh less by increasing its estimated error. This will allow the Kalman filter to rely more on the measurement values while still allowing some noise removal. In essence, it would be convenient to set the measurement error $\eta$ and verify the effects on the system. Finally, it should be noted that the gain $K$ tends to give greater weight to the observed measures when it has high values; on the contrary, it weighs the prediction model more when it has low values. In real applications, the filter does not always achieve the optimality conditions provided by the theory, but it is used anyway, giving acceptable results, in various tracking situations and, in general, to model the dynamics of systems based on prediction/correction so as to minimize the covariance of the estimated error. In the case of ball tracking it is essential, in order to optimally predict the ball position, to significantly reduce the ball search region, and consequently appreciably reduce the search time spent by the ball detection algorithm (essential in this application context, which requires processing several hundred images per second). For nonlinear dynamic models or nonlinear measurement models, the Extended Kalman Filter (EKF) [9,27] is used, which solves the problem (albeit not very well) by applying the classical Kalman filter to the linearization of the system around the current estimate.
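A minimal simulation sketch of this 1-D constant-velocity example is given below (illustrative only; the frame rate, true speed, and the two standard deviations follow the values quoted above, while the synthetic data generation and the initial velocity uncertainty are assumptions made for demonstration purposes).

```python
# Illustrative 1-D constant-velocity Kalman filter for the ball example
# (dt = 1/400 s, true speed 80 km/h, sigma_s and sigma_m as in the text).
import numpy as np

rng = np.random.default_rng(0)
dt, n_frames = 1.0 / 400, 1000
v_true = 2.2222e4                      # mm/s (80 km/h)
sigma_s, sigma_m = 10.0, 20.0          # model and measurement std dev (mm)

F = np.array([[1.0, dt], [0.0, 1.0]])  # Eq. (6.106)
H = np.array([[1.0, 0.0]])             # only the position is measured
Q = np.zeros((2, 2))                   # motion model assumed noiseless
R = np.array([[sigma_m**2]])

x_true = np.array([0.0, v_true])
x_est = np.array([0.0, 0.5 * v_true])  # initial velocity wrong by 50%
P = np.diag([sigma_s**2, sigma_s**2])  # initial uncertainty (assumed also for velocity)

for _ in range(n_frames):
    x_true = F @ x_true                                   # simulate the ball
    z = H @ x_true + rng.normal(0.0, sigma_m, size=1)     # noisy observed position
    # prediction
    x_est = F @ x_est
    P = F @ P @ F.T + Q
    # correction
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x_est = x_est + (K @ (z - H @ x_est)).ravel()
    P = P - K @ H @ P

print("estimated speed (mm/s):", x_est[1])   # converges toward v_true
```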

6.5 Motion in Complex Scenes

In this section, we will describe some algorithms that continuously (in real time) detect the dynamics of the scene, characterized by the different entities in motion and by the continuous change of environmental conditions. This is the typical situation that arises for the automatic detection of complex sporting events (soccer, basketball, tennis, ...), where the entities in motion can reach high speeds (even hundreds of km/h) under changing environmental conditions and the event must be detected in real time. For example, the automatic detection of the offside event in football would require the simultaneous tracking of different entities, recognizing the class to which each belongs (ball, the two goalkeepers, players of team A and B, referee and assistants), in order to process the event data (for example, the player who has the ball, his position and that of the other players at the moment he hits the ball, the player who receives the ball) and make the decision in real time (in a few seconds) [3]. In the past, technological limits of vision and processing systems prevented the possibility of realizing vision machines for the detection, in real time, of such complex events under changing environmental conditions. Many of the traditional algorithms of motion analysis and object recognition fail in these operational contexts. The goal is to find robust and adaptive solutions (with respect to changing light conditions, recognition of dynamic and static entities, and arbitrarily complex configurations of multiple moving entities) by choosing algorithms with adequate computational complexity and immediate operation. In essence, algorithms are required that automatically learn the initial conditions of the operating context (without manual initialization) and automatically learn how the conditions change. Several solutions are reported in the literature (for tracking people, vehicles, ...), which are essentially based on fast and approximate methods such as background subtraction (BS), which is the direct way to detect and trace the motion of moving entities of a scene observed by stationary vision machines (with frame rates even higher than the standard 25 fps) [28–30]. Basically, the BS methods label the "dynamic pixels" at time t whose gray-level or color information changes significantly compared to the pixels belonging to the background. This simple and fast method, valid in the context of a stationary camera, is not always reliable, especially when the light conditions change and in all situations where the signal-to-noise ratio becomes unacceptable, also due to the noise inherent to the acquisition system. This has led to the development of background models, also based on statistical approaches [29,31], to mitigate the instability of simple BS approaches. Therefore, these new BS methods must not only robustly model the noise of the acquisition systems but must also adapt to the rapid change in environmental conditions. Another strategic aspect concerns the management of shadows and of temporary occlusions of moving entities with respect to the background. There is also the need to handle the difference between recorded video sequences (video broadcast) and video sequences acquired in real time, which has an impact on the types of BS algorithms to be used. We now describe the most common BS methods.

6.5.1 Simple Method of Background Subtraction

Several BS methods are proposed that are essentially based on the assumption that the sequence of images is acquired in the context of a stationary camera observing a scene with a stable background $B$, with respect to which moving objects are seen that normally have an appearance (distribution of colors or gray levels) distinguishable from $B$. These moving pixels represent the foreground, or regions of interest of ellipsoidal or rectangular shape (also known as blob, bounding box, cluster, ...). The general strategy, which distinguishes the pixels of moving entities (vehicle, person, ...) from the static ones (unchanging intensity) belonging to the background, is shown in Fig. 6.32. This strategy involves the continuous comparison between the current image and the background image, the latter appropriately updated through a model that takes into account changes in the operating context. A general expression that for each pixel $(x, y)$ evaluates this comparison between the current background $B$ and the image $I_t$ at time $t$ is the following:

$D(x, y) = \begin{cases} 1 & \text{if } d[I_t(x, y), B(x, y)] > \tau \\ 0 & \text{otherwise} \end{cases}$   (6.131)

where $d[\bullet]$ represents a metric to evaluate such a comparison, $\tau$ is a threshold value beyond which the pixel $(x, y)$ is labeled with "1" to indicate that it belongs to an object ("0" for non-object), and $D(x, y)$ represents the image-mask of the pixels in motion. The various BS methods are distinguished by the type of metric and by the background model used. A very simple BS method is based on the absolute value of the difference between the background image and the current image, $D_\tau(x, y) = |I_t(x, y) - B(x, y)|$, where we assume a significant difference in appearance (in terms of color or gray levels) between the background and the moving objects. A simple background model is obtained by updating it with the previous image, $B(x, y) = I_{t-1}(x, y)$, and Eq. (6.131) becomes

$D_{t,\tau}(x, y) = \begin{cases} 1 & \text{if } |I_t(x, y) - B_{t-1}(x, y)| > \tau \\ 0 & \text{otherwise} \end{cases}$   (6.132)

Fig. 6.32 Functional scheme of the process of detecting moving objects based on the simple method of subtraction of the background. The current frame of the sequence is compared to the current model of static objects (the background) to detect moving objects (foreground)

The background is then updated with the last image acquired, $B_t(x, y) = I_t(x, y)$, to be able to reapply (6.132) for the subsequent images. When a moving object is detected and then stops, with this simple BS method the object disappears from $D_{t,\tau}$. Furthermore, it is difficult to detect and recognize the object when its dominant motion is not lateral (for example, if it moves away from or gets closer to the camera). The results of this simple method depend very much on the threshold value adopted, which can be chosen manually or automatically by previously analyzing the histograms of both background and object images.
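A minimal sketch of this frame-differencing scheme (Eq. 6.132) using OpenCV is shown below; the video path and threshold are placeholder values.

```python
# Simple background subtraction by frame differencing (Eq. 6.132); illustrative only.
import cv2

cap = cv2.VideoCapture("sequence.avi")      # hypothetical input video
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
tau = 25                                     # assumed global threshold

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)           # |I_t - B_{t-1}|
    _, mask = cv2.threshold(diff, tau, 255, cv2.THRESH_BINARY)
    prev = gray                               # B_t = I_t
    cv2.imshow("moving pixels", mask)
    if cv2.waitKey(1) == 27:                  # ESC to quit
        break
```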

6.5.2 BS Method with Mean or Median

A first step to get a more robust background is to consider the average or the median of the $n$ previous images. In essence, an attempt is made to attenuate the noise present in the background due to the small movements of elements (leaves of a tree, bushes, ...) that are not part of the objects. A filtering operation based on the average or median (see Sect. 9.12.4 Vol. I) of the $n$ previous images is applied. With the method of the mean, the background is modeled with the arithmetic mean of the $n$ images kept in memory:

$B_t(x, y) = \frac{1}{n} \sum_{i=0}^{n-1} I_{t-i}(x, y)$   (6.133)

where $n$ is closely related to the acquisition frame rate and the object speed. Similarly, the background can be modeled with the median filter applied, for each pixel, to all the temporarily stored images. In this case, it is assumed that each pixel has a high probability of remaining static. The estimate of the background model is given by

$B_t(x, y) = \underset{i \in \{0, 1, \ldots, n-1\}}{\mathrm{median}} \{I_{t-i}(x, y)\}$   (6.134)

The mask image for both the mean and the median results in:

$D_{t,\tau}(x, y) = \begin{cases} 1 & \text{if } |I_t(x, y) - B_t(x, y)| > \tau \\ 0 & \text{otherwise} \end{cases}$   (6.135)

Appropriate values of $n$ and of the frame rate produce a correctly updated background and a realistic foreground mask of moving objects, with no phantom or missing objects. These methods are among the nonrecursive adaptive background updating techniques, in the sense that they depend only on the images currently stored and maintained in the system. Although easy to implement and fast, they have the drawbacks of nonadaptive methods, meaning that they can only be used for short-term tracking without significant changes to the scene. When an error occurs, it is necessary to reinitialize the background, otherwise the errors accumulate over time. They also require an adequate memory buffer to keep the last $n$ images acquired. Finally, the choice of the global threshold can be problematic.
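The following sketch (illustrative; buffer size and threshold are assumed values) maintains a sliding buffer of the last n frames and computes the median background of Eq. (6.134) with the mask of Eq. (6.135).

```python
# Median background over a sliding buffer of n frames (Eqs. 6.134-6.135); illustrative.
from collections import deque
import numpy as np

def median_bs(frames, n=25, tau=30):
    """frames: iterable of grayscale images (2-D uint8 arrays). Yields (background, mask)."""
    buf = deque(maxlen=n)
    for frame in frames:
        buf.append(frame)
        background = np.median(np.stack(buf, axis=0), axis=0).astype(np.uint8)
        mask = (np.abs(frame.astype(np.int16) - background) > tau).astype(np.uint8) * 255
        yield background, mask
```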

6.5.2.1 BS Method Based on the Moving Average

To avoid keeping a history of the last $n$ frames acquired, the background can be updated parametrically with the following formula:

$B_t(x, y) = \alpha I_t(x, y) + (1 - \alpha) B_{t-1}(x, y)$   (6.136)

where the parameter $\alpha$, seen as a learning parameter (it assumes a value between 0.01 and 0.05), models the update of the background $B_t(x, y)$ at time $t$, weighing the previous value $B_{t-1}(x, y)$ and the current value of the image $I_t(x, y)$. In essence, the current image is blended into the model image of the background via the parameter $\alpha$. If $\alpha = 0$, (6.136) reduces to $B_t(x, y) = B_{t-1}(x, y)$, the background remains unchanged, and the mask image is calculated by the simple subtraction method (6.133). If instead $\alpha = 1$, (6.136) reduces to $B_t(x, y) = I_t(x, y)$, producing the simple difference between images.
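A sketch of this running-average update (Eq. 6.136), with an assumed learning rate and threshold, is shown below; in OpenCV the same recursion is available through cv2.accumulateWeighted.

```python
# Running (exponentially weighted) average background, Eq. (6.136); illustrative.
import numpy as np

def update_background(background, frame, alpha=0.02):
    # B_t = alpha * I_t + (1 - alpha) * B_{t-1}
    return alpha * frame.astype(np.float32) + (1.0 - alpha) * background

def foreground_mask(frame, background, tau=30):
    return (np.abs(frame.astype(np.float32) - background) > tau).astype(np.uint8) * 255
```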

6.5.3 BS Method Based on the Moving Gaussian Average

This method [29] proposes approximating the distribution of the values of each pixel in the last $n$ images with a (unimodal) Gaussian probability density function. Therefore, in the hypothesis of gray-level images, two maps are maintained, one for the mean and one for the standard deviation. In the initialization phase, the two background maps $\mu_t(x, y)$ and $\sigma_t(x, y)$ are created to characterize each pixel with its own pdf with the parameters, respectively, of the mean $\mu_t(x, y)$ and of the variance $\sigma_t^2(x, y)$. The maps of the original background are initialized by acquiring images of the scene without moving objects and calculating the mean and the variance for each pixel. To manage the changes of the existing background, due to the variations of the ambient light conditions and to the motion of the objects, the two maps are updated in each pixel for each current image at time $t$, calculating the moving average and the relative moving variance, given by

$\mu_t(x, y) = \alpha I_t(x, y) + (1 - \alpha) \mu_{t-1}(x, y)$
$\sigma_t^2(x, y) = \alpha d^2 + (1 - \alpha) \sigma_{t-1}^2(x, y)$   (6.137)

with

$d(x, y) = |I_t(x, y) - \mu_t(x, y)|$   (6.138)

where $d$ indicates the Euclidean distance between the current value $I_t(x, y)$ of the pixel and its mean, and $\alpha$ is the learning parameter of the background update model. Normally $\alpha = 0.01$ and, as evidenced by (6.137), it tends to give little weight to the value of the current pixel $I_t(x, y)$ if this is classified as foreground, to avoid merging it into the background. Conversely, for a pixel classified as background, the value of $\alpha$ should be chosen based on the need for stability (lower value) or fast update (higher value). Therefore, the current mean in each pixel is updated based on the weighted average of the previous values and the current value of the pixel. It is observed that with the adaptive process given by (6.137) the values of the mean and of the variance of each pixel are accumulated, requiring little memory and yielding high execution speed. The pixel classification is performed by evaluating the absolute value of the difference between the current value and the current mean of the pixel with respect to a confidence threshold $\tau$, as follows:

$I_t = \begin{cases} Foreground & \text{if } \frac{|I_t(x, y) - \mu_t(x, y)|}{\sigma_t(x, y)} > \tau \\ Background & \text{otherwise} \end{cases}$   (6.139)

The value of the threshold depends on the context (good results can be obtained with $\tau = 2.5$), even if it is normally chosen as a factor $k$ of the standard deviation, $\tau = k\sigma$. This method is easily applicable also to color images [29] or multispectral images, maintaining two background maps for each color or spectral channel. This method has been successfully experimented for indoor applications, with the exception of cases with a multimodal background distribution.

6.5 Motion in Complex Scenes

561

background. The direct formula for classifying a pixel between moving object and background is the following:  It =

For egr ound if |I t (x, y) − I t−1 (x, y)| > τ (no background update) Backgr ound otherwise

(6.140)

The methods based on mobile average, seen in the previous paragraphs, are actually also selective.

6.5.5 BS Method Based on Gaussian Mixture Model (GMM) So far methods have been considered where the background update model is based on the recent pixel history. Only with the Gaussian moving average method was the background modeled with the statistical parameters of average and variance of each pixel of the last images with the assumption of a unimodal Gaussian distribution. No spatial correlation was considered with the pixels in the vicinity of the one being processed. To handle more complex application contexts where the background scene includes structures with small movements not to be regarded as moving objects (for example, small leaf movements, trees, temporarily generated shadows, . . .) different methods have been proposed based on models of background with multimodal Gaussian distribution [31]. In this case, the value of a pixel varies over time as a stochastic process instead of modeling the values of all pixels as a particular type of distribution. The method determines which Gaussian a background pixel can correspond to. The values of the pixels that do not adapt to the background are considered part of the objects, until it is associated with a Gaussian which includes them in a consistent and coherent way. In the analysis of the temporal sequence of images, it happens that the significant variations are due to the moving objects compared to the stationary ones. The distribution of each pixel is modeled with a mixture of K Gaussians N(μi,t (x, y),  i,t (x, y)). The probability P of the occurrence of an RGB pixel in the location (x, y) of the current image t is given by P(I i,t (x, y)) =

K 

ωi,t (x, y)N(μi,t (x, y),  i,t (x, y))

(6.141)

i=1

where ωi,t (x, y) is the weight of the ith Gaussian. To simplify, as suggested by the author, the covariance matrix  i,t (x, y) can be assumed to be diagonal and in this 2 (x, y)I, where I is the matrix 3×3 identity in the case case we have  i,t (x, y) = σi,t of RGB images. The K number of Gaussians depends on the operational context and the available resources (in terms of calculation and memory) even if it is normally between 3 and 5. Now let’s see how the weights and parameters of the Gaussians are initialized and updated as the images I t are acquired in real time. By virtue of (6.141), the distribution of recently observed values of each pixel in the scene is characterized by

562

6 Motion Analysis

the Gaussian mixture. With the new observation, i.e., the current image I t , each pixel will be associated with one of the Gaussian components of the mixture and must be used to update the parameters of the model (the Gaussians). This is implemented as a kind of classification, for example, the K-means algorithm. Each new pixel I t (x, y) is associated with the Gaussian component for which the value of the pixel is within 2.5 standard deviations (that is, the distance is less than 2.5σi ) of its average. This 2.5 threshold value can be changed slightly, producing a slight impact on performance. If a new It (x, y) pixel is associated with one of the Gaussian distributions, the relative 2 (x, y) are updated as follows: parameters of the average μi,t (x, y) and variance σi,t μi,t (x, y) = (1 − ρ)μi,t−1 (x, y) + ρ It (x, y) 2 2 (x, y) = (1 − ρ)σi,t−1 (x, y) + ρ[It (x, y) − μi,t (x, y)]2 σi,t

(6.142)

while the weights of all the Gaussians are updated as follows:

ω_{i,t}(x, y) = (1 − α) ω_{i,t−1}(x, y) + α M_{i,t}(x, y)   (6.143)

where α is the user-defined learning parameter, ρ is a second learning parameter defined as ρ = α N(I_t | μ_{i,t−1}(x, y), σ²_{i,t−1}(x, y)), and M_{i,t}(x, y) = 1 indicates that the pixel I_t(x, y) is associated with the ith Gaussian (whose weight is increased), while it is zero for all the others (whose weights are decreased). If, on the other hand, I_t(x, y) is not associated with any of the Gaussians of the mixture, I_t(x, y) is considered a foreground pixel, and the least probable distribution is replaced by a new distribution with the current value as mean (μ_t = I_t), initialized with a high variance and a low weight. The weights of the K Gaussians are updated with (6.143), while the parameters μ and σ of the Gaussians remain unchanged. The author identifies the least probable distribution with the following heuristic: the Gaussians that have a larger population of pixels and the minimum variance should correspond to the background. With this assumption, the Gaussians are ordered with respect to the ratio ω/σ; this ratio increases when a Gaussian has broad support and as its variance decreases. The first B distributions are then simply chosen as the background model:

B = \arg\min_{b} \left\{ \sum_{i=1}^{b} \omega_i > T \right\}   (6.144)

where T indicates the minimum portion of the image that should be background (characterized by distributions with high weight and low variance). Slowly moving objects take longer to be absorbed into the background because their variance remains larger than that of the background. Repetitive variations are also learned, and a model of the background distribution is maintained, which leads to faster recovery when objects are removed from subsequent images. The simple BS methods (image differencing, average and median filtering), although very fast, use a global threshold to detect scene changes and are


inadequate in complex real scenes. A method that models the background adaptively with a mixture of Gaussians copes better with complex real situations, where the background is often bimodal, with long-term scene changes and confusing repetitive motions (for example, caused by the temporary overlapping of moving objects). Better results are often obtained by combining the adaptive approach with temporal information on the dynamics of the scene, or by combining it with local information derived from the simple BS methods.
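As a concrete illustration of the update rules (6.141)-(6.144), the following is a compact per-pixel sketch in NumPy for grayscale values; the class name, the initialization of the means, and the parameter defaults (K, α, T, the initial variance and weight) are illustrative assumptions rather than values prescribed by the method.

```python
import numpy as np

class PixelGMM:
    """Minimal per-pixel Gaussian mixture background model (grayscale sketch
    of the scheme in Eqs. 6.141-6.144)."""
    def __init__(self, K=3, alpha=0.01, T=0.7, var0=400.0, w0=0.05):
        self.K, self.alpha, self.T, self.var0, self.w0 = K, alpha, T, var0, w0
        self.w = np.full(K, 1.0 / K)              # mixture weights
        self.mu = np.linspace(0.0, 255.0, K)      # arbitrary initial means
        self.var = np.full(K, var0)

    def update(self, x):
        """Feed one scalar pixel value; returns True if classified as foreground."""
        match = np.abs(x - self.mu) < 2.5 * np.sqrt(self.var)
        M = np.zeros(self.K)
        if match.any():
            i = int(np.argmax(match))                       # first matching Gaussian
            rho = self.alpha * self._gauss(x, i)
            self.mu[i] = (1 - rho) * self.mu[i] + rho * x                 # Eq. (6.142)
            self.var[i] = (1 - rho) * self.var[i] + rho * (x - self.mu[i]) ** 2
            M[i] = 1.0
        else:
            i = int(np.argmin(self.w / np.sqrt(self.var)))  # replace least probable
            self.mu[i], self.var[i], self.w[i] = float(x), self.var0, self.w0
        self.w = (1 - self.alpha) * self.w + self.alpha * M               # Eq. (6.143)
        self.w /= self.w.sum()
        order = np.argsort(-self.w / np.sqrt(self.var))     # rank by omega / sigma
        b = int(np.searchsorted(np.cumsum(self.w[order]), self.T)) + 1
        background = set(order[:b].tolist())                # Eq. (6.144)
        return not (match.any() and int(np.argmax(match)) in background)

    def _gauss(self, x, i):
        return np.exp(-0.5 * (x - self.mu[i]) ** 2 / self.var[i]) \
               / np.sqrt(2.0 * np.pi * self.var[i])
```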

6.5.6 Background Modeling Using Statistical Method Kernel Density Estimation

A nonparametric statistical method for modeling the background [32] is based on Kernel Density Estimation (KDE), where the multimodal pdf is estimated directly from the data without any assumption on the underlying distribution. With the KDE approach it is possible to estimate the pdf by analyzing the most recent samples of a stochastic process through a kernel function (for example, the uniform or the normal distribution), with the properties of being nonnegative, symmetric (even), and integrating to 1 over its support. In essence, for each data sample a kernel function is created centered on that sample, thus ensuring that the kernel is symmetric. The pdf is then estimated by summing all the kernel functions, normalized with respect to the number of samples so that the estimate satisfies the kernel properties of being nonnegative and integrating to 1 over the whole support. The simplest kernel function is the uniform rectangular function (also known as the Parzen window), where all the data have the same weight regardless of their distance from the (zero-mean) center. Normally, the Gaussian kernel function K (radially symmetric and unimodal) is used instead, which leads to the following estimate of the pdf of the background:

P_{kde}(I_t(x, y)) = \frac{1}{n} \sum_{i=t-n}^{t-1} K\left( I_t(x, y) - I_i(x, y) \right)   (6.145)

where n is the number of previous images used to estimate the pdf of the background with the Gaussian kernel function K. A pixel I_t(x, y) is labeled as background if P_{kde}(I_t(x, y)) > T, where T is a predefined threshold; otherwise it is considered a foreground pixel. The threshold T is adapted in relation to the number of false positives acceptable for the application context. The KDE method extends directly to multivariate variables and is therefore immediately usable for multispectral or color images. In this case, the kernel function is obtained from the product of one-dimensional kernel functions and (6.145) becomes


P_{kde}(I_t(x, y)) = \frac{1}{n} \sum_{i=t-n}^{t-1} \prod_{j=1}^{m} K\left( \frac{I_t^{(j)}(x, y) - I_i^{(j)}(x, y)}{\sigma_j} \right)   (6.146)

where I is an image with m components (m = 3 for RGB images) and σ_j is the standard deviation (the smoothing parameter, which controls the discrepancy between the estimated and the true distribution) associated with the jth component of the kernel function, all components being assumed Gaussian. This method classifies the pixels into background and foreground by analyzing only the most recent images of the temporal sequence: the older history is forgotten and the model of the scene is updated considering only the last n observations.
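A minimal sketch of the classification rule of (6.146) in NumPy is given below; the array shapes, the per-channel bandwidths sigma, and the threshold value are illustrative assumptions.

```python
import numpy as np

def kde_foreground_mask(history, frame, sigma, threshold):
    """Nonparametric background test (sketch of Eq. 6.146): a pixel is background
    when its kernel density estimate over the last n frames exceeds `threshold`.
    history: (n, H, W, m) previous frames; frame: (H, W, m); sigma: (m,) bandwidths."""
    diff = frame[None, ...].astype(np.float32) - history.astype(np.float32)
    # product of one-dimensional Gaussian kernels over the m channels
    k = np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    p = np.mean(np.prod(k, axis=-1), axis=0)       # average over the n frames
    return p <= threshold                           # True where foreground

# usage sketch (hypothetical values):
# mask = kde_foreground_mask(history, frame, sigma=np.full(3, 15.0), threshold=1e-6)
```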

6.5.7 Eigen Background Method

To model the background, it is also possible to use principal component analysis (PCA, introduced in Sect. 2.10.1 Vol. II), as proposed in [33]. Compared to the previous methods, which process each pixel independently, with PCA the whole sequence of n images is processed to compute the eigenspace that adaptively models the background. In other words, with PCA the background is modeled globally, which makes it more stable with respect to noise and reduces the dimensionality of the data. The eigen background method involves two phases:

Learning phase. The pixels of the ith image of the sequence are organized in a column vector I_i = {I_{1,i}, ..., I_{j,i}, ..., I_{N,i}} of size N × 1, which contains all N pixels of the image. The entire sequence is organized as the n columns of a matrix I of size N × n, of which the mean image μ = (1/n) Σ_{i=1}^{n} I_i is computed. Then the matrix X = [X_1, X_2, ..., X_n] of size N × n is formed, where each column vector (image) has zero mean, X_i = I_i − μ. Next, the covariance matrix C = E{X_i X_i^T} ≈ (1/n) X X^T is computed. By virtue of the PCA transform, it is possible to diagonalize the covariance matrix C by computing the eigenvector matrix Φ, obtaining

D = \Phi^T C \, \Phi   (6.147)

where D is the diagonal matrix of eigenvalues. Still according to the PCA transform, only the first m < n principal components can be retained to model the background in a reduced space, projecting into that space, through the first m eigenvectors Φ_m (associated with the m largest eigenvalues), the new images I_t acquired at time t.

Test phase. Given the eigen background and the mean image μ, a new column image I_t is projected into the reduced eigenspace through the eigenvector matrix Φ_m of size N × m, obtaining

B_t = \Phi_m^T (I_t - \mu)   (6.148)


where B_t represents the zero-mean projection of I_t in the eigenspace described by Φ_m. The reconstruction of I_t from the reduced eigenspace back into the image space is given by the following inverse transform:

\tilde{B}_t = \Phi_m B_t + \mu   (6.149)

At this point, considering that the eigenspace described by Φ_m models mainly the static scene and not the dynamic objects, the image B̃_t reconstructed from the eigenspace does not contain the moving objects; these can instead be highlighted by comparing, with a metric (for example, the Euclidean distance d_2), the input image I_t and the reconstructed one B̃_t:

F_t(x, y) = \begin{cases} Foreground & \text{if } d_2(I_t, \tilde{B}_t) > T \\ Background & \text{otherwise} \end{cases}   (6.150)

where T is a predefined threshold value. As an alternative to the PCA principal components, to reduce the dimensionality of the image sequence I and model the background, the SVD decomposition of I can be used, given by

I = U \Sigma V^T   (6.151)

where U is an orthogonal matrix of size N × N and V^T is an orthogonal matrix of size n × n. The singular values of the image sequence I are contained in the diagonal matrix Σ in descending order. Considering the strong correlation between the images of the sequence, only the first m nonzero singular values will be significant, with m ≪ n. It follows that the first m columns of U, denoted U_m, can be taken as an orthogonal basis equivalent to Φ_m to model the background and create the reduced space into which the input images I_t are projected, detecting foreground objects as done with the previous PCA approach. The eigen background approach, based on PCA analysis or SVD decomposition, remains problematic when the background must be updated continuously, in particular when a stream of video images has to be processed, even though the SVD decomposition is faster. Other solutions have been proposed [34], for example, adaptively updating the eigen background and making the detection of foreground objects more effective.
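The two phases can be summarized in a short NumPy sketch; the SVD route of (6.151) is used for the learning phase, and the function names and the per-pixel threshold are illustrative assumptions.

```python
import numpy as np

def learn_eigen_background(images, m):
    """Learning phase (sketch): `images` is an (N, n) matrix whose n columns are
    training frames flattened to N pixels; returns the N x m basis and the mean."""
    mu = images.mean(axis=1, keepdims=True)               # mean image
    U, s, Vt = np.linalg.svd(images - mu, full_matrices=False)   # Eq. (6.151)
    return U[:, :m], mu                                    # Phi_m, mu

def foreground_mask(Phi_m, mu, frame, T):
    """Test phase (sketch of Eqs. 6.148-6.150): project, reconstruct, and threshold
    the per-pixel reconstruction error. `frame` is an (N, 1) column image."""
    B = Phi_m.T @ (frame - mu)                            # projection, Eq. (6.148)
    recon = Phi_m @ B + mu                                # reconstruction, Eq. (6.149)
    return np.abs(frame - recon) > T                      # per-pixel foreground test
```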

6.5.8 Additional Background Models

In the parametric models, the probability density function (pdf) of the background pixels is assumed to be of a known form (for example, a Gaussian) described by its characteristic parameters (mean and variance). A semiparametric approach used to model the variability of the background is represented, for example, by the Gaussian mixture described above. A more general approach, used in different applications (in particular


in Computer Vision), consists instead in estimating the pdf directly from the data, without assuming a particular form for their distribution. This approach is known as nonparametric estimation of the distribution; the simplest examples are the histogram and the Parzen window (see Sect. 1.9.4), known as kernel density estimation. Among the nonparametric approaches is the mean-shift algorithm (see Sect. 5.8.2 Vol. II), an iterative gradient-ascent method with good convergence properties that detects the peaks (modes) of a multivariate distribution and the related covariance matrix. The algorithm has been adopted as an effective technique both for blob tracking and for background modeling [35–37]. Like all nonparametric models, it can model complex pdfs, but its implementation requires considerable computational and memory resources. A practical solution is to use the mean-shift method only to model the initial background (the pdf of the initial image sequence) and to use a propagation method to update the background model. This strategy is proposed in [38], which propagates and updates the pdf with the new images in real time.

6.6 Analytical Structure of the Optical Flow of a Rigid Body

In this section, we want to derive the geometric relations that link the motion parameters of a rigid body, represented by a planar surface, and the optical flow induced in the image plane (the observed 2D displacements of intensity patterns in the image), assumed to correspond to the motion field (the projection of the 3D velocity vectors onto the 2D image plane). In particular, given a sequence of time-varying images acquired while objects of the scene move with respect to the camera, or vice versa, we want to find solutions to estimate:

1. the 3D motion of the objects with respect to the camera, by analyzing the 2D flow field induced by the sequence of images;
2. the object–camera distance;
3. the 3D structure of the scene.

As shown in Fig. 6.33a, the camera can be considered stationary and the object in motion with velocity V, or vice versa. The optical axis of the camera is aligned with the Z-axis of the camera reference system (X, Y, Z), with respect to which the moving object is referenced. The image plane is the plane (x, y) perpendicular to the Z-axis at distance f, where f is the focal length of the optics. In reality, the optical system is simplified with the pinhole model, and the focal distance f is the distance between the image plane and the perspective projection center located at the origin O of the reference system (X, Y, Z). A point P = (X, Y, Z) of the object plane, in the context of perspective projection, is projected in the image plane at the point p = (x, y) calculated with the perspective projection

Fig. 6.33 Geometry of the perspective projection of a 3D point of the scene with the pinhole model. a Point P in motion with velocity V with respect to the observer, whose reference system (X, Y, Z) has its origin in the perspective center of projection (CoP), with the Z-axis coinciding with the optical axis and perpendicular to the image plane (x, y); b 3D relative velocity in the plane (Y, Z) of the point P and 2D velocity of its perspective projection p in the image plane (only the y-axis is visible)

equations (see Sect. 3.6 Vol. II), derived with the properties of similar triangles (see Fig. 6.33b), given by

x = f \frac{X}{Z} \qquad y = f \frac{Y}{Z} \qquad \Longrightarrow \qquad \mathbf{p} = f \frac{\mathbf{P}}{Z}   (6.152)

Now let us imagine the point P(t) in motion with velocity V = dP/dt, which after a time Δt moves to the 3D position P(t + Δt). The perspective projection of P at time t in the image plane is p(t) = (x(t), y(t)), while in the next image, after the time interval Δt, it has moved to p(t + Δt). The apparent velocity of P in the image plane is indicated with v, given by the components:

v_x = \frac{dx}{dt} \qquad v_y = \frac{dy}{dt}   (6.153)

These components are precisely those that generate the image motion field, i.e., they represent the 2D velocity vector v = (v_x, v_y) of p, the perspective projection of the velocity vector V = (V_x, V_y, V_z) of the moving point P. To calculate the velocity v in the image plane, we can differentiate with respect to t (using the quotient rule) the perspective equation (6.152), which expressed in vector terms is p = f P/Z, obtaining:

\mathbf{v} = \frac{d\mathbf{p}(t)}{dt} = \frac{d}{dt}\left( f \frac{\mathbf{P}(t)}{Z(t)} \right) = f \frac{Z \mathbf{V} - V_z \mathbf{P}}{Z^2}   (6.154)

whose components are

v_x = \frac{f V_x - x V_z}{Z} \qquad v_y = \frac{f V_y - y V_z}{Z}   (6.155)

while v_z = f V_z/Z − f V_z/Z = 0. From (6.154), it emerges that the apparent velocity is a function of the velocity V of the 3D motion of P and of its depth Z with respect to the image plane. We can reformulate the velocity components in terms of a perspective


projection matrix given by

\begin{bmatrix} v_x \\ v_y \end{bmatrix} = \frac{1}{Z} \begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}   (6.156)

The relative velocity of the point P with respect to the camera, in the context of a rigid body (where all the points of the object have the same motion parameters), can also be described in terms of an instantaneous rectilinear velocity T = (T_x, T_y, T_z)^T and an angular velocity Ω = (Ω_x, Ω_y, Ω_z)^T (around the origin) by the following equation [39,40]:

\mathbf{V} = \mathbf{T} + \mathbf{\Omega} \times \mathbf{P}   (6.157)

where the “×” symbol indicates the vector product. The components of V are

V_x = T_x + \Omega_y Z - \Omega_z Y
V_y = T_y - \Omega_x Z + \Omega_z X
V_z = T_z + \Omega_x Y - \Omega_y X   (6.158)

In matrix form, the relative velocity V = (V_x, V_y, V_z) of P = (X, Y, Z) is given by

\mathbf{V} = \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & Z & -Y \\ 0 & 1 & 0 & -Z & 0 & X \\ 0 & 0 & 1 & Y & -X & 0 \end{bmatrix} \begin{bmatrix} T_x \\ T_y \\ T_z \\ \Omega_x \\ \Omega_y \\ \Omega_z \end{bmatrix}   (6.159)

The instantaneous velocity v = (v_x, v_y) of p = (x, y) can be computed by substituting the relative velocity V into Eq. (6.156), obtaining

\mathbf{v} = \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \underbrace{\frac{1}{Z} \begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix} \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}}_{\text{Translational component}} + \underbrace{\frac{1}{f} \begin{bmatrix} x y & -(f^2 + x^2) & f y \\ f^2 + y^2 & -x y & -f x \end{bmatrix} \begin{bmatrix} \Omega_x \\ \Omega_y \\ \Omega_z \end{bmatrix}}_{\text{Rotational component}}   (6.160)

from which it emerges that the perspective motion of P, projected in the image plane at p, induces a flow field v produced by the linear composition of the translational and rotational motions. The translational flow component depends on the distance Z of the point P, which does not affect the rotational component. For a better readability of the flow induced in the image plane by the different types of motion (6.160),

Fig. 6.34 Geometry of the perspective projection of 3D points of a moving plane

we can rewrite it in the following form:

v_x = \underbrace{\frac{T_x f - T_z x}{Z}}_{\text{Transl. comp.}} + \underbrace{\Omega_x \frac{x y}{f} - \Omega_y \left( f + \frac{x^2}{f} \right) + \Omega_z y}_{\text{Rotational component}}

v_y = \underbrace{\frac{T_y f - T_z y}{Z}}_{\text{Transl. comp.}} + \underbrace{\Omega_x \left( f + \frac{y^2}{f} \right) - \Omega_y \frac{x y}{f} - \Omega_z x}_{\text{Rotational component}}   (6.161)

We have, therefore, defined the perspective motion model for a rigid body, assuming zero optical distortion, which relates to each point of the image plane the apparent velocity (motion field) of a 3D point of the scene at distance Z, subject to the translational motion T and the rotation Ω. Other, simpler motion models can also be considered, such as the weak perspective, orthographic, or affine models. From the analysis of the motion field, it is possible to derive some parameters of the 3D motion of the objects. In fact, once the optical flow (v_x, v_y) is known, Eq. (6.161) gives for each point (x, y) of the image plane two bilinear equations in 7 unknowns: the depth Z, the 3 translational velocity components T, and the 3 angular velocity components Ω. The optical flow is a linear combination of T and Ω once Z is known, or a linear combination of the inverse depth 1/Z and Ω once the translational velocity T is known. Theoretically, the 3D structure of the object (the inverse depth 1/Z for each image point) and the motion components (translational and rotational) can be determined by knowing the optical flow at several points of the image plane. For example, if the dominant surface of an object is a planar surface (see Fig. 6.34), it can be described by

\mathbf{P} \cdot \mathbf{n}^T = d   (6.162)

where d is the perpendicular distance of the plane from the origin of the reference system, P = (X, Y, Z) is a generic point of the plane, and n = (n_x, n_y, n_z)^T is the normal vector to the planar surface, as shown in the figure. In the hypothesis of translational and rotational motion of the planar surface with respect to the observer (camera), the normal n and the distance d vary in time. Using Eq. (6.152) of the perspective projection and solving with respect to the vector P, the spatial position of the point belonging


to the plane is obtained again:

\mathbf{P} = \frac{\mathbf{p} Z}{f}   (6.163)

which, substituted into the equation of the plane (6.162) and solved with respect to the inverse depth 1/Z of P, gives

\frac{1}{Z} = \frac{\mathbf{p} \cdot \mathbf{n}}{f d} = \frac{n_x x + n_y y + n_z f}{f d}   (6.164)

Substituting Eq. (6.164) into the equations of the motion field (6.161) we get

v_x = \frac{1}{f d} \left( a_1 x^2 + a_2 x y + a_3 f x + a_4 f y + a_5 f^2 \right)
v_y = \frac{1}{f d} \left( a_1 x y + a_2 y^2 + a_6 f y + a_7 f x + a_8 f^2 \right)   (6.165)

where

a_1 = -d \Omega_y + T_z n_x \qquad a_2 = d \Omega_x + T_z n_y
a_3 = T_z n_z - T_x n_x \qquad a_4 = d \Omega_z - T_x n_y
a_5 = -d \Omega_y - T_x n_z \qquad a_6 = T_z n_z - T_y n_y
a_7 = -d \Omega_z - T_y n_x \qquad a_8 = d \Omega_x - T_y n_z   (6.166)

Equations (6.165) define the motion field of a planar surface as a quadratic polynomial in the coordinates (x, y) of the image plane at distance f from the projection center. With Eq. (6.165), given an estimate of the optical flow (v_x, v_y) at different points of the image, it would theoretically be possible to recover the eight independent parameters (3 of T, 3 of Ω, and 2 of n, with f known) describing the motion and structure of the planar surface. It is sufficient to have at least 8 points in the image plane to estimate the 8 coefficients, seen as global parameters describing the motion of the planar surface. Longuet-Higgins [41] highlighted the ambiguities in recovering the structure of the planar surface from the instantaneous knowledge of the optical flow (v_x, v_y) using Eq. (6.165): different planes with different motions can produce the same flow field. However, if the vectors n and T are parallel, these ambiguities are attenuated.
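The following short NumPy sketch evaluates the planar motion field of (6.165), building the eight coefficients from a given motion (T, Ω) and plane (n, d) as in (6.166); the function name and the default focal length are illustrative assumptions, and the signs follow the equations as reconstructed above.

```python
import numpy as np

def planar_flow(x, y, T, Omega, n, d, f=1.0):
    """Sketch of Eqs. (6.165)-(6.166): motion field induced by a planar surface
    P . n = d moving with translation T and rotation Omega, evaluated at the
    image points (x, y)."""
    Tx, Ty, Tz = T
    Ox, Oy, Oz = Omega
    nx, ny, nz = n
    a1 = -d * Oy + Tz * nx
    a2 =  d * Ox + Tz * ny
    a3 =  Tz * nz - Tx * nx
    a4 =  d * Oz - Tx * ny
    a5 = -d * Oy - Tx * nz
    a6 =  Tz * nz - Ty * ny
    a7 = -d * Oz - Ty * nx
    a8 =  d * Ox - Ty * nz
    vx = (a1 * x**2 + a2 * x * y + a3 * f * x + a4 * f * y + a5 * f**2) / (f * d)
    vy = (a1 * x * y + a2 * y**2 + a6 * f * y + a7 * f * x + a8 * f**2) / (f * d)
    return vx, vy
```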

6.6.1 Motion Analysis from the Optical Flow Field

Let us now analyze what information can be extracted from the knowledge of the optical flow. In a stationary environment consisting of rigid bodies, whose depth can be known, a sequence of images is acquired by a camera moving toward the objects, or vice versa. From each pair of consecutive images, it is possible to extract the optical flow using one of the previous methods. From the optical flow, it is then possible to extract information on the structure and motion of the scene. In fact, from the


optical flow map it is possible, for example, to observe that regions with small velocity variations correspond to single surfaces of the image, carrying information on the structure of the observed surface. Regions with large velocity variations contain information on possible occlusions, or correspond to discontinuities between surfaces of objects at different distances from the observer (camera). A relationship between the orientation of a surface with respect to the observer and the small variations of the velocity gradients can also be derived. Let us now look at the type of motion field induced by the perspective (pinhole) motion model described by the general Eq. (6.161), in the hypothesis of a rigid body subject to roto-translation.

6.6.1.1 Motion with Pure Translation

In this case, it is assumed that Ω = 0 and the induced flow field, according to (6.161), is modeled as follows:

v_x = \frac{T_x f - T_z x}{Z} \qquad v_y = \frac{T_y f - T_z y}{Z}   (6.167)

We can now distinguish two particular cases of translational motion:

1. T_z = 0, which corresponds to translation at a constant distance Z from the observer (lateral motion). In this case, the induced flow field is represented by flow vectors parallel to each other along the horizontal motion direction. See the fourth illustration in Fig. 6.5, which shows an example of the optical flow induced by a lateral motion from right to left. In this case, the equations of the pure translational motion model, derived from (6.167), are

v_x = f \frac{T_x}{Z} \qquad v_y = f \frac{T_y}{Z}   (6.168)

It should be noted that, as Z varies, the length of the parallel vectors varies inversely with Z.

2. T_z ≠ 0, which corresponds to a translational motion approaching the object (with velocity T_z > 0) or moving away from it (with velocity T_z < 0), or vice versa. In this case, the induced flow field is represented by radial vectors emerging from a common point called the Focus Of Expansion (FOE) when the observer moves toward the object; it is instead known as the Focus Of Contraction (FOC) when the radial flow vectors converge to that point, i.e., when the observer moves away from the object (or, equivalently, when the observer is stationary and the object moves away). The first two images in Fig. 6.5 show the radial flows of expansion and contraction induced by this translational motion, respectively. The length of the vectors is inversely proportional to the distance Z and directly proportional to the distance in the image plane of p from the FOE located at p_0 = (0, 0) (assuming that the Z-axis and the direction


Fig. 6.35 Example of flow generated by translational or rotational motion. a Flow field induced by a longitudinal translational motion T = (0, 0, T_z) with the Z-axis coinciding with the velocity vector T; b flow field induced by a translational motion T = (T_x + T_z), with the FOE shifted along the x-axis of the image plane with respect to the origin; c flow field induced by a simple rotation (known as roll) of the camera around the longitudinal axis (in this case, the Z-axis)

of the velocity vector T_z coincide), where the flow velocity vanishes (see Fig. 6.35a). In the literature, the FOE is also known as the vanishing point. In the case of the lateral translational motion seen above, the FOE can be thought of as located at infinity, where the parallel flow vectors converge (the FOE corresponds to a vanishing point). Returning to the FOE, it represents the point of intersection between the direction of motion of the observer and the image plane (see Fig. 6.35a). If the motion also has a translational component in the X direction, with resulting velocity vector T = T_x + T_z, the FOE appears in the image plane displaced horizontally, with the flow vectors still radial and converging at the position p_0 = (x_0, y_0) where the relative velocity is zero (see Fig. 6.35b). This means that from a sequence of images, once the optical flow has been determined, it is possible to calculate the position of the FOE in the image plane, and therefore to know the direction of motion of the observer. We will see in the following how this is possible, considering also the uncertainty in the localization of the FOE due to the noise present in the image sequence and in the estimation of the optical flow. The flow fields shown in the figure refer to ideal, noise-free cases. In real applications, several independent objects may be in translational motion; the flow map is still radial but with different FOEs, each associated with the motion of the corresponding object. Therefore, before analyzing the motion of the single objects, it is necessary to segment the flow field into regions of homogeneous motion relative to each object.

6.6.1.2 Motion with Pure Rotation

In this case, it is assumed that T = 0 and the induced flow field, according to (6.161), is modeled by the rotation component only. A rotation around the Z-axis alone (called roll rotation⁹, with Ω_z ≠ 0 and Ω_x = Ω_y = 0), at a constant distance from the object, induces a flow represented by vectors oriented tangentially to concentric circles whose center is the projection in the image plane of the rotation center of the rotating object. In this case, the FOE does not exist and the flow is characterized by the center of rotation around which the flow vectors orbit (see Fig. 6.35c), tangent to the concentric circles. A pure rotation around the X-axis (called pitch rotation) or around the Y-axis (called yaw rotation) induces in the image plane a center of rotation of the vectors (no longer oriented tangentially along concentric circles, but positioned according to the perspective projection) shifted toward the FOE, which in these two cases does exist.

⁹ In the aeronautical context, the attitude of an aircraft (integral with the axes (X, Y, Z)) in 3D space is indicated by the angles of rotation around the axes, referred to, respectively, as lateral, vertical, and longitudinal. The rotation around the longitudinal Z-axis indicates the roll, the one around the lateral X-axis indicates the pitch, and the one around the vertical Y-axis indicates the yaw. In the robotic context (for example, for an autonomous vehicle), the attitude of the camera can be defined with 3 degrees of freedom, with the axes (X, Y, Z) indicating the lateral (side-to-side), vertical (up-down), and viewing (looking) directions. The rotations around the lateral, up-down, and looking axes retain the same meaning as the axes considered for the aircraft.

6.6.2 Calculation of Collision Time and Depth

The collision time is the time required by the observer to reach contact with the surface of the object when the motion is a pure translation. In the context of pure translational motion between scene and observer, the radial map of the optical flow in the image plane is analytically described by Eq. (6.167), valid both for T_z > 0 (observer moving toward the object), with the radial vectors emerging from the FOE, and for T_z < 0 (observer moving away from the object), with the radial vectors converging to the FOC. A generic point P = (X, Y, Z) of the scene, with translational velocity T = (0, 0, T_z) (see Fig. 6.35a), moves radially in the image plane with velocity v = (v_x, v_y), expressed by Eq. (6.168), and, recalling the perspective projection Eq. (6.152), is projected in the image plane at p = (x, y). We have seen how, in this case of pure translational motion, the vectors of the optical flow converge at the FOE point p_0 = (x_0, y_0), where they vanish, that is, (v_x, v_y) = (0, 0). Therefore, at the FOE point Eqs. (6.167) vanish, thus giving the FOE coordinates:

x_0 = f \frac{T_x}{T_z} \qquad y_0 = f \frac{T_y}{T_z}   (6.169)

The same results are obtained if the observer moves away from the scene, and in this case, it is referred to as a Focus Of Contraction (FOC). We can now express the relative velocity (vx , v y ) of the points p = (x, y) projected in the image plane with respect to their distance from p0 = (x0 , y0 ), that is, from the FOE, combining Eq. (6.169) and the equations of the pure translational motion



Fig. 6.36 Geometry of the perspective projection of a 3D point in the image plane at two instants of time, while the observer approaches the object with pure translational motion

model (6.167) by obtaining

v_x = \frac{T_z x - T_z x_0}{Z(x, y)} = \frac{T_z}{Z(x, y)} (x - x_0)
v_y = \frac{T_z y - T_z y_0}{Z(x, y)} = \frac{T_z}{Z(x, y)} (y - y_0)   (6.170)

It is pointed out that Eqs. (6.170) represent geometrically the radial map of the optical flow in the hypothesis of pure translational motion. Moreover, the lengths of the flow vectors are proportional to their distance (p − p_0) from the FOE and inversely proportional to the depth Z. In other words, the apparent velocity of the points of the scene increases as the observer approaches them. Figure 6.35a shows the example of the radial expansion flow induced with T_z > 0 (with T_z < 0 a contraction would result instead). Now let us see how to estimate, from the knowledge of the optical flow derived from a sequence of images:

– The Time To Collision (TTC) of the observer with a point of the scene, without knowing its distance and the approach velocity.
– The distance d of a point of the scene from the observer moving at a constant velocity T_z parallel to the Z-axis, in its direction.

The first is the typical situation of an autonomous vehicle that approaches an object and must be able to predict the collision time, assuming that it moves with constant velocity T_z. In these applications, it is strategic that this prediction be made without knowing, instant by instant, the speed and the distance from the object. In other situations, it is instead important to estimate a relative distance between vehicle and object without knowing the translation velocity.


6.6.2.1 TTC Calculation

Let us first consider the TTC calculation (often also referred to as time to contact). For simplicity, considering that the rotational component has no effect, we represent this situation in Fig. 6.36, considering only the Y–Z projection of the 3D reference system and the y-axis of the image plane. Let P = (Y_P, Z_P) be a generic point of the scene, p = (x_p, y_p) its perspective projection in the image plane, and f the distance between the image plane (perpendicular to the Z-axis) and the Center Of Projection COP (pinhole model assumed). If P moves with velocity T_z (or, equivalently, the observer moves and the object is stationary, or vice versa), its perspective projection is y_p = f Y_P/Z_P, while the apparent velocity in the image plane, according to Eq. (6.170), is given by

v_y = \frac{y_p T_z}{Z_P}   (6.171)

where we assume that the FOE has coordinates (x_0, y_0) = (0, 0) in the image plane. If the velocity vector T_z were not aligned with the Z-axis, the FOE would be shifted and its coordinates (x_0, y_0) could be calculated as the intersection of at least two flow vectors, projections of 3D points of the object. We will see later a more accurate way to determine the coordinates of the FOE, considering also the noise introduced by the computation of the optical flow from the image sequence. Returning to Eq. (6.171) of the instantaneous velocity v_y of the projection of P in the image plane at p = (x_p, y_p), dividing both members by y_p and taking the reciprocal, we get

\frac{y_p}{v_y} = \frac{Z_P}{T_z} \qquad \Longrightarrow \qquad \tau = \frac{y_p}{v_y}   (6.172)

We have essentially obtained that the time to collision τ is given by the ratio between the observer–object distance Z_P and the velocity T_z, which is the classic way to estimate the TTC, but these are quantities that we do not know with a single camera. Above all, we have the important result we wanted: the time to collision τ is also given by the ratio of two measurements derived from the optical flow, y_p (the length of the optical flow vector, obtained from the distance y_p − y_0 from the FOE) and v_y (the flow velocity ∂y/∂t), which can be estimated from the image sequence under the hypothesis of translational motion with constant speed. The accuracy of τ depends on the accuracy of the FOE position and of the optical flow. Normally, the value of τ is considered acceptable if y_p exceeds a threshold value, in terms of number of pixels, to be defined in relation to the velocity T_z. The same Eq. (6.172) for the time to collision τ is reached by considering the observer in motion toward the stationary object, as represented in Fig. 6.36. The figure shows two time instants of the projection of P in the image plane while the camera moves with velocity T_z toward the object. At time t, its perspective projection p in the image plane, according to (6.152), is y_p = f Y_P/Z_P. In time, the projection p moves away from the FOE as the image plane approaches P, moving radially to p′ at time t + 1. This dynamic is described by differentiating y_p with respect to time t, obtaining


\frac{\partial y_p}{\partial t} = f \frac{\frac{\partial Y_P}{\partial t} Z_P - Y_P \frac{\partial Z_P}{\partial t}}{Z_P^2}   (6.173)

This expression can be simplified by considering the type of motion T = (0, 0, T_z), for which ∂Y_P/∂t = 0, and noting that from the perspective projection (Eq. 6.152) we have Y_P = y_p Z_P/f and that ∂Z_P/∂t = T_z. Substituting in (6.173), we get

\frac{\partial y_p}{\partial t} = -y_p \frac{T_z}{Z_P}   (6.174)

Dividing, as before, both members by y_p and taking the reciprocal, we finally obtain the same expression (6.172) for the time to collision τ. It is also observed that τ does not depend on the focal length of the optical system or on the size of the object, but only on the observer–object distance Z_P and the translation velocity T_z. With the same principle used to estimate the TTC, we could estimate the size of the object (useful in navigation applications where we want to evaluate the size of an obstacle), reformulating the problem in terms of τ = h/h_t, where h is the height (seen as a scale factor) of the obstacle projected in the image plane and h_t (in analogy with v_y) is the time derivative of the scale of the object. The goal of this reformulation is not to estimate the absolute size of the object through τ, but to obtain an estimate of how its size varies in time from one image of the sequence to the next. In this case, it is also useful to estimate τ in the X–Z plane according to Eq. (6.170).
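A minimal sketch of the estimate τ = y_p / v_y of (6.172) is given below; the validity test on the distance from the FOE and the parameter names are illustrative assumptions.

```python
import numpy as np

def time_to_collision(y_p, v_y, min_offset=2.0):
    """Sketch of Eq. (6.172): tau = y_p / v_y, with y_p the image distance of the
    tracked point from the FOE and v_y its measured flow component (same units).
    Points too close to the FOE or with negligible flow are marked as unreliable."""
    y_p = np.atleast_1d(np.asarray(y_p, dtype=float))
    v_y = np.atleast_1d(np.asarray(v_y, dtype=float))
    tau = np.full_like(y_p, np.nan)
    valid = (np.abs(y_p) > min_offset) & (np.abs(v_y) > 1e-9)
    tau[valid] = y_p[valid] / v_y[valid]
    return tau      # expressed in frames; multiply by the frame period for seconds
```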

6.6.2.2 Depth Calculation

It is not possible to determine the absolute distance between the observer and the object with a single camera. However, we can estimate the relative distance of any two points of an object with respect to the observer. According to Eq. (6.172), consider two points of the object (which moves with constant velocity T_z) at time t, at distances Z_1(t) and Z_2(t); we have

\frac{y_1(t)}{v_y^{(1)}(t)} = \frac{Z_1(t)}{T_z} \qquad \frac{y_2(t)}{v_y^{(2)}(t)} = \frac{Z_2(t)}{T_z}   (6.175)

By dividing member by member, we get

\frac{Z_2(t)}{Z_1(t)} = \frac{y_2(t)}{y_1(t)} \cdot \frac{v_y^{(1)}(t)}{v_y^{(2)}(t)}   (6.176)


from which it emerges that it is possible to compute the relative 3D distance of any two points of the object, in terms of the ratio Z_2(t)/Z_1(t), using the optical flow measurements (velocity and distance from the FOE) derived from adjacent images of the sequence. If for some point of the object the accurate distance Z_r(t) is known, according to (6.176) we can determine the instantaneous depth Z_i(t) of any other point as follows:

Z_i(t) = Z_r(t) \, \frac{y_i(t)}{y_r(t)} \cdot \frac{v_y^{(r)}(t)}{v_y^{(i)}(t)}   (6.177)

In essence, this is the approach underlying the reconstruction of the 3D structure of the scene from motion information, in this case based on the optical flow derived from a sequence of time-varying images. With a single camera, the 3D structure of the scene is estimated only up to a scale factor.
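The two relations (6.176) and (6.177) translate directly into code; the following is a plain Python sketch with illustrative function names.

```python
def relative_depth(y1, vy1, y2, vy2):
    """Sketch of Eq. (6.176): ratio Z2/Z1 of the depths of two points, from their
    image distances from the FOE (y1, y2) and their flow components (vy1, vy2)."""
    return (y2 / y1) * (vy1 / vy2)

def depth_from_reference(Z_ref, y_ref, vy_ref, y_i, vy_i):
    """Sketch of Eq. (6.177): depth of point i when the depth Z_ref of a reference
    point is known from another source (e.g., a range sensor)."""
    return Z_ref * (y_i / y_ref) * (vy_ref / vy_i)
```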

6.6.3 FOE Calculation

In real applications with an autonomous vehicle, the flow field induced in the image plane is radial, generated by the dominant translational motion (assuming a flat floor), with negligible roll and pitch rotations and a possible yaw rotation (that is, a rotation around the Y-axis perpendicular to the floor; see note 9 for the conventions used for vehicle and scene orientation). With this radial type of flow field, the coordinates (x_0, y_0) of the FOE (defined by Eq. 6.169) can theoretically be determined by knowing the optical flow vectors of at least two points belonging to a rigid object. Under these radial-map conditions, the lines passing through two flow vectors intersect at the point (x_0, y_0), where all the other flow vectors converge, at least in ideal conditions, as shown in Fig. 6.35a. In reality, the optical flow observable in the image plane is induced by the motion of the visible points of the scene; normally, corners and edges are considered, which are not always easily visible and uniquely determined. This means that the projections of the points in the image plane, and therefore the flow velocities (given by 6.170), are not always accurately measured from the image sequence. It follows that the direction of the optical flow vectors is noisy, so the flow vectors do not converge at a single point. In this case, the location of the FOE is estimated approximately as the center of mass of the region where the optical flow vectors converge [42]. Calculating the location of the FOE is useful not only to obtain information on the structure of the motion (depth of the points) and the collision time, but also to obtain information on the direction of motion of the observer (called heading), which does not always coincide with the optical axis. In fact, the flow field induced by the translational motion T = (T_x + T_z) produces a FOE shifted along the x-axis of the image plane with respect to the origin (see Fig. 6.35b). We have already shown that we cannot fully determine the structure of the scene from the flow field, due to the lack of knowledge of the distance Z(x, y) of the


3D points, as indicated by Eq. (6.170), while the position of the FOE is independent of Z(x, y), according to Eq. (6.169). Finally, it is observed (and easily demonstrated geometrically) that the amplitude of the flow velocity vectors depends on the depth Z(x, y), while their direction does not. There are several methods for the estimation of the FOE. Many use calibrated systems that separate the translational and rotational motion components, or compute an approximation of the FOE position by setting up a minimum-error function (which imposes constraints on the correspondence of the points of interest of the scene detected in the image sequence) and solving by least squares, or by approximating the visible surface with elementary planes; after the error minimization process, an optimal FOE position is obtained. Other methods use the direction of the velocity vectors and determine the position of the FOE by evaluating the maximum number of intersections (for example, using the Hough transform), or by using a multilayer neural network [43,44]. The input flow field does not necessarily have to be dense; it is often useful to consider the velocity vectors associated with points of interest for which there is a good correspondence across the image sequence, generally points with high texture or corners. A least squares solution for the FOE calculation, using all the flow vectors in the pure translation context, is obtained by considering the flow equations (6.170) and imposing the constraint that eliminates the dependence on the translation velocity T_z and on the depth Z:

\frac{v_x}{v_y} = \frac{x - x_0}{y - y_0}   (6.178)

This constraint is applied to each vector (v_{x_i}, v_{y_i}) of the optical flow (dense or sparse) that contributes to the determination of the position (x_0, y_0) of the FOE. In fact, writing (6.178) in matrix form:

\begin{bmatrix} v_{y_i} & -v_{x_i} \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = x_i v_{y_i} - y_i v_{x_i}   (6.179)

we have a highly overdetermined linear system, and it is possible to estimate the FOE position (x_0, y_0) from the flow field (v_{x_i}, v_{y_i}) with the least-squares approach:

\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = (A^T A)^{-1} A^T \mathbf{b}   (6.180)

where



A = \begin{bmatrix} v_{y_1} & -v_{x_1} \\ \cdots & \cdots \\ v_{y_n} & -v_{x_n} \end{bmatrix} \qquad \mathbf{b} = \begin{bmatrix} x_1 v_{y_1} - y_1 v_{x_1} \\ \cdots \\ x_n v_{y_n} - y_n v_{x_n} \end{bmatrix}   (6.181)


The explicit coordinates of the FOE are

x_0 = \frac{\sum v_{x_i}^2 \sum v_{y_i} b_i - \sum v_{x_i} v_{y_i} \sum v_{x_i} b_i}{\sum v_{x_i}^2 \sum v_{y_i}^2 - \left( \sum v_{x_i} v_{y_i} \right)^2}
\qquad
y_0 = \frac{\sum v_{x_i} v_{y_i} \sum v_{y_i} b_i - \sum v_{y_i}^2 \sum v_{x_i} b_i}{\sum v_{x_i}^2 \sum v_{y_i}^2 - \left( \sum v_{x_i} v_{y_i} \right)^2}   (6.182)
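In code, the normal equations (6.180)-(6.182) reduce to a standard linear least-squares solve; the following NumPy sketch (with an illustrative function name) builds A and b from the flow vectors and their image positions.

```python
import numpy as np

def estimate_foe(x, y, vx, vy):
    """Least-squares FOE estimate, a sketch of Eqs. (6.179)-(6.181): every flow
    vector (vx_i, vy_i) measured at (x_i, y_i) contributes one row of A and b."""
    A = np.column_stack([vy, -vx])               # rows [v_yi, -v_xi]
    b = x * vy - y * vx                          # b_i = x_i v_yi - y_i v_xi
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)  # solves (A^T A) p = A^T b, Eq. (6.180)
    return foe                                   # (x0, y0)

# usage sketch: a noise-free radial field should return the true FOE
# x, y = np.random.uniform(-1, 1, 50), np.random.uniform(-1, 1, 50)
# vx, vy = (x - 0.2) / 5.0, (y + 0.1) / 5.0
# print(estimate_foe(x, y, vx, vy))              # approximately (0.2, -0.1)
```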

6.6.3.1 Calculation of TTC and Depth Knowing the FOE

Knowing the position (x_0, y_0) of the FOE (given by Eq. 6.182), calculated with adequate accuracy (as is the case with the linear least squares estimate), it is possible to estimate the collision time τ and the depth Z(x, y), still in the context of pure translational motion, in an alternative way to the procedure used in Sect. 6.6.2. This new approach combines the FOE constraint imposed by Eq. (6.178) with Eq. (6.9) of the optical flow, I_x v_x + I_y v_y + I_t = 0, described in Sect. 6.4. Dividing the optical flow equation by v_x and then combining it with the FOE constraint equation (and similarly for v_y), we obtain the flow equations in the form:

v_x = -\frac{(x - x_0) I_t}{(x - x_0) I_x + (y - y_0) I_y}
\qquad
v_y = -\frac{(y - y_0) I_t}{(x - x_0) I_x + (y - y_0) I_y}   (6.183)

where I_x, I_y, and I_t are the first partial spatio-temporal derivatives of adjacent images of the sequence. Combining then the flow equations (6.170) (valid in the context of translational motion of a rigid body) with the flow equations (6.183), we obtain the following relations:

\tau = \frac{Z}{T_z} = \frac{(x - x_0) I_x + (y - y_0) I_y}{I_t}
\qquad
Z(x, y) = T_z \, \frac{(x - x_0) I_x + (y - y_0) I_y}{I_t}   (6.184)

These equations express the collision time τ and the depth Z(x, y), respectively, in terms of the position of the FOE and the first derivatives of the images. The estimates of τ and Z(x, y) obtained with (6.184) may be more robust than those calculated with Eqs. (6.172) and (6.176), the position of the FOE having been determined (with the closed-form least-squares approach) considering only the direction of the optical flow vectors and not their magnitude.
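A per-pixel sketch of (6.184) follows; the image derivatives I_x, I_y, I_t are assumed to be precomputed (for example with finite differences), and the handling of small I_t values is an illustrative choice.

```python
import numpy as np

def ttc_from_derivatives(x, y, Ix, Iy, It, foe, eps=1e-9):
    """Sketch of Eq. (6.184): per-pixel collision time tau = Z / Tz from the FOE
    position and the spatio-temporal derivatives Ix, Iy, It (arrays of equal shape)."""
    x0, y0 = foe
    num = (x - x0) * Ix + (y - y0) * Iy
    den = np.where(np.abs(It) < eps, np.nan, It)   # avoid division by ~0
    return num / den

def depth_from_derivatives(x, y, Ix, Iy, It, foe, Tz):
    """Depth map of Eq. (6.184), given (or assuming) the translational speed Tz."""
    return Tz * ttc_from_derivatives(x, y, Ix, Iy, It, foe)
```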


Fig. 6.37 Geometry of the perspective projection of a point P, in the 3D space (X, Y, Z) and in the image plane (x, y), which moves to P′ = (X′, Y′, Z′) according to the translation (S_x, S_y, S_z) and rotation (R_x, R_y, R_z). In the image plane, the displacement vector (Δx, Δy) associated with the roto-translation is indicated

6.6.4 Estimation of Motion Parameters for a Rigid Body

The richness of information present in the optical flow can be used to estimate the motion parameters of a rigid body [45]. In autonomous vehicle applications and, in general, in the 3D reconstruction of the scene structure through the optical flow, it is possible to estimate the motion parameters of the vehicle and depth information. In general, for an autonomous vehicle it is interesting to know its own motion (egomotion) in a static environment. In the more general case, there may be several objects in the scene with different velocities; in this case the induced optical flow can be segmented to distinguish the motion of the various objects. The motion parameters are the translational and rotational velocity components associated with the points of an object sharing the same motion. We have already highlighted that, if the depth is unknown, only the rotation can be determined uniquely (it is invariant to depth, Eq. 6.161), while the translation parameters can be estimated only up to a scale factor. The high dimensionality of the problem and the nonlinearity of the equations derived from the optical flow make the problem complex. The accuracy of the estimate of the motion parameters and of the scene structure is related to the accuracy of the flow field, normally determined from sequences of images with good spatial and temporal resolution (using cameras with a high frame rate). In particular, the time interval Δt between images must be very small in order to estimate with good approximation the velocity Δx/Δt of a point that has moved by Δx between two consecutive images of the sequence. Consider the simple case of rigid motion where all the points of the object move with the same velocity and are projected perspectively in the image plane, as shown in Fig. 6.37. The direction of motion is always along the Z-axis in the positive direction, with the image plane (x, y) perpendicular to the Z-axis. The figure shows the position of a point P = (X, Y, Z) at time t and the new position P′ = (X′, Y′, Z′) after


the motion, at time t′. The perspective projections of P in the image plane at the two time instants are, respectively, p = (x, y) and p′ = (x′, y′). The 3D displacement of P to the new position is modeled by the following geometric transformation:

\mathbf{P'} = R \, \mathbf{P} + \mathbf{S}   (6.185)

where

R = \begin{bmatrix} 1 & -R_z & R_y \\ R_z & 1 & -R_x \\ -R_y & R_x & 1 \end{bmatrix} \qquad \mathbf{S} = \begin{bmatrix} S_x \\ S_y \\ S_z \end{bmatrix}   (6.186)

It follows that, according to (6.185), the coordinates of the new position of P are obtained from the coordinates of the initial position (X, Y, Z) multiplied by the matrix of the rotation parameters (R_x, R_y, R_z) and added to the translation parameters (S_x, S_y, S_z). Substituting R and S in (6.185) we have

X' = X - R_z Y + R_y Z + S_x
Y' = Y + R_z X - R_x Z + S_y
Z' = Z - R_y X + R_x Y + S_z
\quad \Longrightarrow \quad
\Delta X = X' - X = S_x - R_z Y + R_y Z
\Delta Y = Y' - Y = S_y + R_z X - R_x Z
\Delta Z = Z' - Z = S_z - R_y X + R_x Y   (6.187)

The determination of the motion is thus closely related to the calculation of the motion parameters (S_x, S_y, S_z, R_x, R_y, R_z), which depend on the geometric properties of the projection of the 3D points of the scene onto the image plane. With the perspective model of image formation given by Eq. (6.152), the projections of P and P′ in the image plane are, respectively, p = (x, y) and p′ = (x′, y′), with the relative displacements (Δx, Δy) given by

\Delta x = x' - x = f \frac{X'}{Z'} - f \frac{X}{Z}
\qquad
\Delta y = y' - y = f \frac{Y'}{Z'} - f \frac{Y}{Z}   (6.188)

In the context of rigid motion with very small 3D angular rotations, the 3D displacements (ΔX, ΔY, ΔZ) can be approximated by Eq. (6.187), which substituted into the perspective Eq. (6.188) gives the corresponding displacements (Δx, Δy) as follows:

\Delta x = x' - x = \frac{\frac{f S_x - S_z x}{Z} + f R_y - R_z y - R_x \frac{x y}{f} + R_y \frac{x^2}{f}}{1 + \frac{S_z}{Z} + R_x \frac{y}{f} - R_y \frac{x}{f}}

\Delta y = y' - y = \frac{\frac{f S_y - S_z y}{Z} - f R_x + R_z x + R_y \frac{x y}{f} - R_x \frac{y^2}{f}}{1 + \frac{S_z}{Z} + R_x \frac{y}{f} - R_y \frac{x}{f}}   (6.189)

The equations of the displacements (Δx, Δy) in the image plane at p = (x, y) are thus obtained in terms of the parameters (S_x, S_y, S_z, R_x, R_y, R_z) plus the depth Z of the point P = (X, Y, Z) of the 3D object, assuming the perspective


projection model. Furthermore, if the images of the sequence are acquired with a high frame rate, the components of the displacement (Δx, Δy) in the time interval between one image and the next are very small; it follows that the variable terms in the denominator of Eq. (6.189) are small compared to unity, that is,

\frac{S_z}{Z} + R_x \frac{y}{f} - R_y \frac{x}{f} \ll 1   (6.190)

With these assumptions, it is possible to derive the equations that relate the motion in the image plane to the motion parameters by differentiating Eqs. (6.189) with respect to time t. In fact, dividing (6.189) by the time interval Δt and letting Δt → 0, the displacements (Δx, Δy) approximate (become) the instantaneous velocities (v_x, v_y) in the image plane, known as optical flow. Similarly, in 3D space, the translation parameters (S_x, S_y, S_z) become the translation velocity, indicated with (T_x, T_y, T_z), and the rotation parameters (R_x, R_y, R_z) become the rotation velocity, indicated with (Ω_x, Ω_y, Ω_z). The resulting equations of the optical flow (v_x, v_y) correspond precisely to Eq. (6.160) of Sect. 6.6. Therefore, these equations of motion involve velocities both in 3D space and in the 2D image plane. However, in real vision systems the available information is that acquired from the sequence of spatio-temporal images, based on the displacements induced over very small time intervals according to the condition expressed in Eq. (6.190). In other words, we can approximate the 3D velocity of a point P = (X, Y, Z) with the equation V = T + Ω × P (see Sect. 6.6) of a rigid body that moves with translational velocity T = (T_x, T_y, T_z) and rotational velocity Ω = (Ω_x, Ω_y, Ω_z), while the relative velocity (v_x, v_y) of the 3D point projected at p = (x, y) in the image plane can be approximated by the displacements (Δx, Δy) as long as the constraint ΔZ/Z ≪ 1 is maintained, according to Eq. (6.190). With these assumptions, we can now address the problem of estimating the motion parameters (S_x, S_y, S_z, R_x, R_y, R_z) and Z starting from the measurements of the displacement vector (Δx, Δy) given by Eq. (6.189), which we can decompose into separate translational and rotational components, as follows:

(\Delta x, \Delta y) = (\Delta x_S, \Delta y_S) + (\Delta x_R, \Delta y_R)   (6.191)

with

\Delta x_S = \frac{f S_x - S_z x}{Z} \qquad \Delta y_S = \frac{f S_y - S_z y}{Z}   (6.192)

and

\Delta x_R = f R_y - R_z y - R_x \frac{x y}{f} + R_y \frac{x^2}{f}
\qquad
\Delta y_R = -f R_x + R_z x + R_y \frac{x y}{f} - R_x \frac{y^2}{f}   (6.193)


As previously observed, the rotational component does not depend on the depth Z of the 3D point of the scene. Given that the displacement vector (Δx, Δy) is available for each projected 3D point of the scene, the motion parameters (the unknowns) can be estimated with the least squares approach by setting up an error function e(S, R, Z) to be minimized, given by

e(\mathbf{S}, \mathbf{R}, Z) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{S_i} - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{S_i} - \Delta y_{R_i})^2   (6.194)

where (Δx_i, Δy_i) is the measurable displacement vector for each point i of the image plane, and (Δx_R, Δy_R) and (Δx_S, Δy_S) are, respectively, the rotational and translational components of the displacement vector. With the perspective projection model it is not possible to estimate an absolute value for the translation vector S and for the depth Z of each 3D point; in essence, they are estimated up to a scale factor. In fact, in the expression (6.192) of the translational component (Δx_S, Δy_S), multiplying both S and Z by the same constant c leaves the equation unaltered. Therefore, scaling the translation vector by a constant factor while increasing the depth by the same factor produces no change in the displacement vector in the image plane. From the displacement vector it is possible to estimate the direction of motion and the relative depth of the 3D point from the image plane. According to the strategy proposed in [45], it is useful to set up the error function by first eliminating this scale ambiguity between S and Z with a normalization process. Let U = (U_x, U_y, U_z) be the normalized motion direction vector and r the magnitude of the translation vector S = (S_x, S_y, S_z). The normalization of U is given by

(U_x, U_y, U_z) = \frac{(S_x, S_y, S_z)}{r}   (6.195)

Let Z̄ be the relative depth given by

\bar{Z}_i = \frac{r}{Z_i} \qquad \forall i   (6.196)

At this point, the translational component (6.192) can be rewritten in normalized form as

\Delta x_U = \frac{\Delta x_S}{\bar{Z}} = U_x - U_z x \qquad \Delta y_U = \frac{\Delta y_S}{\bar{Z}} = U_y - U_z y   (6.197)

Rewriting the error function (6.194) with respect to U (using Eq. 6.197) we get

e(\mathbf{U}, \mathbf{R}, \bar{Z}) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{U_i} \bar{Z}_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{U_i} \bar{Z}_i - \Delta y_{R_i})^2   (6.198)

We are now interested in minimizing this error function for all possible values of Z¯ i .


Differentiating Eq. (6.198) with respect to Z̄_i, setting the result to zero and solving for Z̄_i, we obtain

\bar{Z}_i = \frac{(\Delta x_i - \Delta x_{R_i}) \Delta x_{U_i} + (\Delta y_i - \Delta y_{R_i}) \Delta y_{U_i}}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2} \qquad \forall i   (6.199)

Finally, having estimated the relative depths Z̄_i, it is possible to substitute them into the error function (6.198) and obtain the following final formulation:

e(\mathbf{U}, \mathbf{R}) = \sum_{i=1}^{n} \frac{\left[ (\Delta x_i - \Delta x_{R_i}) \Delta y_{U_i} - (\Delta y_i - \Delta y_{R_i}) \Delta x_{U_i} \right]^2}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2}   (6.200)

With this artifice, the depth Z has been eliminated from the error function, which is now formulated only in terms of U and R. Once the motion parameters have been estimated with (6.200), it is finally possible to estimate with (6.199) the optimal depth values associated with each point i of the image plane. In the case of motion with pure rotation, the error function (6.200) to be minimized becomes

e(\mathbf{R}) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{R_i})^2   (6.201)

where the rotational motion parameters to be estimated are the three components of R. This is possible by differentiating the error function with respect to each of the components (R_x, R_y, R_z), setting the result to zero, and solving the three linear equations (recalling the non-dependence on Z), as proposed in [46]. The three linear equations obtained are

\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i}) x y + (\Delta y_i - \Delta y_{R_i})(y^2 + 1) \right] = 0
\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i})(x^2 + 1) + (\Delta y_i - \Delta y_{R_i}) x y \right] = 0
\sum_{i=1}^{n} \left[ (\Delta x_i - \Delta x_{R_i}) y + (\Delta y_i - \Delta y_{R_i}) x \right] = 0   (6.202)

where (Δx_i, Δy_i) is the ith displacement vector for the image point (x, y), and (Δx_{R_i}, Δy_{R_i}) are the rotational components of the displacement vector. As reported in [46], to estimate the rotation parameters (R_x, R_y, R_z), Eqs. (6.202) are expanded and rewritten in matrix form, obtaining

\begin{bmatrix} R_x \\ R_y \\ R_z \end{bmatrix} = \begin{bmatrix} a & d & f \\ d & b & e \\ f & e & c \end{bmatrix}^{-1} \begin{bmatrix} k \\ l \\ m \end{bmatrix}   (6.203)


where

a = \sum_{i=1}^{n} \left[ x^2 y^2 + (y^2 + 1)^2 \right] \qquad b = \sum_{i=1}^{n} \left[ (x^2 + 1)^2 + x^2 y^2 \right] \qquad c = \sum_{i=1}^{n} (x^2 + y^2)

d = \sum_{i=1}^{n} \left[ x y (x^2 + y^2 + 2) \right] \qquad e = \sum_{i=1}^{n} y \qquad f = \sum_{i=1}^{n} x

k = \sum_{i=1}^{n} \left[ u x y + v (y^2 + 1) \right] \qquad l = \sum_{i=1}^{n} \left[ u (x^2 + 1) + v x y \right] \qquad m = \sum_{i=1}^{n} (u y - v x)   (6.204)

6.7 Structure from Motion In this section, we will describe an approach known as Structure from Motion (SfM) to obtain information on the geometry of the 3D scene starting from a sequence of 2D images acquired by a single uncalibrated camera (the intrinsic parameters and its position are not known). The three-dimensional perception of the world is a feature common to many living beings. We have already described the primary mechanism used by the ster eopsis human vision system (i.e., the lateral displacement of objects in two retinal images) to perceive the depth and obtain 3D information of the scene by fusing the two retinal images. The computer vision community has developed different approaches for 3D reconstruction of the scene starting from 2D images observed from multiple points of view. One approach is to find the correspondence of points of interest of the 3D scene (as happens in the stereo vision) observed in 2D multiview images and by triangulation construct a 3D trace of these points. More formally (see Fig. 6.38), given n projected points (xi j ; i = 1, n and j = 1, m) in the m 2D images, the goal is to find all projection matrices P j ; j = 1, . . . , m (associated with motion) consisting of the structure of n observed 3D points (X i ; i = 1, . . . , n). Fundamental to the SfM approach is the knowledge of the camera projection matrix (or camera matrix) that is linked to the geometric model of image formation (camera model).

6.7.1 Image Projection Matrix

Normally, the simple model of perspective projection is assumed (see Fig. 6.39), which corresponds to the ideal pinhole model where a 3D point Q = (X, Y, Z), whose coordinates are expressed in the reference system of the camera, is projected onto the image plane in q, whose coordinates x = (x, y) are related to it by


Fig. 6.38 Observation of the scene from slightly different points of view, obtaining a sequence of m images. n points of interest of the 3D scene are taken from the m images to reconstruct the 3D structure of the scene by estimating the projection matrices P_j associated with the m observations

Fig. 6.39 Intrinsic and extrinsic parameters in the pinhole model. A point Q, in the camera's 3D reference system (X, Y, Z), is projected into the image plane in q = (x, y), whose coordinates are defined with respect to the principal point c = (c_x, c_y) according to Eq. (6.206). The transformation of the coordinates (x, y) of the image plane into the sensor coordinates, expressed in pixels, is defined by the intrinsic parameters with Eq. (6.207), which takes into account the translation of the principal point c and the sensor resolution. The transformation of the 3D point coordinates of a rigid body from the world reference system (X_w, Y_w, Z_w) (with origin in O) to the reference system of the camera (X, Y, Z), with origin in the projection center C, is defined by the extrinsic parameters characterized by the roto-translation vectors R, T according to Eq. (6.210)


the known nonlinear equations x = f X/Z and y = f Y/Z, where f indicates the focal length, which has the effect of introducing a change of scale of the image. In the figure, for simplicity, the image plane is shown in front of the optical projection center C; in reality, the image plane is located on the opposite side, with the image of the scene inverted. Expressing the perspective projection in homogeneous coordinates, we have the following matrix form:

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f X/Z \\ f Y/Z \\ 1 \end{bmatrix} \cong
\begin{bmatrix} f X \\ f Y \\ Z \end{bmatrix} =
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}   (6.205)

Setting f = 1, we can rewrite (6.205), up to an arbitrary scale factor, as follows:

\begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix} \approx
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\;\Longrightarrow\; \tilde{x} \approx A\tilde{X}   (6.206)

where ≈ indicates that the projection \tilde{x} is defined up to a scale factor. Furthermore, \tilde{x} is independent of the magnitude of X: it depends only on the direction of the 3D point relative to the camera and not on how far away it is. The matrix A represents the geometric model of the camera and is known as the canonical perspective projection matrix.
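As a small illustration of Eqs. (6.205)–(6.206), here is a sketch (assuming NumPy; the function name is an illustrative choice, not part of the text) that projects camera-frame 3D points with the ideal pinhole model.

```python
import numpy as np

# Canonical perspective projection matrix A of Eq. (6.206) (case f = 1).
A = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

def project_pinhole(X_cam, f=1.0):
    """Project 3D points given in the camera frame (shape Nx3) to normalized
    image coordinates with the ideal pinhole model x = f*X/Z, y = f*Y/Z."""
    X_cam = np.asarray(X_cam, dtype=float)
    X_h = np.hstack([X_cam, np.ones((X_cam.shape[0], 1))])   # homogeneous coords
    x_h = (np.diag([f, f, 1.0]) @ A @ X_h.T).T               # Eq. (6.205)
    return x_h[:, :2] / x_h[:, 2:3]                          # divide by Z

print(project_pinhole([[0.2, -0.1, 2.0]]))   # -> [[0.1, -0.05]]
```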

6.7.1.1 Intrinsic Calibration Parameters

Another aspect to consider is the relation that transforms the coordinates \tilde{x} of the points q projected in the image plane into the coordinates \tilde{u} = [u\; v\; 1]^T of the sensor, expressed in pixels. This transformation takes into account the discretization of the image and the translation of the image coordinates with respect to the principal point (u_0, v_0) (the projection in the image plane of the perspective projection center). The principal point corresponds to the intersection of the optical axis with the image plane of the camera (see Fig. 6.39). To take into account the translation of the principal point with respect to the origin of the 2D reference system of the image, it is sufficient to add the horizontal and vertical translation components defined by (u_0, v_0) to the equation of the perspective projection (6.206). To transform the coordinates expressed in units of length into pixels, it is necessary to know the horizontal resolution p_x and the vertical resolution p_y of the sensor, normally expressed in pixel/mm.10

10 The physical position of a point projected in the image plane, normally expressed in metric units, for example in mm, must be transformed into units of the image sensor expressed in pixels, which typically do not correspond to a metric unit such as mm. The physical image plane is discretized by the pixels of the sensor, characterized by its horizontal and vertical spatial resolution


Assuming the translation of the central point (u_0, v_0) is already expressed in pixels, the coordinates in the image plane expressed in pixels are given by

u = p_x \frac{f X}{Z} + u_0 = \alpha_u \frac{X}{Z} + u_0 \qquad
v = p_y \frac{f Y}{Z} + v_0 = \alpha_v \frac{Y}{Z} + v_0   (6.207)

where α_u = p_x f and α_v = p_y f are the horizontal and vertical scale factors that transform the physical coordinates of the points projected in the image plane into pixels, and u_0 = [u_0\; v_0]^T is the principal point of the sensor, already expressed in pixels. The parameters α_u, α_v, u_0, and v_0 represent the intrinsic (or internal) parameters of the camera. Normally the pixels are square, so that α_u = α_v = α, and α can be regarded as the focal length of the optical system expressed in pixel units. An additional intrinsic parameter to consider is the skew parameter s of the sensor (usually assumed to be zero), given by s = α_v \tan\beta, a factor that takes into account the nonrectangular shape of the pixels. Considering (6.207), the 2D transformation that relates the points projected in the image plane to the sensor coordinate system is given by

\begin{bmatrix} \tilde{u} \\ \tilde{v} \\ 1 \end{bmatrix} \approx
\begin{bmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix}
\;\Longrightarrow\; \tilde{u} = K\tilde{x}   (6.208)

where K is a 3 × 3 triangular matrix, known as the camera calibration matrix, to be determined for each optical system used.
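The following minimal sketch (assuming NumPy; names and the sample numbers are illustrative) builds the calibration matrix K of Eq. (6.208) and maps normalized image coordinates to pixel coordinates.

```python
import numpy as np

def intrinsic_matrix(alpha_u, alpha_v, u0, v0, s=0.0):
    """Camera calibration matrix K of Eq. (6.208)."""
    return np.array([[alpha_u, s,       u0],
                     [0.0,     alpha_v, v0],
                     [0.0,     0.0,     1.0]])

def to_pixels(K, x_norm):
    """Map normalized image coordinates (Nx2) to pixels via u~ = K x~."""
    x_h = np.hstack([np.asarray(x_norm, float), np.ones((len(x_norm), 1))])
    u_h = (K @ x_h.T).T
    return u_h[:, :2] / u_h[:, 2:3]

K = intrinsic_matrix(alpha_u=800, alpha_v=800, u0=320, v0=240)
print(to_pixels(K, [[0.1, -0.05]]))   # -> [[400., 200.]]
```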

6.7.1.2 Extrinsic Calibration Parameters

A more general projection model, which relates 3D points of the scene to the 2D image plane, involves the geometric transformation that maps 3D points of a rigid body, expressed in homogeneous coordinates in the world reference system X_w = [X_w\; Y_w\; Z_w]^T, into the reference system of the camera X = [X\; Y\; Z]^T, also expressed in homogeneous coordinates (see Fig. 6.39). In essence, the reference system of the objects of the world (with origin in O_w) is linked to the

expressed in pixels/mm. Therefore, the transformation of coordinates from mm to pixels introduces a horizontal scale factor p_x = npix_x/dim_x and a vertical one p_y = npix_y/dim_y with which to multiply the physical image coordinates (x, y). In particular, npix_x × npix_y represents the horizontal and vertical resolution of the sensor in pixels, while dim_x × dim_y indicates the horizontal and vertical dimensions of the sensor given in mm. Often, to define the pixel rectangles, the aspect ratio is given as the ratio of width to height of the pixel (usually expressed in decimal form, for example 1.25, or in fractional form such as 5/4 to avoid the problem of periodic decimal approximation). Furthermore, the coordinates must be translated with respect to the position of the principal point expressed in pixels, since the center of the sensor's pixel grid does not always coincide with the position of the principal point. The pixel coordinate system of the sensor is indicated with (u, v), with the principal point (u_0, v_0) given in pixels and the axes assumed parallel to those of the physical system (x, y). The accuracy of the coordinates in pixels with respect to the physical ones depends on the resolution of the sensor and its dimensions.


reference system of the camera by a geometric relation that includes the camera orientation (through the rotation matrix R of size 3 × 3) and the translation vector T (a 3D vector indicating the position of the origin O_w with respect to the camera reference system). This transformation, in compact matrix form, is given by

X = RX_w + T   (6.209)

while in homogeneous coordinates it is written as follows:

\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} =
\begin{bmatrix} R_{3\times 3} & T_{3\times 1} \\ 0_{1\times 3} & 1 \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
\;\Longrightarrow\; \tilde{X} = M\tilde{X}_w   (6.210)

where M is the 4 × 4 roto-translation matrix whose elements represent the extrinsic (or external) parameters that characterize the transformation between the coordinates of the world and those of the camera.11 In particular, the rotation matrix R, although it has 9 elements, has only 3 degrees of freedom, corresponding to the rotation angles around the three axes of the reference system. It is convenient to express the rotations about the three axes through a single matrix operation instead of three separate elementary matrices for the three rotation angles (φ, θ, ψ) associated, respectively, with the axes (X, Y, Z). The translation vector has 3 elements, so the extrinsic parameters are 6 in total and characterize the position and attitude of the camera. Combining now the perspective projection Eq. (6.206), the camera calibration Eq. (6.208), and the rigid body transformation Eq. (6.210), we obtain a single equation that relates a 3D point, expressed in homogeneous world coordinates \tilde{X}_w, to its projection in the image plane, expressed in homogeneous pixel coordinates \tilde{u}, given by

\tilde{u} \approx P\tilde{X}_w   (6.211)

where P is the perspective projection matrix, of size 3 × 4, expressed in its most general form:

P_{3\times 4} = K_{3\times 3}\,[R_{3\times 3}\;\; T_{3\times 1}]   (6.212)

In essence, the perspective projection matrix P defined by (6.212) includes: the simple perspective transformation defined by A (Eq. 6.206), the effects of discretization

11 The roto-translation transformation expressed by (6.209) indicates that the rotation R is performed first and then the translation T. It is often reported with the operations inverted, that is, first the translation and then the rotation, thus having X = R(X_w − T) = RX_w + (−RT); in this case, in Eq. (6.210), the translation term T is replaced with −RT.


of the image plane associated with the sensor through the matrix K (Eq. 6.208), and the transformation that relates the position of the camera with respect to the scene by means of the matrix M (Eq. 6.210). The transformation (6.211) is based only on the pinhole perspective projection model and does not include the effects due to distortions introduced by the optical system, normally modeled with other parameters that describe radial and tangential distortions (see Sect. 4.5 Vol. I).
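As a recap of Eqs. (6.211)–(6.212), the following is a minimal sketch (assuming NumPy; function names and the numerical values are illustrative) that assembles P = K[R | T] and projects world points to pixels. Lens distortion is not modeled here, as in (6.211).

```python
import numpy as np

def projection_matrix(K, R, T):
    """Perspective projection matrix P = K [R | T] of Eq. (6.212), size 3x4."""
    return K @ np.hstack([R, T.reshape(3, 1)])

def project_world_points(P, Xw):
    """Project world points (Nx3) to pixel coordinates via u~ ~ P X~w (Eq. 6.211)."""
    Xw_h = np.hstack([np.asarray(Xw, float), np.ones((len(Xw), 1))])
    u_h = (P @ Xw_h.T).T
    return u_h[:, :2] / u_h[:, 2:3]

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
R = np.eye(3)                     # camera axes aligned with the world axes
T = np.array([0., 0., 2.0])       # world origin 2 units in front of the camera
P = projection_matrix(K, R, T)
print(project_world_points(P, [[0.0, 0.0, 0.0]]))   # -> principal point [320, 240]
```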

6.7.2 Methods of Structure from Motion

Starting from Eq. (6.211), we can now analyze the methods proposed to solve the problem of 3D scene reconstruction by capturing a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (uncalibrated camera), and without knowledge of the motion [47,48]. The proposed methods amount to solving an inverse problem. In fact, with Eq. (6.211) we want to reconstruct the 3D structure of the scene (and the motion), that is, to calculate the homogeneous coordinates of the n points \tilde{X}_i (for simplicity we indicate, from now on, the 3D points of the scene without the subscript w) whose projections \tilde{u}_{ij}, in homogeneous coordinates, are known, detected in m images characterized by the associated m unknown perspective projection matrices P_j. Essentially, the problem reduces to estimating the m projection matrices P_j and the n 3D points \tilde{X}_i, given the m · n correspondences \tilde{u}_{ij} found in the sequence of m images (see Fig. 6.38). We observe that with (6.211) the scene is reconstructed only up to a scale factor, having considered a perspective projection. In fact, if the points of the scene are scaled by a factor λ and we simultaneously scale the projection matrix by a factor 1/λ, the points of the scene projected in the image plane remain exactly the same:

\tilde{u} \approx P\tilde{X} = \left(\frac{1}{\lambda}P\right)(\lambda\tilde{X})   (6.213)

Therefore, the scene cannot be reconstructed with an absolute scale. For recognition applications, even if the structure of the reconstructed scene only resembles the real one and is recovered with an arbitrary scale, it still provides useful information. The methods proposed in the literature use the algebraic approach [49] (based on the Fundamental matrix described in Chap. 7), the factorization approach (based on the singular value decomposition, SVD), and the bundle adjustment approach [50,51], which iteratively refines the motion parameters and the 3D structure of the scene by minimizing an appropriate cost functional. In the following section, we will describe the factorization method.

6.7.2.1 Factorization Method

The factorization method proposed in [52] determines the scene structure and motion information from a sequence of images by assuming an orthographic projec-


tion. This simplifies the geometric model of projection of the 3D points in the image plane, whose distance from the camera can be considered irrelevant (the scale factor due to the object–camera distance is ignored). It is assumed that the depth of the object is very small compared to the observation distance. In this context, no motion information is detected along the optical axis (Z-axis). The orthographic projection is a particular case of the perspective one,12 where the orthographic projection matrix is:

\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\;\Longrightarrow\; \begin{cases} x = X \\ y = Y \end{cases}   (6.214)

Combining Eq. (6.214) with the roto-translation matrix (6.210), we obtain an affine projection:

\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} =
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}   (6.215)

from which, simplifying (eliminating the last column of the first matrix and the last row of the second matrix) and expressing in nonhomogeneous coordinates, the orthographic projection combined with the extrinsic parameters of the roto-translation is obtained:

\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} +
\begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = RX + T   (6.216)

From (6.213), we know that we cannot determine the absolute positions of the 3D points. Before factorizing, it is worth further simplifying (6.216) by assuming the origin of the reference system of the 3D points to coincide with their center of mass, namely:

\frac{1}{n}\sum_{i=1}^{n} X_i = 0   (6.217)

Now we can center the points in each image of the sequence by subtracting from the coordinates x_{ij} the coordinates of their center of mass \bar{x}_{ij}, and we indicate with \tilde{x}_{ij} the new coordinates:

\tilde{x}_{ij} = x_{ij} - \frac{1}{n}\sum_{k=1}^{n} x_{ik}
= R_j X_i + T_j - \frac{1}{n}\sum_{k=1}^{n}\left(R_j X_k + T_j\right)
= R_j\left(X_i - \underbrace{\frac{1}{n}\sum_{k=1}^{n} X_k}_{=0}\right) = R_j X_i   (6.218)

12 The distance between the projection center and the image plane is assumed infinite, with focal length f → ∞ and parallel projection lines.

We are now able to factorize, with (6.218), by aggregating the centered data into large matrices. In particular, the coordinates of the centered 2D points \tilde{x}_{ij} are placed in a single matrix W, organized into two submatrices U and V, each of size m × n. In the m rows of the submatrix U are placed the horizontal coordinates of the n centered 2D points relative to the m images; similarly, in the m rows of the submatrix V the vertical coordinates of the n centered 2D points are placed. Thus we obtain the matrix W = [U; V], called the matrix of the measures, of size 2m × n. In analogy to W, we can construct the rotation matrix M = [R1; R2] relative to all the m images, indicating with R1_j = [r_{11}\; r_{12}\; r_{13}] and R2_j = [r_{21}\; r_{22}\; r_{23}], respectively, the rows of the camera rotation matrix (Eq. 6.216) representing the motion information.13 Rewriting Eq. (6.218) in matrix form, we get

\underbrace{\begin{bmatrix}
\tilde{x}_{11} & \tilde{x}_{12} & \cdots & \tilde{x}_{1n} \\
\vdots & & & \vdots \\
\tilde{x}_{m1} & \tilde{x}_{m2} & \cdots & \tilde{x}_{mn} \\
\tilde{y}_{11} & \tilde{y}_{12} & \cdots & \tilde{y}_{1n} \\
\vdots & & & \vdots \\
\tilde{y}_{m1} & \tilde{y}_{m2} & \cdots & \tilde{y}_{mn}
\end{bmatrix}}_{2m\times n} =
\underbrace{\begin{bmatrix}
R1_1 \\ \vdots \\ R1_m \\ R2_1 \\ \vdots \\ R2_m
\end{bmatrix}}_{2m\times 3}
\underbrace{\begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}}_{3\times n}   (6.219)

In addition, by aggregating all the coordinates of the n 3D points centered in the matrix S = [ X 1 X 2 ··· X n ], called matrix of the 3D structure of the scene, we have the factorization of the measure matrix W as product of the motion matrix M and the structure matrix S, which in compact matrix form is W = MS

(6.220)

By virtue of the rank theorem, the matrix of the observed centered measures W, of size 2m × n, has rank at most 3. This statement follows immediately from the properties of the rank: the rank of a matrix of size m × n is at most the minimum between m and n, and the rank of a product matrix A · B is at

13 Recall that the rows of R represent the coordinates in the original space of the unit vectors along the coordinate axes of the rotated space, while the columns of R represent the coordinates in the rotated space of unit vectors along the axes of the original space.


most the minimum between the rank of A and that of B. Applying this to the factorization M · S, we immediately get that the rank of W is at most 3. The importance of the rank theorem lies in the fact that the 2m × n measures taken from the sequence of images are highly redundant for reconstructing the 3D scene. It also tells us that, quantitatively, the 2m × 3 motion information and the 3 × n coordinates of the 3D points would be sufficient to reconstruct the 3D scene. Unfortunately, neither of these is known, and to solve the problem of reconstructing the structure from motion the factorization method has been proposed, which treats it as an overdetermined system that can be solved with the least squares method based on the singular value decomposition (SVD). The SVD approach involves the following decomposition of W:

\underbrace{W}_{2m\times n} = \underbrace{U}_{2m\times 2m}\,\underbrace{\Sigma}_{2m\times n}\,\underbrace{V^T}_{n\times n}   (6.221)

where the matrices U and V are orthogonal (U^T U = V^T V = I, where I is the identity matrix), and Σ is a diagonal matrix with nonnegative elements in descending order σ_{11} ≥ σ_{22} ≥ · · ·, uniquely determined by W. The values σ_i(W) ≥ 0 are the singular values of the decomposition of W, and the columns u_i and v_i are the orthonormal eigenvectors of the symmetric matrices WW^T and W^T W, respectively. It should be noted that the properties of W are closely related to the properties of these symmetric matrices. The singular values σ_i(W) coincide with the nonnegative square roots of the eigenvalues λ_i of the symmetric matrices WW^T and W^T W. Returning to the factorization of W, and considering that the rank of a matrix equals the number of its nonzero singular values, it is required that the first three values σ_i, i = 1, ..., 3, be nonzero, while the rest are zero. In practice, due to noise, the nonzero singular values are more than 3, but by virtue of the rank theorem we can ignore all the others. Under these conditions, the SVD decomposition given by Eq. (6.221) can be restricted to the first three columns of U, the first three rows of V^T, and the 3 × 3 diagonal block of Σ, thus obtaining the following truncated SVD decomposition:

\underbrace{\hat{W}}_{2m\times n} = \underbrace{U'}_{2m\times 3}\,\underbrace{\Sigma'}_{3\times 3}\,\underbrace{V'^T}_{3\times n}   (6.222)

Essentially, according to the rank theorem, considering only the three largest singular values of W and the corresponding left and right eigenvectors, with (6.222) we get the best estimate of the motion and structure information. Therefore, \hat{W} can be considered a good estimate of the ideal observed measures W, which we can define as

\underbrace{\hat{W}}_{2m\times n} = \underbrace{\left(U'\,[\Sigma']^{1/2}\right)}_{2m\times 3}\;\underbrace{\left([\Sigma']^{1/2}\,V'^T\right)}_{3\times n}
\;\Longrightarrow\; \hat{W} = \hat{M}\hat{S}   (6.223)


where the matrices \hat{M} and \hat{S}, even if different from the ideal M and S, still retain the motion information of the camera and the structure of the scene, respectively. It is pointed out that, apart from noise, the matrices \hat{M} and \hat{S} are, respectively, a linear transformation of the true motion matrix M and of the true scene structure matrix S. If the observed measure matrix W is acquired with a frame rate adequate for the camera motion, the noise level can be low enough to be ignored. This can be checked by analyzing the singular values of \hat{W}, verifying that the ratio of the third to the fourth singular value is sufficiently large. We also point out that the decomposition obtained with (6.223) is not unique. In fact, any invertible matrix Q of size 3 × 3 would produce an identical decomposition of \hat{W} with the matrices \hat{M}Q and Q^{-1}\hat{S}, as follows:

\hat{W} = (\hat{M}Q)(Q^{-1}\hat{S}) = \hat{M}(QQ^{-1})\hat{S} = \hat{M}\hat{S}   (6.224)

Another problem concerns the row pairs R1^T, R2^T of \hat{M},14 which may not necessarily be orthogonal. To solve these problems, we can find a matrix Q such that appropriate rows of \hat{M} satisfy some metric constraints. Indeed, considering the matrix \hat{M} as a linear transformation of the true motion matrix M (and similarly for the scene structure matrix), we can find a matrix Q such that M = \hat{M}Q and S = Q^{-1}\hat{S}. The matrix Q is found by observing that the rows of the true motion matrix M, considered as 3D vectors, must have unit norm, and the first m rows R1^T must be orthogonal to the corresponding last m rows R2^T. Therefore, the solution for Q is found with the system of equations deriving from the following metric constraints:

\hat{R1}_i^T\,QQ^T\,\hat{R1}_i = 1
\hat{R2}_i^T\,QQ^T\,\hat{R2}_i = 1   (6.225)
\hat{R1}_i^T\,QQ^T\,\hat{R2}_i = 0

This system of 3 nonlinear equations can be solved by setting C = QQ^T, solving the system with respect to C (which has 6 unknowns, the 6 elements of the symmetric matrix C), and then recovering Q with iterative methods, for example the Newton method, or with factorization methods based on Cholesky or SVD. The described factorization method is summarized in the following steps:

1. Compose the matrix of the centered measurements W from the n 3D points observed through tracking on the m images of the sequence acquired by a moving camera with an appropriate frame rate. Better results are obtained with a motion such that the images of the sequence guarantee the tracking of the points with few

14 Recall here, from the properties of the rotation matrix R, that it is normalized, that is, the sum of the squares of the elements in a row or in a column is equal to 1, and it is orthogonal, i.e., the inner product of any pair of rows or any pair of columns is 0.


occlusions and the distance from the scene is much greater than the depth of the objects of the scene to be reconstructed.

2. Organize the centered 2D measures in the matrix W such that each pair of rows, the jth and the (j+m)th, contains the horizontal and vertical coordinates of the n 3D points projected in the jth image, while a column of W contains in the first m elements the horizontal coordinates of a 3D point observed in the m images, and in the last m elements the vertical coordinates of the same point. It follows that W has dimensions 2m × n.

3. Calculate the decomposition W = UΣV^T with the SVD method, which produces the following matrices:
– U of size 2m × 2m.
– Σ of size 2m × n.
– V^T of size n × n.

4. From the original decomposition, consider the decomposition truncated to the three largest singular values, forming the following matrices:
– U' of size 2m × 3, which corresponds to the first three column vectors of U.
– Σ' of size 3 × 3, which corresponds to the diagonal matrix with the three largest singular values.
– V'^T of size 3 × n, which corresponds to the first three row vectors of V^T.

5. Define the motion matrix \hat{M} = U'[Σ']^{1/2} and the structure matrix \hat{S} = [Σ']^{1/2}V'^T.

6. Eliminate the ambiguities by solving for Q so as to make the rows of \hat{M} orthogonal.

7. The final solution is given by \breve{M} = \hat{M}Q and \breve{S} = Q^{-1}\hat{S}.

Figure 6.40 shows the 3D reconstruction of the interior of an archaeological cave called Grotta dei Cervi,15 with paintings dating back to the middle Neolithic period made with red ocher and bat guano. The walls have an irregular surface characterized by cavities and prominences that give them a strong three-dimensional structure. The sequence of images and the process of detecting the points of interest (with the Harris method), together with the evaluation of the correspondences, are based on the approach described in [53], while the 3D reconstruction of the scene is realized with the factorization method. The 3D scene was recon-

15 The Grotta dei Cervi (Deer Cave) is located in Porto Badisco near Otranto in Apulia, Italy, at a depth of 26 m below sea level and represents an important cave: it is, in fact, the most impressive Neolithic pictorial complex in Europe, discovered only in 1970.


Fig. 6.40 Results of the 3D scene reconstructed with the factorization method. The first row shows the three images of the sequence with the points of interest used. The second row shows the 3D image reconstructed as described in [53]

structed with VRML software16 using a triangular mesh built on the points of interest shown on the three images and superimposing the texture of the 2D images.
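To make the steps listed above concrete, the following is a compact numerical sketch (assuming NumPy; not the authors' implementation, and the synthetic data are illustrative) of steps 1–5 of the factorization method, with the metric upgrade of steps 6–7 only hinted at in a comment.

```python
import numpy as np

def factorize(W):
    """Factorization of the centered measurement matrix W (2m x n): rows 1..m
    hold the centered horizontal coordinates of the n tracked points in the m
    frames, rows m+1..2m the vertical ones. Returns the affine motion matrix
    M_hat (2m x 3) and structure matrix S_hat (3 x n) as in Eq. (6.223)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]          # rank-3 truncation (6.222)
    M_hat = U3 * np.sqrt(s3)                          # U' * Sigma'^(1/2)
    S_hat = np.sqrt(s3)[:, None] * Vt3                # Sigma'^(1/2) * V'^T
    # Steps 6-7 (not shown): solve the metric constraints (6.225) for a 3x3
    # invertible Q and use M_hat @ Q and inv(Q) @ S_hat as the final solution.
    return M_hat, S_hat

# Synthetic example: m orthographic views of n centered 3D points.
rng = np.random.default_rng(1)
n, m = 20, 6
S = rng.normal(size=(3, n)); S -= S.mean(axis=1, keepdims=True)   # centered points
rows_u, rows_v = [], []
for _ in range(m):
    Rj, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random rotation per frame
    rows_u.append(Rj[0] @ S)                          # centered horizontal coords
    rows_v.append(Rj[1] @ S)                          # centered vertical coords
W = np.vstack(rows_u + rows_v)
M_hat, S_hat = factorize(W)
print(np.allclose(M_hat @ S_hat, W))                  # True on noise-free data
```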

References

1. J.J. Gibson, The Perception of the Visual World (Sinauer Associates, 1995)
2. T. D'Orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, A. Distante, A visual system for real time detection of goal events during soccer matches. Comput. Vis. Image Underst. 113 (2009a), 622–632
3. T. D'Orazio, M. Leo, P. Spagnolo, P.L. Mazzeo, N. Mosca, M. Nitti, A. Distante, An investigation into the feasibility of real-time soccer offside detection from a multiple camera system. IEEE Trans. Circuits Syst. Video Surveill. 19(12), 1804–1818 (2009b)
4. A. Distante, T. D'Orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, E. Stella, Method and system for the detection and the classification of events during motion actions. Patent PCT/IB2006/051209, International Publication Number (IPN) WO/2006/111928 (2006)
5. L. Capozzo, A. Distante, T. D'Orazio, M. Ianigro, M. Leo, P.L. Mazzeo, N. Mosca, M. Nitti, P. Spagnolo, E. Stella, Method and system for the detection and the classification of events during motion actions. Patent PCT/IB2007/050652, International Publication Number (IPN) WO/2007/099502 (2007)

16 Virtual Reality Modeling Language (VRML) is a programming language that allows the simulation of three-dimensional virtual worlds. With VRML it is possible to describe virtual environments that include objects, light sources, images, sounds, and movies.


6. B.A. Wandell, Book Rvw: Foundations of vision. By B.A. Wandell. J. Electron. Imaging 5(1), 107 (1996) 7. B.K.P. Horn, B.G. Schunck, Determining optical flow. Artif. Intell. 17, 185–203 (1981) 8. Ramsey Faragher, Understanding the basis of the kalman filter via a simple and intuitive derivation. IEEE Signal Process. Mag. 29(5), 128–132 (2012) 9. P. Musoff, H. Zarchan, Fundamentals of Kalman Filtering: A Practical Approach. (American Institute of Aeronautics and Astronautics, Incorporated, 2000). ISBN 1563474557, 9781563474552 10. E. Meinhardt-Llopis, J.S. Pérez, D. Kondermann, Horn-schunck optical flow with a multi-scale strategy. Image Processing On Line, 3, 151–172 (2013). https://doi.org/10.5201/ipol.2013.20 11. H.H. Nagel, Displacement vectors derived from second-order intensity variations in image sequences. Comput. Vis., Graph. Image Process. 21, 85–117 (1983) 12. B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in Proceedings of Imaging Understanding Workshop (1981), pp. 121–130 13. J.L. Barron, D.J. Fleet, S. Beauchemin, Performance of optical flow techniques, Performance of optical flow techniques. Int. J. Comput. Vis. 12(1), 43–77 (1994) 14. T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on a theory for warping. in Proceedings of the European Conference on Computer Vision, vol. 4 (2004), pp. 25–36 15. M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise smooth flow fields. Comput. Vis. Image Underst. 63(1), 75–104 (1996) 16. J. Wang, E. Adelson, Representing moving images with layers. Proc. IEEE Trans. Image Process. 3(5), 625–638 (1994) 17. E. Mémin, P. Pérez, Hierarchical estimation and segmentation of dense motion fields. Int. J. Comput. Vis. 46(2), 129–155 (2002) 18. Simon Baker, Iain Matthews, Lucas-kanade 20 years on: a unifying framework. Int. J. Comput. Vis. 56(3), 221–255 (2004) 19. H.-Y. Shum, R. Szeliski, Construction of panoramic image mosaics with global and local alignment. Int. J. Comput. Vis. 16(1), 63–84 (2000) 20. David G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004) 21. F. Marino, E. Stella, A. Branca, N. Veneziani, A. Distante, Specialized hardware for real-time navigation. R.-Time Imaging, Acad. Press 7, 97–108 (2001) 22. Stephen T. Barnard, William B. Thompson, Disparity analysis of images. IEEE Trans. Pattern Anal. Mach. Intell. 2(4), 334–340 (1980) 23. M. Leo, T. D’Orazio, P. Spagnolo, P.L. Mazzeo, A. Distante, Sift based ball recognition in soccer images, in Image and Signal Processing, vol. 5099, ed. by A. Elmoataz, O. Lezoray, F. Nouboud, D. Mammass (Springer, Berlin, Heidelberg, 2008), pp. 263–272 24. M. Leo, P.L. Mazzeo, M. Nitti, P. Spagnolo, Accurate ball detection in soccer images using probabilistic analysis of salient regions. Mach. Vis. Appl. 24(8), 1561–1574 (2013) 25. T. D’Orazio, M. Leo, C. Guaragnella, A. Distante, A new algorithm for ball recognition using circle hough transform and neural classifier. Pattern Recognit. 37(3), 393–408 (2004) 26. T. D’Orazio, N. Ancona, G. Cicirelli, M. Nitti, A ball detection algorithm for real soccer image sequences, in Proceedings of the 16th International Conference on Pattern Recognition (ICPR’02), vol. 1 (2002), pp. 201–213 27. Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking and Navigation (Wiley, 2001). ISBN 0-471-41655-X, 0-471-22127-9 28. S.C.S. Cheung, C. 
Kamath, Robust techniques for background subtraction in urban traffic video. Vis. Commun. Image Process. 5308, 881–892 (2004) 29. C.R. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. 19(7), 780–785 (1997)


30. D. Makris, T. Ellis, Path detection in video surveillance. Image Vis. Comput. 20, 895–903 (2002) 31. C. Stauffer, W.E. Grimson, Adaptive background mixture models for real-time tracking, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2 (1999) 32. A Elgammal, D. Harwood, L. Davis, Non-parametric model for background subtraction, in European Conference on Computer Vision (2000), pp. 751–767 33. N.M. Oliver, B. Rosario, A.P. Pentland, A bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000) 34. R. Li, Y. Chen, X. Zhang, Fast robust eigen-background updating for foreground detection, in International Conference on Image Processing (2006), pp. 1833–1836 35. A. Sobral, A. Vacavant, A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos. Comput. Vis. Image Underst. 122(05), 4–21 (2014). https:// doi.org/10.1016/j.cviu.2013.12.005 36. Y. Benezeth, P.M. Jodoin, B. Emile, H. Laurent, C. Rosenberger, Comparative study of background subtraction algorithms. J. Electron. Imaging 19(3) (2010) 37. M. Piccardi, T. Jan, Mean-shift background image modelling. Int. Conf. Image Process. 5, 3399–3402 (2004) 38. B. Han, D. Comaniciu, L. Davis, Sequential kernel density approximation through mode propagation: applications to background modeling, in Proceedings of the ACCV-Asian Conference on Computer Vision (2004) 39. D.J. Heeger, A.D. Jepson, Subspace methods for recovering rigid motion i: algorithm and implementation. Int. J. Comput. Vis. 7, 95–117 (1992) 40. H.C. Longuet-Higgins, K. Prazdny, The interpretation of a moving retinal image. Proc. R. Soc. Lond. 208, 385–397 (1980) 41. H.C. Longuet-Higgins, The visual ambiguity of a moving plane. Proc. R. Soc. Lond. 223, 165–175 (1984) 42. W. Burger, B. Bhanu, Estimating 3-D egomotion from perspective image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 12(11), 1040–1058 (1990) 43. A. Branca, E. Stella, A. Distante, Passive navigation using focus of expansion, in WACV96 (1996), pp. 64–69 44. G. Convertino, A. Branca, A. Distante, Focus of expansion estimation with a neural network, in IEEE International Conference on Neural Networks, 1996, vol. 3 (IEEE, 1996), pp. 1693–1697 45. G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 7(4), 384–401 (1985) 46. A.R. Bruss, B.K.P. Horn, Passive navigation. Comput. Vis., Graph., Image Process. 21(1), 3–20 (1983) 47. M. Armstong, A. Zisserman, P. Beardsley, Euclidean structure from uncalibrated images, in British Machine Vision Conference (1994), pp. 509–518 48. T.S. Huang, A.N. Netravali, Motion and structure from feature correspondences: a review. Proc. IEEE 82(2), 252–267 (1994) 49. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput. Vis. 8(2), 123–151 (1992) 50. D.C. Brown, The bundle adjustment - progress and prospects. Int. Arch. Photogramm. 21(3) (1976) 51. W. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment - a modern synthesis, in Vision Algorithms: Theory and Practice ed. by W. Triggs, A. Zisserman, R. Szeliski (Springer, Berlin, 2000), pp. 298–375 52. C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vis. 9(2), 137–154 (1992) 53. T. Gramegna, L. Venturino, M. Ianigro, G. Attolico, A. 
Distante, Pre-historical cave fruition through robotic inspection, in IEEE International Conference on Robotics and Automation (2005)

7 Camera Calibration and 3D Reconstruction

7.1 Introduction

In Chap. 2 and Sect. 5.6 of Vol. I, we described the radiometric and geometric aspects of an imaging system, respectively. In Sect. 6.7.1, we instead introduced the geometric model of projection of the 3D world onto the image plane. Let us now see, once a geometric projection model has been defined, the aspects involved in camera calibration to correct all the sources of geometric distortion introduced by the optical system (radial and tangential distortions, …) and by the digitization system (sensor noise, quantization error, …), often not provided by vision system manufacturers. In various applications, a camera is used to extract metric information of the scene from the image. For example, in the dimensional control of an object it is required to perform accurate control measurements, while a mobile vehicle equipped with a vision system is required to self-locate, that is, to estimate its position and orientation with respect to the scene. Therefore, a calibration procedure of the camera becomes necessary, which determines the relative intrinsic parameters (focal length, the horizontal and vertical dimension of the single photoreceptor of the sensor or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, the coordinates of the principal point or optical center) and the extrinsic parameters. The latter define the geometric transformation to pass from the world reference system to the camera system (the 3 translation parameters and the 3 rotation parameters around the coordinate axes) described in Sect. 6.7.1. While the intrinsic parameters define the internal characteristics of the acquisition system, independent of the position and attitude of the camera, the extrinsic parameters describe its position and attitude regardless of its internal parameters. The level of accuracy of these parameters defines the accuracy of the measurements derivable from the image. With reference to Sect. 6.7.1, the geometric model underlying the image formation process is described by Eq. (6.211), which we rewrite as follows:

\tilde{u} \approx P\tilde{X}_w   (7.1)


where \tilde{u} indicates the coordinates in the image plane, expressed in pixels (taking into account the position and orientation of the sensor in the image plane), of 3D points with coordinates \tilde{X}_w, expressed in the world reference system, and P is the perspective projection matrix, of size 3 × 4, expressed in its most general form:

P_{3\times 4} = K_{3\times 3}\,[R_{3\times 3}\;\; T_{3\times 1}]   (7.2)

The matrix P, defined by (7.2), represents the most general perspective projection matrix and includes:

1. The simple canonical perspective transformation defined by the matrix A (Eq. 6.206), according to the pinhole model;
2. The effects of discretization of the image plane associated with the sensor, through the matrix K (Eq. 6.208);
3. The geometric transformation that relates the position and orientation of the camera with respect to the scene, through the roto-translation matrix M (Eq. 6.210).

Essentially, in the camera calibration process, the matrix K is the matrix of intrinsic parameters (at most 5) that models the pinhole perspective projection of 3D points, expressed in camera coordinates, into 2D image coordinates, together with the further transformation needed to take into account the displacement of the sensor in the image plane, that is, the offset of the projection of the principal point on the sensor (see Fig. 6.39). The intrinsic parameters describe the internal characteristics of the camera regardless of its location and position with respect to the scene. Therefore, with the calibration process all the intrinsic parameters of the camera are determined, which correspond to the elements of the matrix K. In various applications, it is useful to define the observed scene with respect to an arbitrary 3D reference system instead of the camera reference system (see Fig. 6.39). The equation that performs the transformation from one system to the other is (6.209), where M is the roto-translation matrix of size 4 × 4 whose elements represent the extrinsic (or external) parameters that characterize the transformation between world and camera coordinates. This matrix models a geometric relationship that includes the orientation of the camera (through the rotation matrix R of size 3 × 3) and the translation T (a 3D vector indicating the position of the origin O_w with respect to the reference system of the camera), as shown in Fig. 6.39 (under the hypothesis of rigid motion).

7.2 Influence of the Optical System

The transformation (7.1) is based only on the pinhole perspective projection model and does not contemplate the effects due to the distortions introduced by the optical system, normally modeled with other parameters that describe the radial and tan-


gential distortions (see Sect. 4.5 Vol. I). These distortions are very pronounced when using optics with a large angle of view and in low-cost optical systems. The radial distortion generates an inward or outward displacement of a 3D point projected in the image plane with respect to its ideal position. It is essentially caused by a defect in the radial curvature of the lenses of the optical system. A negative radial displacement of an image point generates the barrel distortion, that is, it causes the more external points to crowd increasingly toward the optical axis, with the magnification decreasing as the distance from the axis increases. A positive radial displacement instead generates the pincushion distortion, that is, it causes the more external points to spread out, with the magnification increasing as the distance from the axis increases. This type of distortion has circular symmetry around the optical axis (see Fig. 4.25 Vol. I). The tangential or decentering distortion is instead caused by the displacement of the lens center with respect to the optical axis. This error is controlled through the coordinates of the principal point (\tilde{x}_0, \tilde{y}_0) in the image plane. Experimentally, it has been observed that the radial distortion is dominant. Although both optical distortions are generated by complex physical phenomena, they can be modeled with acceptable accuracy by a polynomial function D(r), where the variable r is the radial distance of the image points (\tilde{x}, \tilde{y}) (obtained with the ideal pinhole projection, see Eq. 6.206) from the principal point. In essence, the optical distortions affect the pinhole perspective projection coordinates and can be corrected before or after the transformation into sensor coordinates, depending on the chosen camera calibration method. Therefore, having obtained with (6.206) the distorted projections (\tilde{x}, \tilde{y}) of the 3D points of the world in the image plane, the coordinates corrected for radial distortion, indicated with (\tilde{x}_c, \tilde{y}_c), are obtained as follows:

\tilde{x}_c = \tilde{x}_0 + (\tilde{x} - \tilde{x}_0)\cdot D(r, k) \qquad
\tilde{y}_c = \tilde{y}_0 + (\tilde{y} - \tilde{y}_0)\cdot D(r, k)   (7.3)

where (\tilde{x}_0, \tilde{y}_0) is the principal point and D(r, k) is the function that models the nonlinear effects of the radial distortion, given by

D(r, k) = k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots   (7.4)

with

r = \sqrt{(\tilde{x} - \tilde{x}_0)^2 + (\tilde{y} - \tilde{y}_0)^2}   (7.5)

The truncation of the terms with powers of r greater than the sixth degree makes a negligible contribution, and those terms can be assumed to be null. Experimental tests have shown that 2–3 coefficients k_i are sufficient to correct almost 95% of the radial distortion for a medium-quality optical system. To include also the tangential distortion, which attenuates the effects of the lens decentering, the tangential correction component is added to Eq. (7.3), obtaining the following equations:

\tilde{x}_c = \tilde{x}_0 + (\tilde{x} - \tilde{x}_0)\left(k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots\right) + \left[p_1\left(r^2 + 2(\tilde{x} - \tilde{x}_0)^2\right) + 2p_2(\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0)\right]\left(1 + p_3 r^2 + \cdots\right)   (7.6)

\tilde{y}_c = \tilde{y}_0 + (\tilde{y} - \tilde{y}_0)\left(k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots\right) + \left[2p_1(\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0) + p_2\left(r^2 + 2(\tilde{y} - \tilde{y}_0)^2\right)\right]\left(1 + p_3 r^2 + \cdots\right)   (7.7)

As already mentioned, the radial distortion is dominant and is characterized, with this approximation, by the coefficients (k_i, i = 1, ..., 3), while the tangential distortion is characterized by the coefficients (p_i, i = 1, ..., 3), all of which can be obtained through an ad hoc calibration process by projecting sample patterns on a flat surface, for example grids with vertical and horizontal lines (or other types of patterns), and then acquiring the distorted images. Knowing the geometry of the projected patterns, and calculating the coordinates of the distorted patterns with Eqs. (7.6) and (7.7), we can find the optimal distortion coefficients through a system of nonlinear equations and a nonlinear regression method.
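As a practical illustration, the following sketch (assuming NumPy) applies a radial-plus-tangential correction with the coefficient names of Eqs. (7.6)–(7.7). Note that it is written in the widely used Brown-Conrady form, where the correction is added to the observed point itself rather than anchored to the principal point; the function name and sample numbers are illustrative, not part of the text.

```python
import numpy as np

def correct_distortion(x, y, x0, y0, k=(0.0, 0.0, 0.0), p=(0.0, 0.0)):
    """Radial + tangential correction in the spirit of Eqs. (7.6)-(7.7):
    the corrected point is the observed point plus a radial term (k1, k2, k3)
    and a tangential term (p1, p2), both expressed with respect to the
    principal point (x0, y0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x0, y - y0
    r2 = dx**2 + dy**2                               # r^2 of Eq. (7.5)
    radial = k[0] * r2 + k[1] * r2**2 + k[2] * r2**3  # D(r,k) of Eq. (7.4)
    xc = x + dx * radial + (p[0] * (r2 + 2 * dx**2) + 2 * p[1] * dx * dy)
    yc = y + dy * radial + (2 * p[0] * dx * dy + p[1] * (r2 + 2 * dy**2))
    return xc, yc

print(correct_distortion(1.2, 0.8, 1.0, 0.5, k=(0.05, 0.0, 0.0)))
```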

7.3 Geometric Transformations Involved in Image Formation

Before analyzing the calibration methods of an acquisition system, we summarize, with reference to Sect. 6.7.1, the geometric transformations involved in the whole process of image formation (see Fig. 7.1).

1. Transformation from the world's 3D coordinates (X_w, Y_w, Z_w) to the camera's 3D coordinates (X, Y, Z) through the rotation matrix R and the translation vector T of the coordinate axes according to Eq. (6.209) which, expressed in homogeneous coordinates in compact form, is \tilde{X} = M\tilde{X}_w, where M is the roto-translation matrix whose elements represent the extrinsic parameters.


Fig. 7.1 Sequence of projections: 3D → 2D → 2D → 2D. a Pinhole perspective projection of 3D points into the 2D image plane (in the camera's reference system), expressed in normalized coordinates given by Eq. (6.205); b 2D transformation in the image plane of the normalized coordinates (x, y), mapped into the coordinates (x_c, y_c) corrected for radial and tangential distortions; c Further 2D (affine) transformation according to the camera's intrinsic parameter matrix to express the coordinates, corrected for the distortions, in pixel coordinates (u, v) of the sensor


2. Pinhole perspective projection of the 3D points X = (X, Y, Z) from the camera's reference system to the 2D image plane in coordinates x = (x, y), given by Eq. (6.206) which, expressed in homogeneous coordinates in compact form, is \tilde{x} = A\tilde{X}, where A is the canonical perspective projection matrix. The image coordinates are not distorted since they are ideal perspective projections (we assume focal length f = 1).

3. Application of Eq. (7.3) to the coordinates \tilde{x} (obtained in the previous step with the pinhole model), affected by the dominant radial distortion, thus obtaining the new corrected coordinates \tilde{x}_c in the image plane.

4. Application of the affine transformation given by (6.208), \tilde{u} = K\tilde{x}_c, characterized by the camera calibration matrix K (see Sect. 6.7.1), to pass from the reference system of the image plane (coordinates \tilde{x}_c = (\tilde{x}_c, \tilde{y}_c) corrected for radial distortion) to the sensor reference system with the new coordinates expressed in pixels, indicated with \tilde{u} = (\tilde{u}, \tilde{v}). Recall that the 5 elements of the triangular calibration matrix K are the intrinsic parameters of the camera.
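The four steps above can be chained directly. The following is a minimal sketch (assuming NumPy; function names and values are illustrative), where the distortion-handling step is passed in as a user-supplied correction function like the one sketched after Eq. (7.7).

```python
import numpy as np

def world_to_pixels(Xw, R, T, K, correct=lambda x, y: (x, y)):
    """Chain the four transformations of Sect. 7.3 for world points Xw (Nx3):
    1) world -> camera:      X = R Xw + T           (Eq. 6.209)
    2) pinhole projection:   x = X/Z, y = Y/Z       (Eq. 6.206, f = 1)
    3) distortion handling:  (x, y) -> (xc, yc)     (Eq. 7.3, user-supplied)
    4) intrinsics:           u~ = K xc~             (Eq. 6.208)"""
    Xw = np.asarray(Xw, float)
    Xc = (R @ Xw.T).T + T                                  # step 1
    x, y = Xc[:, 0] / Xc[:, 2], Xc[:, 1] / Xc[:, 2]        # step 2
    xc, yc = correct(x, y)                                 # step 3 (identity by default)
    uvw = (K @ np.vstack([xc, yc, np.ones_like(xc)])).T    # step 4
    return uvw[:, :2] / uvw[:, 2:3]

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
print(world_to_pixels([[0.1, 0.05, 0.0]], np.eye(3), np.array([0., 0., 2.]), K))
# -> [[360., 260.]]
```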

7.4 Camera Calibration Methods

Camera calibration is a strategic procedure in different applications where a vision machine is used for 3D reconstruction of the scene, to make measurements on observed objects, and for several other activities. For these purposes, it is essential that the calibration process recover the intrinsic and extrinsic parameters of a vision machine with adequate accuracy. Although camera manufacturers give the level of accuracy of the camera's internal parameters, these must be verified through the acquisition of ad hoc images using calibration platforms whose reference fiducial patterns have a well-known geometry. The various calibration procedures can capture multiple images of a flat calibration platform observed from different points of view, or capture a single image of a platform with at least two different planes of known patterns (see Fig. 7.2), or of a platform with linear structures at various inclinations. By acquiring these platforms from various points of view, with different calibration methods it is possible to estimate the intrinsic and extrinsic parameters. Unlike the previous methods, a camera autocalibration approach avoids the use of a calibration platform and determines the intrinsic parameters that are consistent with the geometry of a given image set [1]. This is possible by finding a sufficient number of pattern matches between at least three images to recover both intrinsic



Fig. 7.2 Calibration platforms. a Flat calibration platform observed from different view points; b Platform having at least two different planes of patterns observed from a single view; c Platform with linear structures

and extrinsic parameters. Since autocalibration is based on the pattern correspondences determined between the images, it is important for this method that the detection of the corresponding patterns be accurate. With this approach, the distortion of the optical system is not considered. In the literature, calibration methods based on vanishing points have been proposed, exploiting parallelism and orthogonality between lines in 3D space. These approaches rely heavily on the process of detecting edges and linear structures in order to determine the vanishing points accurately. The intrinsic parameters and the camera rotation matrix can be estimated from three mutually orthogonal vanishing points [2]. The complexity and accuracy of the calibration algorithms also depend on whether all the intrinsic and extrinsic parameters need to be known and whether optical distortions need to be removed or not. For example, in some cases it may be sufficient to use methods that do not require the estimation of the focal length f and of the location of the principal point (u_0, v_0), since only the relationship between the coordinates of the world's 3D points and their 2D coordinates in the image plane is required. The basic calibration approaches derive from photogrammetry, which solves the problem by minimizing a nonlinear error function to estimate the geometric and physical parameters of a camera. Less complex solutions have been proposed in computer vision by simplifying the camera model, using linear and nonlinear systems and operating in real time. Tsai [3] has proposed a two-stage algorithm that will be described in the next section, and a modified version of this algorithm [4] has a four-stage extension including the Direct Linear Transformation (DLT) method in the two stages. Other calibration methods [5] are based on the estimation of the perspective projection matrix P by acquiring an image of calibration platforms with noncoplanar patterns (at least two pattern planes, as shown in Fig. 7.2b), selecting at least 6 3D points, and automatically detecting the 2D coordinates of their projections in the image plane. Zhang [6] describes instead an algorithm that requires at least two images, acquired from different points of view, of a flat pattern platform (see Fig. 7.2a, c).


7.4.1 Tsai Method

The Tsai method [3], also called direct parameter calibration (i.e., it directly recovers the intrinsic and extrinsic parameters of the camera), uses as calibration platform two orthogonal planes with equally spaced black square patterns on a white background. Of these patterns we know the entire geometry (number, size, spacing, …) and their position in the 3D reference system of the world, integral with the calibration platform (see Fig. 7.2b). The acquisition system (normally a camera) is placed in front of the platform and, through ordinary image processing algorithms, all the patterns of the calibration platform are automatically detected and the position of each pattern in the image plane is determined in the reference system of the camera. It is then possible to find the correspondence between the 2D patterns of the image plane and the visible 3D patterns of the platform. The relationship between the 3D world coordinates X_w = [X_w\; Y_w\; Z_w]^T and the 3D camera coordinates X = [X\; Y\; Z]^T in the context of the rigid roto-translation transformation (see Sect. 6.7) is given by Eq. (6.209), which rewritten in explicit matrix form results in the following:

X = RX_w + T = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} +
\begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}   (7.8)

where R is the rotation matrix of size 3 × 3 and T is the 3D translation vector indicating the position of the origin O_w with respect to the camera reference system (see Fig. 6.39), which schematizes more generally the projection of 3D points of a rigid body between the world reference system X_w = [X_w\; Y_w\; Z_w]^T and the camera reference system X = [X\; Y\; Z]^T. In essence, (7.8) maps a generic point Q of the scene (in this case, a pattern of the calibration platform) between the two 3D spaces, the one in the reference system of the platform and that of the camera. From (7.8), we explicitly get the 3D coordinates in the camera reference system, given by

X = r_{11}X_w + r_{12}Y_w + r_{13}Z_w + T_x
Y = r_{21}X_w + r_{22}Y_w + r_{23}Z_w + T_y   (7.9)
Z = r_{31}X_w + r_{32}Y_w + r_{33}Z_w + T_z

The relation between the 3D camera coordinates X = [X\; Y\; Z]^T and the 2D image coordinates, expressed in pixels u = (u, v), is given by Eq. (6.207). Substituting Eqs. (7.9) into it, we obtain the following relations to pass from the world coordinates X_w = [X_w\; Y_w\; Z_w]^T to the image coordinates expressed in pixels:

u - u_0 = \alpha_u \frac{X}{Z} = \alpha_u \frac{r_{11}X_w + r_{12}Y_w + r_{13}Z_w + T_x}{r_{31}X_w + r_{32}Y_w + r_{33}Z_w + T_z} \qquad
v - v_0 = \alpha_v \frac{Y}{Z} = \alpha_v \frac{r_{21}X_w + r_{22}Y_w + r_{23}Z_w + T_y}{r_{31}X_w + r_{32}Y_w + r_{33}Z_w + T_z}   (7.10)


where α_u and α_v are the horizontal and vertical scale factors expressed in pixels, and u_0 = [u_0\; v_0]^T is the sensor's principal point. The parameters α_u, α_v, u_0, and v_0 represent the intrinsic (or internal) parameters of the camera which, together with the extrinsic (or external) parameters R and T, are the unknown parameters. Normally the pixels are square, so that α_u = α_v = α, and α is considered the focal length of the optical system expressed in units of the pixel size. The sensor skew parameter s, neglected in this case (assumed zero), is a parameter that takes into account the non-rectangularity of the pixel area (see Sect. 6.7). This method assumes that the image coordinates (u_0, v_0) of the principal point are known (normally assumed to be the center of the image sensor) and considers as unknowns the parameters α_u, α_v, R, and T. To simplify, let us assume (u_0, v_0) = (0, 0) and denote with (u_i, v_i) the ith projection in the image plane of the 3D calibration patterns. Thus the left-hand sides of the previous equations become, respectively, u_i − 0 = u_i and v_i − 0 = v_i. After these simplifications, dividing the two projection equations member by member and considering only the first and third members of (7.10), for each ith projected pattern we get the following equation:

u_i\,\alpha_v\left(r_{21}X_w^{(i)} + r_{22}Y_w^{(i)} + r_{23}Z_w^{(i)} + T_y\right) = v_i\,\alpha_u\left(r_{11}X_w^{(i)} + r_{12}Y_w^{(i)} + r_{13}Z_w^{(i)} + T_x\right)   (7.11)

If we now divide both members of (7.11) by α_v, indicate with α = α_u/α_v the aspect ratio of the pixel, and define the following symbols:

ν_1 = r_{21} \quad ν_2 = r_{22} \quad ν_3 = r_{23} \quad ν_4 = T_y
ν_5 = \alpha r_{11} \quad ν_6 = \alpha r_{12} \quad ν_7 = \alpha r_{13} \quad ν_8 = \alpha T_x   (7.12)

then (7.11) can be rewritten in the following form:

u_i X_w^{(i)} ν_1 + u_i Y_w^{(i)} ν_2 + u_i Z_w^{(i)} ν_3 + u_i ν_4 - v_i X_w^{(i)} ν_5 - v_i Y_w^{(i)} ν_6 - v_i Z_w^{(i)} ν_7 - v_i ν_8 = 0   (7.13)

For the N known projections (X_w^{(i)}, Y_w^{(i)}, Z_w^{(i)}) → (u_i, v_i), i = 1, ..., N, we have N such equations, obtaining a system of linear equations in the 8 unknowns ν = (ν_1, ν_2, ..., ν_8), given by

Aν = 0   (7.14)

where

A = \begin{bmatrix}
u_1 X_w^{(1)} & u_1 Y_w^{(1)} & u_1 Z_w^{(1)} & u_1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} & -v_1 Z_w^{(1)} & -v_1 \\
u_2 X_w^{(2)} & u_2 Y_w^{(2)} & u_2 Z_w^{(2)} & u_2 & -v_2 X_w^{(2)} & -v_2 Y_w^{(2)} & -v_2 Z_w^{(2)} & -v_2 \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
u_N X_w^{(N)} & u_N Y_w^{(N)} & u_N Z_w^{(N)} & u_N & -v_N X_w^{(N)} & -v_N Y_w^{(N)} & -v_N Z_w^{(N)} & -v_N
\end{bmatrix}   (7.15)

It is shown [7] that if N ≥ 7 and the points are not coplanar, then the matrix A has rank 7 and there exists a nontrivial solution, which is the eigenvector corresponding to the zero eigenvalue of A^T A. In essence, the system can be solved by factorization of the matrix A with the SVD approach (Singular Value Decomposition), that is, A = UΣV^T, where we recall that the diagonal elements of Σ are the singular values of A; the nontrivial solution ν of the system is proportional


to the column vector of V corresponding to the smallest singular value of A. This solution corresponds to the last column vector of V, which we can indicate with \bar{ν} and which, up to a proportionality factor κ, is a solution ν of the system:

ν = κ\bar{ν} \qquad \text{or} \qquad \bar{ν} = γν   (7.16)

where γ = 1/κ. According to the symbols reported in (7.12), using the last relation of (7.16) and assuming that the solution is given by the eigenvector \bar{ν}, we have

(\bar{ν}_1, \bar{ν}_2, \bar{ν}_3, \bar{ν}_4, \bar{ν}_5, \bar{ν}_6, \bar{ν}_7, \bar{ν}_8) = γ\,(r_{21}, r_{22}, r_{23}, T_y, \alpha r_{11}, \alpha r_{12}, \alpha r_{13}, \alpha T_x)   (7.17)

Now let us see how to evaluate the various parameters involved in the previous equation.

Calculation of α and γ. These parameters can be calculated considering that the rotation matrix is by definition an orthogonal matrix;1 therefore, according to (7.17), we can set the following relations:

\|(\bar{ν}_1, \bar{ν}_2, \bar{ν}_3)\|_2 = \sqrt{\bar{ν}_1^2 + \bar{ν}_2^2 + \bar{ν}_3^2} = \sqrt{γ^2\underbrace{\left(r_{21}^2 + r_{22}^2 + r_{23}^2\right)}_{=1}} = |γ|   (7.18)

\|(\bar{ν}_5, \bar{ν}_6, \bar{ν}_7)\|_2 = \sqrt{\bar{ν}_5^2 + \bar{ν}_6^2 + \bar{ν}_7^2} = \sqrt{γ^2\alpha^2\underbrace{\left(r_{11}^2 + r_{12}^2 + r_{13}^2\right)}_{=1}} = \alpha|γ|   (7.19)

1 By definition, an orthogonal matrix R has the following properties:

1. R is normalized, that is, the sum of the squares of the elements of any row or column is 1 (\|r_i\|^2 = 1, i = 1, 2, 3).
2. The inner product of any pair of rows or columns is zero (r_1^T r_3 = r_2^T r_3 = r_2^T r_1 = 0).
3. From the first two properties follows the orthonormality of R, i.e., RR^T = R^T R = I, its invertibility, that is, R^{-1} = R^T (its inverse coincides with its transpose), and the fact that the determinant of R has modulus 1 (det(R) = 1), where I indicates the identity matrix.
4. The rows of R represent the coordinates, in the space of origin, of unit vectors along the coordinate axes of the rotated space.
5. The columns of R represent the coordinates, in the rotated space, of unit vectors along the axes of the space of origin. In essence, they represent the direction cosines of the rotated triad axes with respect to the triad of origin.
6. Each row and each column are orthonormal to each other, as they are orthonormal bases of the space. It therefore follows that, given two row or column vectors of the matrix, r_1 and r_2, it is possible to determine the third basis vector as the vector product r_3 = r_1 × r_2.


The scale factor $\gamma$ is determined up to sign with (7.18), while by definition the aspect ratio $\alpha > 0$ is calculated with (7.19). Explicitly, $\alpha$ can be determined as $\alpha = \|(\bar{\nu}_5, \bar{\nu}_6, \bar{\nu}_7)\|_2 / \|(\bar{\nu}_1, \bar{\nu}_2, \bar{\nu}_3)\|_2$. At this point, knowing $\alpha$ and the modulus $|\gamma|$, and knowing the solution vector $\bar{\nu}$, we can calculate the remaining parameters, although only up to sign, since the sign of $\gamma$ is not yet known.

Calculation of $T_x$ and $T_y$. Again considering the estimated vector $\bar{\nu}$ in (7.17), we have

$$T_x = \frac{\bar{\nu}_8}{\alpha|\gamma|} \qquad T_y = \frac{\bar{\nu}_4}{|\gamma|} \tag{7.20}$$

determined, again, up to sign.

Calculation of the rotation matrix R. The elements of the first two rows of the rotation matrix are given by

$$(r_{11}, r_{12}, r_{13}) = \frac{(\bar{\nu}_5, \bar{\nu}_6, \bar{\nu}_7)}{|\gamma|\,\alpha} \qquad (r_{21}, r_{22}, r_{23}) = \frac{(\bar{\nu}_1, \bar{\nu}_2, \bar{\nu}_3)}{|\gamma|} \tag{7.21}$$

determined, again, up to sign. The elements of the third row $r_3$, which we recall are not included in system (7.14), are obtained by exploiting the orthonormality of the matrix R: they are given by the vector product of $r_1 = (r_{11}, r_{12}, r_{13})$ and $r_2 = (r_{21}, r_{22}, r_{23})$, the first two rows of R calculated with the previous equations:

$$r_3 = r_1 \times r_2 = \begin{bmatrix} r_{31} \\ r_{32} \\ r_{33} \end{bmatrix} = \begin{bmatrix} r_{12}r_{23} - r_{13}r_{22} \\ r_{13}r_{21} - r_{11}r_{23} \\ r_{11}r_{22} - r_{12}r_{21} \end{bmatrix} \tag{7.22}$$

Verification of the orthogonality of the rotation matrix R. Recall that the elements of R have been calculated starting from the estimated eigenvector $\bar{\nu}$, a possible solution of system (7.14) obtained with the SVD approach. Therefore, it is necessary to verify that the computed matrix satisfies the properties of an orthogonal matrix, whose rows and columns form orthonormal bases, that is, $\hat{R}\hat{R}^T = I$, where $\hat{R}$ denotes the estimate of R. The orthogonality of $\hat{R}$ can be imposed again using the SVD approach by computing $\hat{R} = UIV^T$, where the diagonal matrix $\Sigma$ of the singular values is replaced with the identity matrix $I$.

Determine the sign of $\gamma$ and calculate $T_z$, $\alpha_u$ and $\alpha_v$. The sign of $\gamma$ is determined by checking that the signs of $u$ and $X$, and of $v$ and $Y$ (see Eq. 7.10), are consistent with the geometric configuration of the camera and of the calibration platform (see Fig. 6.39). Observing Eq. (7.10), both members must have the same sign. If we substitute into these equations the image-plane coordinates $(u, v)$ and the coordinates $(X, Y, Z)$, in the camera reference system, of a generic point used for the calibration, we can analyze the sign of $T_x$, $T_y$ and the sign of the elements of $r_1$ and $r_2$. In fact, we know that $\alpha_u$ and $\alpha_v$ are positive, as is $Z$ (the denominator of Eq. 7.10), considering the origin of the camera reference system and the position of the camera itself (see


Fig. 6.39). Therefore, the sign of the first member must agree with that of the numerator in both equations, that is, the following conditions must hold:

$$u\,(r_{11}X_w + r_{12}Y_w + r_{13}Z_w + T_x) > 0 \qquad v\,(r_{21}X_w + r_{22}Y_w + r_{23}Z_w + T_y) > 0 \tag{7.23}$$

If these conditions are satisfied, the signs are correct and the elements of $r_1$ and $r_2$ are left unchanged, together with the translation values $T_x$, $T_y$; otherwise, the signs of these parameters are reversed. Having determined the parameters R, $T_x$, $T_y$, $\alpha$, it remains to estimate $T_z$ and $\alpha_u$ (recall from the definition of the aspect ratio $\alpha$ that $\alpha_u = \alpha\alpha_v$). Let us reconsider the first equation of (7.10) and rewrite it in the following form:

$$u\,(r_{31}X_w + r_{32}Y_w + r_{33}Z_w + T_z) = \alpha_u(r_{11}X_w + r_{12}Y_w + r_{13}Z_w + T_x) \tag{7.24}$$

where $T_z$ and $\alpha_u$ are the unknowns. We can then determine $T_z$ and $\alpha_u$ by setting up and solving a system of equations, similar to system (7.14), obtained by considering again the N calibration points:

$$M\begin{bmatrix} T_z \\ \alpha_u \end{bmatrix} = b \tag{7.25}$$

where



$$M = \begin{bmatrix} u_1 & -(r_{11}X_w^{(1)} + r_{12}Y_w^{(1)} + r_{13}Z_w^{(1)} + T_x) \\ u_2 & -(r_{11}X_w^{(2)} + r_{12}Y_w^{(2)} + r_{13}Z_w^{(2)} + T_x) \\ \cdots & \cdots \\ u_N & -(r_{11}X_w^{(N)} + r_{12}Y_w^{(N)} + r_{13}Z_w^{(N)} + T_x) \end{bmatrix} \tag{7.26}$$

$$b = \begin{bmatrix} -u_1(r_{31}X_w^{(1)} + r_{32}Y_w^{(1)} + r_{33}Z_w^{(1)}) \\ -u_2(r_{31}X_w^{(2)} + r_{32}Y_w^{(2)} + r_{33}Z_w^{(2)}) \\ \cdots \\ -u_N(r_{31}X_w^{(N)} + r_{32}Y_w^{(N)} + r_{33}Z_w^{(N)}) \end{bmatrix} \tag{7.27}$$

The solution of this overdetermined linear system can be obtained with the SVD or with the Moore–Penrose pseudo-inverse method, obtaining

$$\begin{bmatrix} T_z \\ \alpha_u \end{bmatrix} = (M^TM)^{-1}M^Tb \tag{7.28}$$

Knowing $\alpha_u$, we can calculate $\alpha_v$ as $\alpha_v = \alpha_u/\alpha$.
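To make the linear steps above concrete, the following minimal NumPy sketch (the array names `uv` and `XYZ` are hypothetical, not from the text) solves the homogeneous system (7.14) by SVD and recovers $|\gamma|$, $\alpha$, the first two rows of R and $T_x$, $T_y$ as in (7.18)–(7.22). It is only an outline of the procedure described here, not a reference implementation, and it omits the sign test (7.23) and the final estimation of $T_z$ and $\alpha_u$.

```python
import numpy as np

def direct_calibration_linear(uv, XYZ):
    """Linear stage of the direct calibration: solve A nu = 0 (7.14) by SVD,
    then recover |gamma|, alpha, the first two rows of R and Tx, Ty
    as in Eqs. (7.18)-(7.22). uv is (N,2), XYZ is (N,3)."""
    u, v = uv[:, 0], uv[:, 1]
    X, Y, Z = XYZ[:, 0], XYZ[:, 1], XYZ[:, 2]
    # Rows of A as in (7.15)
    A = np.column_stack([u*X, u*Y, u*Z, u, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(A)
    nu = Vt[-1]                                    # nu_bar, up to scale and sign
    gamma = np.linalg.norm(nu[0:3])                # |gamma|, Eq. (7.18)
    alpha = np.linalg.norm(nu[4:7]) / gamma        # aspect ratio, Eq. (7.19)
    Ty, Tx = nu[3] / gamma, nu[7] / (alpha*gamma)  # Eq. (7.20), up to sign
    r2 = nu[0:3] / gamma                           # second row of R, Eq. (7.21)
    r1 = nu[4:7] / (alpha*gamma)                   # first row of R, Eq. (7.21)
    r3 = np.cross(r1, r2)                          # third row, Eq. (7.22)
    return alpha, gamma, np.vstack([r1, r2, r3]), Tx, Ty
```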


Fig. 7.3 Orthocenter method for calculating the optical center through the three vanishing points generated by the intersection of the edges of the same color

The coordinates $u_0$ and $v_0$ of the principal point can be calculated by virtue of the orthocenter theorem.² If in the image plane a triangle is defined by three vanishing points generated by three groups of lines that are parallel and mutually orthogonal in 3D space (i.e., the vanishing points correspond to 3 orthogonal directions of the 3D world), then the principal point coincides with the orthocenter of that triangle. The same calibration platform (see Fig. 7.3) can be used to generate the three vanishing points using pairs of parallel lines present in the two orthogonal planes of the platform [2]. It is shown that from 3 finite vanishing points we can estimate the focal length $f$ and the principal point $(u_0, v_0)$. The accuracy of the $u_0$ and $v_0$ coordinates improves if the principal point is computed by observing the calibration platform from multiple points of view and then averaging the results. Camera calibration can thus be performed using vanishing points, which eliminate the problem of finding correspondences, but which suffer from the possible presence of vanishing points at infinity and from the inaccuracy of their computation.

7.4.2 Estimation of the Perspective Projection Matrix

Another method to determine the parameters (intrinsic and extrinsic) of the camera is based on the estimation of the perspective projection matrix P (always assuming the pinhole projection model) defined by (7.2), which transforms, with Eq. (7.1), 3D points of the calibration platform (for example, a cube of known dimensions with faces having a checkerboard pattern also of known size), expressed in world homogeneous coordinates $\tilde{X}_w$, into image homogeneous coordinates $\tilde{u}$, the latter expressed in pixels. In particular, the coordinates $(X_i, Y_i, Z_i)$ of the corners of the board are assumed to be known, and the coordinates $(u_i, v_i)$ in pixels of the same corners projected in the image plane are automatically determined. The relation between the coordinates of the 3D points and the coordinates of their projections in the 2D image plane is given by Eq. (7.1), which rewritten in extended matrix form is

² Triangle orthocenter: located at the intersection of the altitudes of the triangle.


$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{7.29}$$

From (7.29), we can get three equations but, dividing the first two by the third equation, we have two equations, given by

$$u = \frac{p_{11}X_w + p_{12}Y_w + p_{13}Z_w + p_{14}}{p_{31}X_w + p_{32}Y_w + p_{33}Z_w + p_{34}} \qquad v = \frac{p_{21}X_w + p_{22}Y_w + p_{23}Z_w + p_{24}}{p_{31}X_w + p_{32}Y_w + p_{33}Z_w + p_{34}} \tag{7.30}$$

from which we have 12 unknowns, which are precisely the elements of the matrix P. Applying these equations to the ith 3D → 2D correspondence, relative to a corner of the cube, and placing them in linear form, we obtain

$$\begin{aligned} p_{11}X_w^{(i)} + p_{12}Y_w^{(i)} + p_{13}Z_w^{(i)} + p_{14} - p_{31}u_iX_w^{(i)} - p_{32}u_iY_w^{(i)} - p_{33}u_iZ_w^{(i)} - p_{34}u_i &= 0 \\ p_{21}X_w^{(i)} + p_{22}Y_w^{(i)} + p_{23}Z_w^{(i)} + p_{24} - p_{31}v_iX_w^{(i)} - p_{32}v_iY_w^{(i)} - p_{33}v_iZ_w^{(i)} - p_{34}v_i &= 0 \end{aligned} \tag{7.31}$$

These equations can be rewritten as a compact homogeneous linear system, considering that at least $N = 6$ points are needed to determine the 12 unknown elements of the matrix P:

$$Ap = 0 \tag{7.32}$$

where

$$A = \begin{bmatrix}
X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & 0 & 0 & 0 & 0 & -u_1X_w^{(1)} & -u_1Y_w^{(1)} & -u_1Z_w^{(1)} & -u_1 \\
0 & 0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & -v_1X_w^{(1)} & -v_1Y_w^{(1)} & -v_1Z_w^{(1)} & -v_1 \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & 0 & 0 & 0 & 0 & -u_NX_w^{(N)} & -u_NY_w^{(N)} & -u_NZ_w^{(N)} & -u_N \\
0 & 0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & -v_NX_w^{(N)} & -v_NY_w^{(N)} & -v_NZ_w^{(N)} & -v_N
\end{bmatrix} \tag{7.33}$$

and

$$p = \begin{bmatrix} p_1 & p_2 & p_3 & p_4 \end{bmatrix}^T \tag{7.34}$$

with $p_i$, $i = 1, \ldots, 4$, representing the rows of the solution matrix $p$ of system (7.32), and $0$ representing the zero vector of length $2N$. If we use only $N = 6$ points, we have a homogeneous linear system with 12 equations and 12 unknowns. The points $(X_i, Y_i, Z_i)$ are projected in the image plane in $(u_i, v_i)$ according to the pinhole projection model. From algebra, it is known that a homogeneous linear system always admits the trivial solution $p = 0$, that is, the vector $p$ lying in the null space of the matrix $A$; we are not interested in this trivial solution. Moreover, it is shown that (7.32) admits infinite nontrivial solutions if and only if the rank of $A$ is less than the number of unknowns.


In this context, the unknowns can be reduced to 11 if each element of $p$ is divided by one of the elements themselves (for example, the element $p_{34}$), thus obtaining only 11 unknowns (with $p_{34} = 1$). Therefore, $\mathrm{rank}(A) = 11$ is less than the 12 unknowns, and the homogeneous system admits nontrivial solutions. One of the possible nontrivial solutions, in the space of the homogeneous system solutions, can be determined with the SVD method, where $A = U\Sigma V^T$, and the solution is the column vector of $V$ corresponding to the smallest singular value of the matrix $A$. In fact, a solution $p$ is the eigenvector that corresponds to the smallest eigenvalue of $A^TA$, up to a proportionality factor. If we denote with $\bar{p}$ the last column vector of $V$, which up to a proportionality factor $\kappa$ is a solution $p$ of the homogeneous system, we have

$$p = \kappa\bar{p} \qquad \text{or} \qquad \bar{p} = \gamma p \tag{7.35}$$

where $\gamma = 1/\kappa$. Given a solution $\bar{p}$ of the projection matrix P, even if only up to the factor $\gamma$, we can now find an estimate of the intrinsic and extrinsic parameters using the elements of $\bar{p}$. For this purpose, two decomposition approaches of the matrix P are presented: one based on the equation of the perspective projection matrix and the other on the QR factorization.
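The following NumPy sketch, under the same assumptions as before (hypothetical input arrays `XYZ` and `uv` of correspondences), builds the matrix A of (7.33) and estimates P by SVD as described above; it is illustrative only.

```python
import numpy as np

def estimate_P_dlt(XYZ, uv):
    """Estimate the 3x4 projection matrix P from N >= 6 correspondences
    by solving A p = 0 (7.32)-(7.33) with the SVD; P is defined up to scale."""
    rows = []
    for (X, Y, Z), (u, v) in zip(XYZ, uv):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    P = Vt[-1].reshape(3, 4)                 # right singular vector of the
    return P / np.linalg.norm(P[2, :3])      # smallest singular value; see (7.37)
```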

7.4.2.1 From the Matrix P to the Intrinsic and Extrinsic Parameters

Now let us rewrite Eq. (6.212) of the perspective projection matrix in extended form, making the parameters explicit:

$$P = K\begin{bmatrix} R & T \end{bmatrix} = \begin{bmatrix} \alpha_ur_{11} + u_0r_{31} & \alpha_ur_{12} + u_0r_{32} & \alpha_ur_{13} + u_0r_{33} & \alpha_uT_x + u_0T_z \\ \alpha_vr_{21} + v_0r_{31} & \alpha_vr_{22} + v_0r_{32} & \alpha_vr_{23} + v_0r_{33} & \alpha_vT_y + v_0T_z \\ r_{31} & r_{32} & r_{33} & T_z \end{bmatrix} \tag{7.36}$$

where we recall that K is the triangular calibration matrix of the camera given by (6.208), R is the rotation matrix and T the translation vector. Before deriving the calibration parameters, considering that the solution obtained is defined up to a scale factor, it is better to normalize $\bar{p}$, also to avoid a trivial solution of the type $\bar{p} = 0$. Normalizing by setting $p_{34} = 1$ can cause singularity if the value of $p_{34}$ is close to zero. The alternative is to normalize by imposing the unit-vector constraint on $r_3^T$, that is, $\bar{p}_{31}^2 + \bar{p}_{32}^2 + \bar{p}_{33}^2 = 1$, which eliminates the singularity problem [5]. In analogy to what was done in the previous paragraph with the direct calibration method (see Eq. 7.18), we will use $\gamma$ to normalize $\bar{p}$. Observing from (7.36) that the first three elements of the third row of P correspond to the vector $r_3$ of the rotation matrix R, we have

$$\sqrt{\bar{p}_{31}^2 + \bar{p}_{32}^2 + \bar{p}_{33}^2} = |\gamma| \tag{7.37}$$

At this point, dividing the solution vector $\bar{p}$ by $|\gamma|$, it is normalized up to sign. For simplicity, we define the following intermediate vectors $q_1$, $q_2$, $q_3$, $q_4$ on the matrix solution found, as follows:

$$\bar{p} = \begin{bmatrix} \bar{p}_{11} & \bar{p}_{12} & \bar{p}_{13} & \bar{p}_{14} \\ \bar{p}_{21} & \bar{p}_{22} & \bar{p}_{23} & \bar{p}_{24} \\ \bar{p}_{31} & \bar{p}_{32} & \bar{p}_{33} & \bar{p}_{34} \end{bmatrix} = \left[\begin{array}{c|c} q_1^T & \\ q_2^T & q_4 \\ q_3^T & \end{array}\right] \tag{7.38}$$

where $q_i^T = (\bar{p}_{i1}, \bar{p}_{i2}, \bar{p}_{i3})$, $i = 1, 2, 3$, denote the first three elements of the rows of $\bar{p}$ and $q_4 = (\bar{p}_{14}, \bar{p}_{24}, \bar{p}_{34})^T$ denotes its last column.

We can now determine all the intrinsic and extrinsic parameters by comparing the elements of the projection matrix given by (7.36) with those of the estimated matrix given by (7.38), expressed through the intermediate vectors $q_i$ ($\bar{p}$ is an approximation of the matrix P). Up to sign, we can obtain $T_z$ (the term $p_{34}$ of the matrix 7.36) and the elements $r_{3i}$ of the third row of R, associating them, respectively, with the corresponding term $\bar{p}_{34}$ of (7.38) (i.e., the third element of $q_4$) and the elements $\bar{p}_{3i}$:

$$T_z = \pm\bar{p}_{34} \qquad r_{3i} = \pm\bar{p}_{3i}, \quad i = 1, 2, 3 \tag{7.39}$$

The sign is determined by examining the equality $T_z = \pm\bar{p}_{34}$ and knowing whether the origin of the world reference system is positioned in front of or behind the camera. If in front, $T_z > 0$ and the sign must agree with that of $\bar{p}_{34}$; otherwise $T_z < 0$ and the sign must be opposite to that of $\bar{p}_{34}$. According to the properties of the rotation matrix (see Note 1), the other parameters are calculated as follows:

$$u_0 = q_1^Tq_3 \qquad v_0 = q_2^Tq_3 \qquad (\text{inner products}) \tag{7.40}$$

$$\alpha_u = \sqrt{q_1^Tq_1 - u_0^2} \qquad \alpha_v = \sqrt{q_2^Tq_2 - v_0^2} \tag{7.41}$$

and assuming the positive sign (with $T_z > 0$) we have

$$r_{1i} = \frac{\bar{p}_{1i} - u_0\bar{p}_{3i}}{\alpha_u} \qquad r_{2i} = \frac{\bar{p}_{2i} - v_0\bar{p}_{3i}}{\alpha_v} \qquad T_x = \frac{\bar{p}_{14} - u_0T_z}{\alpha_u} \qquad T_y = \frac{\bar{p}_{24} - v_0T_z}{\alpha_v} \tag{7.42}$$

where i = 1, 2, 3. It should be noted that the parameter s of skew has not been determined assuming the rectangularity of the sensor pixel area that can be determined with other nonlinear methods. The accuracy level of P calculated depends very much on the noise present on the starting data, that is, on the projections (X i , Yi , Z i ) → (u i , vi ). This can be verified by ensuring that the rotation matrix R maintains the orthogonality constraint with the det (R) = 1. An alternative approach, to recover the intrinsic and extrinsic parameters from the estimated projection matrix P, is based on its decomposition into two submatrices B and b, where the first is obtained by considering the first 3 × 3 elements of P,


while the second represents the last column of P. Therefore, we have the following partition:

$$P \equiv \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} = \begin{bmatrix} B & b \end{bmatrix} \tag{7.43}$$

From (7.2), we have P decomposed in terms of the intrinsic parameters K and the extrinsic parameters R and T:

$$P = K\begin{bmatrix} R & T \end{bmatrix} = K\begin{bmatrix} R & -RC \end{bmatrix} \tag{7.44}$$

so, by virtue of (7.43), we can write the following decompositions:

$$B = KR \qquad b = KT \tag{7.45}$$

It is pointed out that $T = -RC$ expresses the translation vector from the origin of the world reference system to the camera system, that is, the position of the origin of the world system in camera coordinates. According to the decomposition (7.43), considering the first equation of (7.45) and the intrinsic parameters defined with the matrix K (see Eq. 6.208), we have

$$B' \equiv BB^T = K\underbrace{RR^T}_{I}K^T = KK^T = \begin{bmatrix} \alpha_u^2 + s^2 + u_0^2 & s\alpha_v + u_0v_0 & u_0 \\ s\alpha_v + u_0v_0 & \alpha_v^2 + v_0^2 & v_0 \\ u_0 & v_0 & 1 \end{bmatrix} \tag{7.46}$$

where $RR^T = I$ is motivated by the orthogonality of the rotation matrix R. We also know that P is defined up to a scale factor and its last element is not normally 1. It is therefore necessary to normalize the matrix $B'$, obtained with (7.46), so that its last element $B'_{33}$ is 1. Next, we can derive the intrinsic parameters as follows:

$$u_0 = B'_{13} \qquad v_0 = B'_{23} \tag{7.47}$$

$$\alpha_v = \sqrt{B'_{22} - v_0^2} \qquad s = \frac{B'_{12} - u_0v_0}{\alpha_v} \tag{7.48}$$

$$\alpha_u = \sqrt{B'_{11} - u_0^2 - s^2} \tag{7.49}$$

The previous assignments are valid if $\alpha_u > 0$ and $\alpha_v > 0$. Once the intrinsic parameters are known (i.e., the matrix K obtained with the previous assignments), the extrinsic parameters, that is, the rotation matrix R and the translation vector T, can be calculated with Eq. (7.45) as follows:

$$R = K^{-1}B \qquad T = K^{-1}b \tag{7.50}$$


7.4.2.2 Intrinsic and Extrinsic Parameters with the QR Decomposition of the Matrix P

From linear algebra, we define the QR decomposition, or QR factorization, of a square matrix A as the product $A = QR$, where Q is an orthogonal matrix (i.e., its columns are orthogonal unit vectors, so that $Q^TQ = I$) and R is an upper triangular matrix [8]. If A is invertible, the decomposition is unique if the diagonal elements of R are positive. To avoid confusion, we point out immediately that the matrix R indicated in the QR decomposition definition is not the rotation matrix R considered so far in the text. In this context, we want to use the QR decomposition to find K, R and T from the projection matrix P given by Eq. (7.44). For this purpose, consider the matrix B defined in (7.43), which is the $3 \times 3$ submatrix in the upper left of P and which, by (7.45), we know corresponds to $B = KR$. Executing the QR decomposition on the submatrix $B^{-1}$ gives

$$B^{-1} = QL \tag{7.51}$$

where by definition Q is the orthogonal matrix and L the upper triangular matrix. From (7.51), it is immediate to derive:

$$B = L^{-1}Q^{-1} = L^{-1}Q^T \tag{7.52}$$

which shows that $K = L^{-1}$ and $R = Q^T$. After the decomposition, the matrix K might not have its last element $K_{33} = 1$, as expected from Eq. (6.208). Therefore, it is necessary to normalize K (and also P) by dividing all the elements by $K_{33}$. Note that this does not change the validity of the calibration parameters, because the scale is arbitrary. Subsequently, with the second equation of (7.45) we calculate the translation vector T, given by

$$T = K^{-1}b \tag{7.53}$$

remembering that b is the last column of the matrix P. Alternatively, the RQ decomposition could be used, which decomposes a matrix A into the product $A = RQ$, where R is an upper triangular matrix and Q is an orthogonal matrix. The difference with respect to the QR decomposition consists only in the order of the two factors.
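A possible numerical transcription of this decomposition, using SciPy's RQ factorization applied directly to B (equivalent, up to the conventions noted in the comments, to the QR factorization of $B^{-1}$ described above), might look as follows. The sign-fixing steps are common conventions, not prescribed by the text.

```python
import numpy as np
from scipy.linalg import rq

def decompose_P(P):
    """Recover K, R, T from P = K [R | T] via an RQ factorization of the
    left 3x3 block B (see Eqs. (7.43), (7.51)-(7.53))."""
    B, b = P[:, :3], P[:, 3]
    K, R = rq(B)                         # B = K @ R, K upper triangular
    S = np.diag(np.sign(np.diag(K)))     # force a positive diagonal of K
    K, R = K @ S, S @ R                  # S @ S = I, so K @ R is unchanged
    if np.linalg.det(R) < 0:             # resolve the overall sign ambiguity of P
        R, b = -R, -b
    T = np.linalg.solve(K, b)            # T = K^{-1} b, Eq. (7.53)
    return K / K[2, 2], R, T             # normalize so that K[2,2] = 1
```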

7.4.2.3 Nonlinear Estimation of the Perspective Projection Matrix

In the previous section, we estimated the calibration matrix P using the least squares approach, which in fact minimizes an algebraic distance $\|Ap\|_2^2$ subject to the constraint $\|p\|_2 = 1$, where we recall that p is the column vector collecting the 12 elements of P. Basically, it does not take into account the physical meaning of the camera calibration parameters. An approach closer to the physical reality of the parameters is based on the maximum likelihood estimation, which directly minimizes an error function obtained as the sum of the distances between points in the image plane of


which the position $(u_i, v_i)$ is known and their position predicted by the perspective projection equation (7.30). This error function is given by

$$\min_{p}\;\sum_{i=1}^{N}\left[\left(u_i - \frac{p_{11}X_w^{(i)} + p_{12}Y_w^{(i)} + p_{13}Z_w^{(i)} + p_{14}}{p_{31}X_w^{(i)} + p_{32}Y_w^{(i)} + p_{33}Z_w^{(i)} + p_{34}}\right)^2 + \left(v_i - \frac{p_{21}X_w^{(i)} + p_{22}Y_w^{(i)} + p_{23}Z_w^{(i)} + p_{24}}{p_{31}X_w^{(i)} + p_{32}Y_w^{(i)} + p_{33}Z_w^{(i)} + p_{34}}\right)^2\right] \tag{7.54}$$

where N is the number of correspondences $(X_w, Y_w, Z_w) \rightarrow (u, v)$, which are assumed to be affected by independent and identically distributed (iid) random noise. The error function (7.54) is nonlinear and can be minimized using the Levenberg–Marquardt minimization algorithm, starting from the initial values of p calculated with the linear least squares approach described in Sect. 7.4.2.
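A minimal sketch of this refinement with SciPy's Levenberg–Marquardt solver could be the following; `P0`, `XYZ` and `uv` are hypothetical inputs (the linear estimate and the correspondences), and the residual function implements the reprojection error of (7.54).

```python
import numpy as np
from scipy.optimize import least_squares

def refine_P(P0, XYZ, uv):
    """Refine P by minimizing the reprojection error (7.54) with
    Levenberg-Marquardt, starting from the linear estimate P0."""
    XYZ1 = np.hstack([XYZ, np.ones((len(XYZ), 1))])      # homogeneous 3D points

    def residuals(p):
        proj = XYZ1 @ p.reshape(3, 4).T                  # projected points (N, 3)
        return (proj[:, :2] / proj[:, 2:3] - uv).ravel() # image-plane residuals

    sol = least_squares(residuals, P0.ravel(), method='lm')
    P = sol.x.reshape(3, 4)
    return P / np.linalg.norm(P[2, :3])
```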

7.4.3 Zhang Method

This method [6,9] uses a planar calibration platform, that is, a flat chessboard (see Fig. 7.2a), observed from different points of view or, keeping the camera position fixed, by changing the position and attitude of the chessboard. The 3D points of the chessboard, whose geometry is known, are automatically localized in the image plane (with known corner detector algorithms, for example that of Harris), and the correspondences $(X_w, Y_w, 0) \rightarrow (u, v)$ are detected. Without loss of generality, the world 3D reference system is chosen so that the chessboard plane is at $Z_w = 0$. Therefore, all the 3D points lying in the chessboard plane have third coordinate $Z_w = 0$. If we denote by $r_i$ the columns of the rotation matrix R, we can rewrite the projection relation (7.1) of the correspondences in the form:

$$\tilde{u} = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\begin{bmatrix} r_1 & r_2 & r_3 & T \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ 0 \\ 1 \end{bmatrix} = \underbrace{K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix}}_{\text{homography } H}\begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} = H\tilde{X}_w \tag{7.55}$$

from which it emerges that the third column of R (matrix of the extrinsic parameters) is eliminated, and the homogeneous coordinates $\tilde{u}$ in the image plane and the corresponding 2D coordinates $\tilde{X}_w = (X_w, Y_w, 1)$ on the chessboard plane are related by the homography matrix H of size $3 \times 3$, up to a scale factor $\lambda$:

$$\lambda\tilde{u} = H\tilde{X}_w \tag{7.56}$$

with

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix} \tag{7.57}$$

Equation. (7.56) represents the homography transformation already introduced in Sect. 3.5 Vol.II, also known as projective transformation or geometric collineation between planes, while the homography matrix H can be considered the equivalent of the perspective projection matrix P, but valid for planar objects.


7.4.3.1 Calculation of the Homography Matrix

Now let us rewrite Eq. (7.56) of the homography transformation in extended matrix form:

$$\lambda\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} \tag{7.58}$$

From (7.58), we can get three equations but, dividing the first two by the third, we have two nonlinear equations in the 9 unknowns, which are precisely the elements of the homography matrix H, given by³:

$$u = \frac{h_{11}X_w + h_{12}Y_w + h_{13}}{h_{31}X_w + h_{32}Y_w + h_{33}} \qquad v = \frac{h_{21}X_w + h_{22}Y_w + h_{23}}{h_{31}X_w + h_{32}Y_w + h_{33}} \tag{7.59}$$

with $(u, v)$ expressed in nonhomogeneous coordinates. By applying these last equations to the ith correspondence $(X_w^{(i)}, Y_w^{(i)}, 0) \rightarrow (u_i, v_i)$, related to the corners of the chessboard, we can rewrite them in linear form, as follows:

$$\begin{aligned} h_{11}X_w^{(i)} + h_{12}Y_w^{(i)} + h_{13} - h_{31}u_iX_w^{(i)} - h_{32}u_iY_w^{(i)} - h_{33}u_i &= 0 \\ h_{21}X_w^{(i)} + h_{22}Y_w^{(i)} + h_{23} - h_{31}v_iX_w^{(i)} - h_{32}v_iY_w^{(i)} - h_{33}v_i &= 0 \end{aligned} \tag{7.60}$$

These linear constraints, applied to at least $N = 4$ corresponding points, generate a homogeneous linear system of 2N equations in the 9 unknown elements of the homography matrix H, which is defined up to a scale factor, so that the independent parameters are only 8. We thus obtain the following homogeneous linear system:

$$AH = 0 \tag{7.61}$$

where the matrix A of size $2N \times 9$ is:

$$A = \begin{bmatrix}
X_w^{(1)} & Y_w^{(1)} & 1 & 0 & 0 & 0 & -u_1X_w^{(1)} & -u_1Y_w^{(1)} & -u_1 \\
0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & 1 & -v_1X_w^{(1)} & -v_1Y_w^{(1)} & -v_1 \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
X_w^{(N)} & Y_w^{(N)} & 1 & 0 & 0 & 0 & -u_NX_w^{(N)} & -u_NY_w^{(N)} & -u_N \\
0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & 1 & -v_NX_w^{(N)} & -v_NY_w^{(N)} & -v_N
\end{bmatrix} \tag{7.62}$$

and

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix}^T \tag{7.63}$$

3 The coordinates (u, v) of the points in the image plane in Eq. (7.58) are expressed in homogeneous coordinates while in the nonlinear equations (7.59) are not homogeneous (u/λ, v/λ) but for simplicity they remain with the same notation. Once calculated H, from the third equation obtainable from (7.58), we can determine λ = h 31 u + h 32 v + 1.


with $h_i$, $i = 1, 2, 3$, representing the rows of the solution matrix H of the system (7.61), and 0 the zero vector of length 2N. The homogeneous system (7.61) can be solved with the SVD approach, which decomposes the data matrix (known given at least N correspondences) into the product of three matrices, $A = U\Sigma V^T$; the solution is the column vector of V corresponding to the zero singular value of the matrix A. In reality, a solution H is the eigenvector that corresponds to the smallest eigenvalue of $A^TA$, up to a proportionality factor. Therefore, if we denote by $\bar{h}$ the last column vector of V, it can be a solution of the homogeneous system up to a factor of proportionality. If the coordinates of the corresponding points are exact, the homography transformation is free of errors and the singular value found is zero. Normally this does not happen because of the noise present in the data, especially for overdetermined systems with $N > 4$; in this case the singular value chosen is always the smallest, giving the optimal solution of system (7.61) in the least squares sense (i.e., $\|A \cdot H\|^2 \rightarrow$ minimum). The system (7.58) can also be solved with the constraint on the scale factor $h_{33} = 1$; in this case the unknowns are 8 and we have an ordinary (nonhomogeneous) linear system expressed in the form:

$$AH = b \tag{7.64}$$

where the matrix A of size $2N \times 8$ is

$$A = \begin{bmatrix}
X_w^{(1)} & Y_w^{(1)} & 1 & 0 & 0 & 0 & -u_1X_w^{(1)} & -u_1Y_w^{(1)} \\
0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & 1 & -v_1X_w^{(1)} & -v_1Y_w^{(1)} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
X_w^{(N)} & Y_w^{(N)} & 1 & 0 & 0 & 0 & -u_NX_w^{(N)} & -u_NY_w^{(N)} \\
0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & 1 & -v_NX_w^{(N)} & -v_NY_w^{(N)}
\end{bmatrix} \tag{7.65}$$

with the unknown vector:

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{21} & h_{22} & h_{23} & h_{31} & h_{32} \end{bmatrix}^T \tag{7.66}$$

and

$$b = \begin{bmatrix} u_1 & v_1 & \cdots & u_N & v_N \end{bmatrix}^T \tag{7.67}$$

where at least 4 corresponding points are always required to determine the homography matrix H. The accuracy of the homography transformation could improve with N > 4. In the latter case, we would have an overdetermined system solvable with the least squares approach (i.e., minimizing  A · H − b 2 ) or with the method of the pseudo-inverse. The computation of H done with the preceding linear systems minimizes an algebraic error [10], and therefore not associable with a physical concept of geometric distance. In the presence of errors in the coordinate measurements of the image points of correspondences (X w , Yw , 0) → (u, v), assuming affected by Gaussian noise, the


optimal estimate of H is the maximum likelihood estimate, which minimizes the distance between the measured image points and their position predicted by the homography transformation (7.58). Therefore, the geometric error function to be minimized is given by

$$\min_{H}\;\sum_{i=1}^{N}\|\tilde{u}_i - H\tilde{X}_i\|^2 \tag{7.68}$$

where N is the number of matches, $\tilde{u}_i$ are the points in the image plane affected by noise, and $\tilde{X}_i$ are the points on the calibration chessboard, assumed accurate. This function is nonlinear and can be minimized with an iterative method such as Levenberg–Marquardt.
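As an illustration, the DLT estimate of H described above can be sketched as follows (NumPy, with hypothetical inputs `XY` and `uv`). In practice the point coordinates are usually normalized beforehand to improve numerical conditioning, a step not discussed in this section.

```python
import numpy as np

def estimate_H_dlt(XY, uv):
    """Estimate the homography H from N >= 4 planar correspondences by
    solving the homogeneous system (7.60)-(7.61) with the SVD."""
    rows = []
    for (X, Y), (u, v) in zip(XY, uv):
        rows.append([X, Y, 1, 0, 0, 0, -u*X, -u*Y, -u])
        rows.append([0, 0, 0, X, Y, 1, -v*X, -v*Y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                 # fix the arbitrary scale (h33 = 1)
```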

7.4.3.2 From Homography Matrices to Intrinsic Parameters In the previous section, we calculated the homography matrix H assuming that (i) the N checkerboard calibration points X˜ w lie in the plane X w − Yw of the world reference system and at the distance of the plane Z w = 0 with the projection model given by (7.56) and the matrix H defined by (7.57). With the assumption Z w = 0 it follows that the M chessboard observations are made by moving the camera. This implies that we will have projections in the image plane with different coordinates in the different views of the corresponding 3D points and this is not a problem for the calibration because in the homography transformation they concern the relative coordinates in the image plane, that is, the homography transform is independent of the projection reference system. At this point, we assume to have calculated the homography matrices H j , j = 1, . . . , M independently for the corresponding M observations of the calibration chessboard. In analogy to the perspective projection matrix P also, the homography matrices capture the information associated to the camera intrinsic parameters and the extrinsic ones that vary for each observation. Given the set of computed homography matrices, let us now see how to derive the intrinsic parameters from these. For each observed calibration image, we rewrite Eq. (7.57) which relates the homography matrix H to the intrinsic parameters K (see as the matrix of inner camera transformation) and extrinsic parameters R, T (the view-related external transformation), given by



$$H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \lambda K\begin{bmatrix} r_1 & r_2 & T \end{bmatrix} \tag{7.69}$$

where $\lambda$ is an arbitrary nonzero scale factor. From (7.69), we can get the relations that link the column vectors of R as follows:

$$h_1 = \lambda Kr_1 \tag{7.70}$$

$$h_2 = \lambda Kr_2 \tag{7.71}$$


from which, ignoring the factor $\lambda$ (not useful in this context), we have

$$r_1 = K^{-1}h_1 \tag{7.72}$$

$$r_2 = K^{-1}h_2 \tag{7.73}$$

We know that the column vectors $r_1$ and $r_2$ are orthonormal by virtue of the properties of the rotation matrix R (see Note 1); applying this to the previous equations, we get

$$h_1^TK^{-T}K^{-1}h_2 = 0 \qquad (\text{from } r_1^Tr_2 = 0) \tag{7.74}$$

$$h_1^TK^{-T}K^{-1}h_1 = h_2^TK^{-T}K^{-1}h_2 \qquad (\text{from } r_1^Tr_1 = r_2^Tr_2 = 1) \tag{7.75}$$

which are the two relations that constrain the unknown intrinsic parameters associated with a homography. We now observe that the matrix of the unknown intrinsic parameters K is upper triangular, and Zhang defines a new matrix B, according to the last two constraint equations found, given by

$$B = K^{-T}K^{-1} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix} \tag{7.76}$$

Considering the upper triangularity of K, the matrix B is symmetric, with only 6 unknowns that we group into a vector b with 6 elements:

$$b = \begin{bmatrix} b_{11} & b_{12} & b_{22} & b_{13} & b_{23} & b_{33} \end{bmatrix}^T \tag{7.77}$$

Substituting in (7.76) the elements of K defined by (6.208), we obtain the matrix B made explicit in terms of the intrinsic parameters, as follows:

$$B = K^{-T}K^{-1} = \begin{bmatrix}
\dfrac{1}{\alpha_u^2} & -\dfrac{s}{\alpha_u^2\alpha_v} & \dfrac{v_0s - u_0\alpha_v}{\alpha_u^2\alpha_v} \\[2mm]
-\dfrac{s}{\alpha_u^2\alpha_v} & \dfrac{s^2}{\alpha_u^2\alpha_v^2} + \dfrac{1}{\alpha_v^2} & -\dfrac{s(v_0s - u_0\alpha_v)}{\alpha_u^2\alpha_v^2} - \dfrac{v_0}{\alpha_v^2} \\[2mm]
\dfrac{v_0s - u_0\alpha_v}{\alpha_u^2\alpha_v} & -\dfrac{s(v_0s - u_0\alpha_v)}{\alpha_u^2\alpha_v^2} - \dfrac{v_0}{\alpha_v^2} & \dfrac{(v_0s - u_0\alpha_v)^2}{\alpha_u^2\alpha_v^2} + \dfrac{v_0^2}{\alpha_v^2} + 1
\end{bmatrix} \tag{7.78}$$

Now let’s rewrite the constraint equations (7.74) and (7.75), considering the matrix B defined by (7.76), given by

h1T

Bh1 −

h1T Bh2 = 0

(7.79)

Bh2 = 0

(7.80)

h2T

From these constraint equations, we can derive a relation that links the column vectors $h_i = (h_{i1}, h_{i2}, h_{i3})^T$, $i = 1, 2, 3$, of the homography matrix H, given by (7.69), with


the unknown vector b, given by (7.77), obtaining

$$h_i^TBh_j = \begin{bmatrix} h_{i1}h_{j1} \\ h_{i1}h_{j2} + h_{i2}h_{j1} \\ h_{i2}h_{j2} \\ h_{i3}h_{j1} + h_{i1}h_{j3} \\ h_{i3}h_{j2} + h_{i2}h_{j3} \\ h_{i3}h_{j3} \end{bmatrix}^T\begin{bmatrix} b_{11} \\ b_{12} \\ b_{22} \\ b_{13} \\ b_{23} \\ b_{33} \end{bmatrix} = v_{ij}^Tb \tag{7.81}$$

where $v_{ij}$ is the vector built from the computed homography H. Considering both constraint equations (7.74) and (7.75), we can rewrite them in the form of a homogeneous system of two equations, as follows:

$$\begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix}b = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{7.82}$$

where b is the unknown vector. The information on the intrinsic parameters of the camera is captured by the M observed images of the chessboard, for which we have independently estimated the relative homographies $H_k$, $k = 1, \ldots, M$. Each homography generates the 2 equations (7.82) and, since the unknown vector b has 6 elements, at least 3 different homography projections of the chessboard are necessary ($M = 3$). Assembling the equations (7.82) into a homogeneous linear system of 2M equations, we obtain the following system:

$$Vb = 0 \tag{7.83}$$

where V is the matrix of the known homographies, of size $2M \times 6$, and b is the unknown vector. Also in this case we have an overdetermined system with $M \geq 3$, which can be solved with the SVD method, obtaining a solution b up to a scale factor. Recall that the estimated solution of the system (7.83) corresponds to the right singular vector associated with the smallest singular value of V, that is, the minimum of the diagonal elements of $\Sigma$ in the decomposition $V = U\Sigma V^T$. Once the vector b (and therefore the matrix B) has been calculated, it is possible to derive in closed form [6] the intrinsic parameters, that is, the matrix K. In fact, from (7.76) we have the relation that binds the matrices B and K up to the unknown scale factor $\lambda$ ($B = \lambda K^{-T}K^{-1}$). The intrinsic parameters obtainable in closed form, proposed by Zhang, are

$$\begin{aligned}
v_0 &= (b_{12}b_{13} - b_{11}b_{23})/(b_{11}b_{22} - b_{12}^2) \\
\lambda &= b_{33} - [b_{13}^2 + v_0(b_{12}b_{13} - b_{11}b_{23})]/b_{11} \\
\alpha_u &= \sqrt{\lambda/b_{11}} \qquad \alpha_v = \sqrt{\lambda b_{11}/(b_{11}b_{22} - b_{12}^2)} \\
s &= -b_{12}\alpha_u^2\alpha_v/\lambda \qquad u_0 = sv_0/\alpha_v - b_{13}\alpha_u^2/\lambda
\end{aligned} \tag{7.84}$$

The calculation of K from B could also be formulated with the Cholesky factorization [8].
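A compact sketch of this closed-form step, assuming a list `Hs` of $3 \times 3$ homographies already estimated as in Sect. 7.4.3.1, is given below; it follows Eqs. (7.81)–(7.84) and is only an illustrative outline, not a reference implementation.

```python
import numpy as np

def _v(H, i, j):
    """Vector v_ij of Eq. (7.81), built from columns i and j of H."""
    a, b = H[:, i], H[:, j]
    return np.array([a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[1]*b[1],
                     a[2]*b[0] + a[0]*b[2], a[2]*b[1] + a[1]*b[2], a[2]*b[2]])

def intrinsics_from_homographies(Hs):
    """Solve V b = 0 (7.82)-(7.83) by SVD and apply the closed form (7.84)."""
    V = []
    for H in Hs:
        V.append(_v(H, 0, 1))
        V.append(_v(H, 0, 0) - _v(H, 1, 1))
    _, _, Vt = np.linalg.svd(np.asarray(V))
    b11, b12, b22, b13, b23, b33 = Vt[-1]              # b, up to scale
    v0 = (b12*b13 - b11*b23) / (b11*b22 - b12**2)
    lam = b33 - (b13**2 + v0*(b12*b13 - b11*b23)) / b11
    au = np.sqrt(lam / b11)
    av = np.sqrt(lam*b11 / (b11*b22 - b12**2))
    s = -b12 * au**2 * av / lam
    u0 = s*v0/av - b13*au**2/lam
    return np.array([[au, s, u0], [0.0, av, v0], [0.0, 0.0, 1.0]])
```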


7.4.3.3 Estimation of Extrinsic Parameters

Once the intrinsic parameters matrix K has been calculated, we can estimate the extrinsic parameters R and T for each observation k of the chessboard using the relative homography $H_k$. In fact, according to (7.57) we get

$$r_1 = \lambda K^{-1}h_1 \qquad r_2 = \lambda K^{-1}h_2 \qquad T = \lambda K^{-1}h_3 \tag{7.85}$$

where, remembering that the vectors of the rotation matrix have unit norm, the scale factor is

$$\lambda = \frac{1}{\|K^{-1}h_1\|} = \frac{1}{\|K^{-1}h_2\|} \tag{7.86}$$

Finally, for the orthonormality of R we have

$$r_3 = r_1 \times r_2 \tag{7.87}$$

The extrinsic parameters are different for each homography because the points of view of the calibration chessboard are different. The rotation matrix R may not numerically satisfy the orthogonality properties of a rotation matrix, due to the noise of the correspondences. In [8,9], there are techniques to approximate the calculated R with a true rotation matrix. One technique is based on the SVD decomposition, imposing the orthogonality $RR^T = I$ by forcing the matrix $\Sigma$ to the identity matrix:

$$\bar{R} = U\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}V^T \tag{7.88}$$
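The per-view extrinsic computation of Eqs. (7.85)–(7.88) can be sketched as follows (NumPy; `K` and `H` are the intrinsic matrix and one estimated homography, passed in as hypothetical arguments).

```python
import numpy as np

def extrinsics_from_homography(K, H):
    """Per-view extrinsics from Eqs. (7.85)-(7.88): r1, r2, T from K^{-1}H,
    r3 = r1 x r2, then the nearest true rotation via the SVD."""
    Kinv = np.linalg.inv(K)
    lam = 1.0 / np.linalg.norm(Kinv @ H[:, 0])      # scale factor, Eq. (7.86)
    r1, r2 = lam * Kinv @ H[:, 0], lam * Kinv @ H[:, 1]
    T = lam * Kinv @ H[:, 2]
    R = np.column_stack([r1, r2, np.cross(r1, r2)])  # Eq. (7.87)
    U, _, Vt = np.linalg.svd(R)                      # Eq. (7.88): force Sigma = I
    return U @ Vt, T
```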

7.4.3.4 Estimation of Radial Distortions

The homography projections considered so far are assumed to follow the pinhole projection model. We know instead that the optical system introduces radial and tangential distortions, altering the position of the 3D points projected in the image plane (see Sect. 7.2). Now let us see how to consider the effects of the radial distortions only (which are the most relevant) on the M observed homography projections. Having estimated in the preceding paragraphs the intrinsic parameters matrix K for each projection, with the homography transformation (7.56) we obtained the ideal projection of the points in the image plane, $u = (u, v)$, while their observations in the image plane are indicated with $\bar{u} = (\bar{u}, \bar{v})$ and are assumed to be influenced by radial distortions, thus giving a distortion of the real coordinates of each point in the image plane equal to $(\bar{u} - u)$. This radial distortion is modeled by Eqs. (7.6) and (7.7) which, adapted to this context and rewritten in vector form for each point of the observed images, give

$$(u - u_0)\cdot D(r, k) = \bar{u} - u \tag{7.89}$$


where we recall that k is the vector of the coefficients of the nonlinear radial distortion function $D(r, k)$, and $r = \|x - x_0\| = \|x\|$ is the distance of the point x, associated with the projected point u, from the principal point $x_0 = (0, 0)$; it is computed not in pixels but in normalized image coordinates. In this equation, knowing the ideal coordinates of the projected points u and the observed distorted ones $\bar{u}$, the unknown to be determined is the vector $k = (k_1, k_2)$ (approximating the nonlinear distortion function with only 2 coefficients). Rewritten in matrix form, for each point of each observed image two equations are obtained:

$$\begin{bmatrix} (u - u_0)\cdot r^2 & (u - u_0)\cdot r^4 \\ (v - v_0)\cdot r^2 & (v - v_0)\cdot r^4 \end{bmatrix}\begin{bmatrix} k_1 \\ k_2 \end{bmatrix} = \begin{bmatrix} \bar{u} - u \\ \bar{v} - v \end{bmatrix} \tag{7.90}$$

With these equations, it is possible to set up a system of linear equations by assembling all the equations associated with the N points of the M observed images, obtaining a system of 2MN equations, which in compact matrix form is

$$Dk = d \tag{7.91}$$

The estimate of the vector $k = (k_1, k_2)$, the solution of this overdetermined system, can be determined with the least squares approach using the Moore–Penrose pseudo-inverse, for which $k = (D^TD)^{-1}D^Td$, or with SVD or QR factorization methods.
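A possible least squares sketch of Eqs. (7.90)–(7.91) is shown below; the way the ideal points, the observed points and the squared normalized radii are passed in is an assumption made only for illustration.

```python
import numpy as np

def estimate_radial_k(uv_ideal, uv_obs, principal_point, r2):
    """Least-squares estimate of k = (k1, k2) from Eqs. (7.90)-(7.91).
    uv_ideal, uv_obs: (N, 2) ideal and observed pixel coordinates;
    principal_point: (u0, v0); r2: (N,) squared normalized radii."""
    u0, v0 = principal_point
    du, dv = uv_ideal[:, 0] - u0, uv_ideal[:, 1] - v0
    D = np.vstack([np.column_stack([du*r2, du*r2**2]),
                   np.column_stack([dv*r2, dv*r2**2])])
    d = np.concatenate([uv_obs[:, 0] - uv_ideal[:, 0],
                        uv_obs[:, 1] - uv_ideal[:, 1]])
    k, *_ = np.linalg.lstsq(D, d, rcond=None)      # solves D k = d
    return k
```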

7.4.3.5 Nonlinear Optimization of Calibration Parameters and Radial Distortion

The approaches used to estimate the calibration parameters (intrinsic and extrinsic) and the radial distortions are based on the minimization of an algebraic distance, already noted to have no physical significance. The noise present in the homography projections of the N chessboard 3D points in the M images affects the homography transformation (7.56) and the radial distortion (7.89). If we indicate with $\bar{u}_{ij}$ the observed image points and with $u_{ij}(K, k, R_i, T_i, X_j)$ the projection of a point $X_j$ in the ith homography image, we can refine all the previously obtained parameters through the maximum likelihood estimation (MLE), minimizing the geometric error function

$$\sum_{j=1}^{N}\sum_{i=1}^{M}\|\bar{u}_{ij} - u_{ij}(K, k, R_i, T_i, X_j)\|^2 \tag{7.92}$$

This nonlinear least squares minimization problem can be solved by iterative methods such as the Levenberg–Marquardt algorithm. The iterative process is useful to start with the estimates of the intrinsic parameters obtained in Sect. 7.4.3.2 and with the extrinsic parameters obtained in Sect. 7.4.3.3. The initial parameters of the radial distortion coefficients can be with an initial zero value or with those estimated in the previous paragraph. It should be noted that the rotation matrix R has 9 elements


despite having 3 degrees of freedom (that is, the three angles of rotation around the 3D axes). The Euler–Rodrigues method [11,12] is used in [6] to express a 3D rotation with only 3 parameters.

7.4.3.6 Summary of Zhang’s Autocalibration Method This camera calibration method uses as a calibration platform a plane object, for example, a chessboard, whose geometry and the positions of the identified patterns (at least four) are observed from at least two different points of view. The essential steps of the calibration procedure are 1. Acquisition of M images of the calibration platform observed from different points of view: moving the camera with respect to the platform or vice versa or moving both. 2. For each of the M images, N patterns are detected whose correct correspondence is known with the associated points on the chessboard (3D points). For each 3D point on the chessboard the coordinates are known X = (X, Y, 0) (points lying in the same plane Z = 0). Of these points, are known the 2D coordinates u¯ = (u, ¯ v¯ ) (expressed in pixels) corresponding to each homography image. 3. Knowns the correspondences 3D ↔ 2D, the M relative homographies are calculated for each of the M images. The homographs {H 1 , H 2 , . . . , H M } are independent of the coordinate reference system of the 2D and 3D points. 4. Knowing the homographies {H 1 , H 2 , . . . , H M }, the intrinsic parameters of the camera are estimated, that is, the 5 elements of the intrinsic parameters matrix K , using the linear closed-form solution. In this context, the radial distortion introduced by the optical system in the homography projection of the calibration points is ignored. If the homography images are at least 3, K is determined as the only solution to less than an indeterminate scale factor. If you have multiple homography images the estimated elements of K are more accurate. The camera calibration procedure can take on zero the sensor skew parameter (s = 0 considering the good level of accuracy of the current sensors available) and in this case, the number of homography images can be only two. Once the homographies are known, for each 3D point on the chessboard we can calculate its homography projection (according to the pinhole model) and obtain the ideal coordinates u = (u, v) generally different from those observed u¯ = (u, ¯ v¯ ). In fact, the latter are affected by noise, partly due to the uncertainty of the algorithms that detect the points in the image plane (inaccurate measurements of the coordinates of the 2D points detected in each homography image) and also caused by the optical system. 5. Once the intrinsic parameters are known, it is possible to derive the extrinsic parameters, that is the rotation matrix R and the vector translation T related to the M views thus obtaining the corresponding attitude of the camera. 6. Estimate of the vector k relative to the coefficients of the nonlinear function that models the radial distortion introduced by the optical system.


7. Refining the accuracy of intrinsic and extrinsic parameters and radial distortion coefficients initially estimated with least squares methods. Basically, starting from these initial parameters and coefficients, a nonlinear optimization procedure based on the maximum likelihood estimation (MLE) is applied globally to all the parameters related to the M homography images and the N points observed. The camera calibration results, for the different methodologies used, are mainly influenced by the level of accuracy of the 3D calibration patterns (referenced with respect to the world reference system) and the corresponding 2D (referenced in the reference system of the image plane). The latter are dependent on the automatic pattern detection algorithms in the acquired images of the calibration platform. Another important aspect concerns how the pattern localization error is propagated to determine the camera calibration parameters. In general, the various calibration methods, at least theoretically, should produce the same results, but in reality then differ in the solutions adopted to minimize pattern localization and optical system errors. The propagation of errors is highlighted in particular when the configuration of the calibration images is modified (for example, the focal length varies), while the experimental configuration remains intact. In this situation, the extrinsic parameters do not remain stable. Similarly, the instability of the intrinsic ones occurs when the experimental configuration remains the same while only the translation of the calibration patterns varies.

7.4.4 Stereo Camera Calibration In the previous paragraphs, we have described the methods for calibrating a single camera, that is, we have defined what the characteristic parameters are, how to determine them with respect to the known 3D points of the scene assuming the pinhole projection model. In particular, the following parameters have been described: the intrinsic parameters that characterize the optical-sensor components defining the camera intrinsic matrix K , and the extrinsic parameters, defining the rotation matrix R and the translation vector T with respect to an arbitrary reference system of the world that characterize the attitude of the camera with respect to the 3D scene. In the stereo system (with at least two cameras), always considering the pinhole projection model, a 3D light spot of the scene is seen (projected) simultaneously in the image plane of the two cameras. While with the monocular vision the 2D projection of a 3D point defines only the ray that passes through the optical center and the 2D intersection point with the image plane, in the stereo vision the 3D point is uniquely determined by the intersection of the homologous rays that generate their 2D projections on the corresponding image planes of the left and right camera (see Fig. 7.4). Therefore, once the calibration parameters of the individual cameras are known, it is possible to characterize and determine the calibration parameters of a stereo system and establish a unique relationship between a 3D point and its 2D projections on the stereo images.


Fig. 7.4 Pinhole projection model in stereo vision

According to Fig. 7.4, we can model the projections of the stereo system, from the mathematical point of view, as an extension of the monocular model, seen as rigid transformations (see Sect. 6.7) between the reference systems of the cameras and of the world. The figure uses the same nomenclature as monocular vision, with the addition of the subscripts L and R to indicate the parameters (optical centers, 2D and 3D reference systems, focal length, 2D projections, ...) of the left and right camera, respectively. If T is the column vector representing the translation between the two optical centers $C_L$ and $C_R$ (the origins of the reference systems of each camera), and R is the rotation matrix that orients the axes of the left camera to those of the right camera (or vice versa), then the coordinates of a world point $P_w = (X, Y, Z)$ of 3D space, denoted by $P_L = (X_L, Y_L, Z_L)$ and $P_R = (X_R, Y_R, Z_R)$ in the reference systems of the two cameras, are related to each other by the following equations:

$$P_R = R(P_L - T) \tag{7.93}$$

$$P_L = R^TP_R + T \tag{7.94}$$

where R and T characterize the relationship between the left and right camera coordinate systems, which is independent of the projection model of each camera. R and T are essentially the extrinsic parameters that characterize the stereo system in the pinhole projection model. Now let’s see how to derive the extrinsic parameters of the stereo system R, T knowing the extrinsic parameters of the individual cameras. Normally the cameras are individually calibrated considering known 3D points, defined with respect to a world reference system. We indicate with P w = (X w , Yw , Z w ) the coordinates in the world reference system, and with R L , T L and R R , T R the extrinsic parameters of the two cameras, respectively, the rotation matrices and the translation column vectors. The relationships that project the


point $P_w$ in the image planes of the two cameras (according to the pinhole model), in the respective reference systems, are the following:

$$P_L = R_LP_w + T_L \tag{7.95}$$

$$P_R = R_RP_w + T_R \tag{7.96}$$

We assume that the two cameras have been independently calibrated with one of the methods described in Sect. 7.4, so that their intrinsic and extrinsic parameters are known. The extrinsic parameters of the stereo system are obtained from Eqs. (7.95) and (7.96) as follows:

$$P_L = R_LP_w + T_L = R_L\underbrace{R_R^{-1}(P_R - T_R)}_{\text{from (7.96)}} + T_L = \underbrace{(R_LR_R^{-1})}_{R^T}P_R\;\underbrace{-\,(R_LR_R^{-1})T_R + T_L}_{T} \tag{7.97}$$

and, by comparison with (7.94), we have

$$R^T = R_LR_R^{-1} = R_LR_R^T \quad\Longleftrightarrow\quad R = R_RR_L^T \tag{7.98}$$

$$T = T_L - \underbrace{R_LR_R^{-1}}_{R^T}T_R = T_L - R^TT_R \tag{7.99}$$

where (7.98) and (7.99) define the extrinsic parameters (the rotation matrix R and the translation vector T) of the stereo system. At this point, the stereo system is completely calibrated and can be used for the 3D reconstruction of the scene starting from the 2D stereo projections. This can be done, for example, by triangulation, as described in Sect. 7.5.7.
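A minimal sketch of Eqs. (7.98)–(7.99), assuming the per-camera extrinsics are available as NumPy arrays, is the following; it adopts the convention $P_R = R(P_L - T)$ of Eq. (7.93).

```python
import numpy as np

def stereo_extrinsics(R_L, T_L, R_R, T_R):
    """Relative pose of the stereo pair from the two cameras' extrinsics,
    Eqs. (7.98)-(7.99), with the convention P_R = R (P_L - T) of (7.93)."""
    R = R_R @ R_L.T          # from R^T = R_L R_R^T, Eq. (7.98)
    T = T_L - R.T @ T_R      # Eq. (7.99)
    return R, T
```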

7.5 Stereo Vision and Epipolar Geometry

In Sect. 4.6.8, we have already introduced the epipolar geometry. This section describes how to use epipolar geometry to solve the problem of matching homologous points in a stereo vision system, with the two cameras either calibrated or uncalibrated. In other words, with the epipolar geometry we want to simplify the search for homologous points between the two stereo images.


Let us remember with the help of Fig. 7.5a as a point P in 3D space is acquired by a stereo system and projected (according to the pinhole model) in PL in the left image plane and in PR in the right image plane. Epipolar geometry establishes a relationship between the two corresponding projections PL and PR in the two stereo images acquired by cameras (having optical centers C L and C R ), which can have different intrinsic and extrinsic parameters. Let us briefly summarize notations and properties of epipolar geometry: Baseline is the line that joins the two optical centers and defines the inter-optical distance. Epipolar Plane is the plane (which we will indicate from now on with π ) which contains the baseline line. A family of epipolar planes are generated that rotate around the baseline and passing through the 3D points of the considered scene (see Fig. 7.5b). For each point P, an epipolar plane is generated containing the three points {C L , P, C R }. An alternative geometric definition is to consider the 3D epipolar plane containing the projection PL (or the projection PR ) together with the left and right optical centers C L and C R . Epipole is the intersection point of the baseline line with the image plane. The epi pole can also be seen as the projection of the optical center of a camera on the image plane of the other camera. Therefore, we have two epi poles indicated with e L and e R , respectively, for the left and right image. If the image planes are coplanar (with parallel optical axes) the epi poles are located at the opposite infinites (intersection at infinity between baseline and image planes since they are parallel to each other). Furthermore, the epipolar lines are parallel to an axis of each image plane (see Fig. 7.6). Epipolar Lines indicated with lL and lR are the intersections between an epipolar plane and the image planes. All the epipolar lines intersect in the relative epipoles (see Fig. 7.5b). From Fig. 7.5 and from the properties described above, it can be seen that given a point P of the 3D space, its projections PL and PR in the image planes, and the optical centers C L and C R are in the epipolar plane π generated by the triad {C L , P, C R }. It also follows that the rays drawn backwards from the PL and PR points intersecting in P are coplanar to each other and lie in the same epipolar plane identified. This last property is of fundamental importance in finding the correspondence of the projected points. In fact, if we know PL , to search for the homologous PR in the other image we have the constraint that the plane π is identified by the triad {C L , P, C R } (i.e., from the baseline and from the ray defined by PL ) and consequently also the ray corresponding to the point PR must lie in the plane π , and therefore PR itself (unknown) must be on the line of intersection between the plane π and the plane of the second image. This intersection line is just the right epipolar line lR that can be thought of as the projection in the second image of the backprojected ray from PL . Essentially, lR is the searched epipolar line corresponding to PL and we can indicate this correspondence as follows:



Fig. 7.5 Epipolar geometry. a The baseline is the line joining the optical centers $C_L$ and $C_R$ and intersects the image planes in the respective epipoles $e_L$ and $e_R$. Each plane passing through the baseline is an epipolar plane. The epipolar lines $l_L$ and $l_R$ are obtained from the intersection of an epipolar plane with the stereo image planes. b To each point P of the 3D space corresponds an epipolar plane that rotates around the baseline and intersects the relative pair of epipolar lines in the image planes. In each image plane, the epipolar lines intersect in the relative epipole


Fig. 7.6 Epipolar geometry for a stereo system with parallel optical axes. In this case, the image planes are coplanar and the epi poles are in the opposite infinite. The epipolar lines are parallel to the horizontal axis of each image plane

PL → lR which establishes a dual relationship between a point in an image and the associated line in the other stereo image. For a stereo binocular system, with the epipolar geometry, that is, known the epi poles and the epipolar lines, it is possible to restrict the possible correspondences between the points of the two images by searching the homologue of PL only on the corresponding epipolar line lR in the other image and not over the entire image (see Fig. 7.7). This process must be repeated for each 3D point of the scene.

7.5.1 The Essential Matrix

Let us now look at how to formalize epipolar geometry in algebraic terms, using the Essential matrix [13], to find the correspondences $P_L \rightarrow l_R$ [11]. We denote by


Fig. 7.7 Epipolar geometry for a converging stereo system. For the pair of stereo images, the epipolar lines are superimposed for the simplified search for homologous points according to the dual relationship point/epipolar line PL → lR and PR → lL

P L = (X L , Y L , Z L )T and P R = (X R , Y R , Z R )T , respectively, the coordinates of the point P with respect to the systems of reference (with origin in C L and C R respectively) of the two left and right cameras4 (see Fig. 7.4). Knowing a projection of P in the image planes and the optical centers of the calibrated cameras (known orientations and calibration matrices), we can calculate the epipolar plane and consequently calculate the relative epipolar lines, given by intersection between epipolar plane and image plane. For the properties of epipolar geometry, the projection of P, in the right image plane in P R lies on the right epipolar line. Therefore, the foundation of epipolar geometry is that it allows us to create a strong link between pairs of stereo images without knowing the 3D structure of the scene. Now let’s see how to find an algebraic solution to find these correspondences, point/epipolar line P L → lR , through stereo pairs of images. According to the pinhole model, we consider the left camera with a perspective projection matrix5 P L = [I| 0] and both the origin of the reference stereo system, while the right camera is positioned with respect to the left one according to the perspective projection matrix P R = [R| T ] characterized by extrinsic parameters R and T , respectively, the rotation matrix and the translation vector. Without losing generality, we can assume the calibration matrix K of each camera equal to the identity matrix I. To solve the correspondence problem P L → lR needs to be mapped in the reference system of C R (the right camera), the homologous candidate points of P L , which are on the epipolar line lR in the right image plane.

4 We

know that a 3D point, according to the pinhole model, projected in the image plane in P L defines a ray passing through the optical center C L (in this case the origin of the stereo system) locus of 3D points aligned represented by λ P L . These points can be observed by the right camera and referenced in its reference system to determine the homologous points using the epipolar geometry approach. We will see that this is possible so that we will neglect the parameter λ in the following. 5 From now on, the perspective projection matrices will be indicated with P to avoid confusion with the points of the scene indicated with P.



Fig. 7.8 Epipolar geometry. Derivation of the essential matrix from the coplanarity constraint between the vectors CL PL , CR PR and CR CL

In other words, to find $P_R$ we can map $P_L$ into the reference system of $C_R$ through the roto-translation parameters $[R|T]$.⁶ This is possible by applying Eq. (7.96) as follows:

$$P_R = RP_L + T \tag{7.100}$$

Pre-multiplying both members of (7.100) vectorially by T and then taking the scalar product with $P_R^T$, we get

$$\underbrace{P_R^T(T \times P_R)}_{=0} = P_R^T(T \times RP_L) + P_R^T\underbrace{(T \times T)}_{=0} \tag{7.101}$$

By the property of the vector product, $T \times T = 0$, and since $P_R^T(T \times P_R) = 0$ by coplanarity of the vectors involved, the previous relationship becomes

$$P_R^T(T \times RP_L) = 0 \tag{7.102}$$

From the geometric point of view (see Fig. 7.8), Eq. (7.102) expresses the coplanarity of the vectors $C_LP_L$, $C_RP_R$ and $C_RC_L$, representing, respectively, the projection rays of a point P in the two image planes and the direction of the translation vector T. At this point, we use from algebra the property of the antisymmetric matrix, which has only three independent elements that can be considered as the components of a three-element vector.⁷ For the translation vector $T = (T_x, T_y, T_z)^T$,

⁶ With reference to Note 1, we recall that the matrix R provides the orientation of the camera $C_R$ with respect to $C_L$. The column vectors are the direction cosines of the $C_L$ axes rotated with respect to $C_R$.

⁷ A matrix A is said to be antisymmetric when it satisfies the following properties:

$$A + A^T = 0 \qquad A^T = -A$$

It follows that the elements on the main diagonal are all zeroes, while those outside the diagonal satisfy the relation $a_{ij} = -a_{ji}$. This means that the number of independent elements is only $n(n-1)/2$, and for $n = 3$ we have a $3 \times 3$ matrix of only 3 elements, which can be considered as the components


we would have its representation in terms of an antisymmetric matrix, indicated with $[T]_\times$, as follows:

$$[T]_\times = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix} \tag{7.103}$$

where conventionally $[\bullet]_\times$ indicates the operator that transforms a 3D vector into a $3 \times 3$ antisymmetric matrix, so that $[T]_\times y = T \times y$. It follows that we can express the vector T in terms of the antisymmetric matrix $[T]_\times$ and define the matrix

$$E = [T]_\times R \tag{7.104}$$

which, substituted in (7.102), gives

$$P_R^TEP_L = 0 \tag{7.105}$$

where E, defined by (7.104), is known as the essential matrix which depends only on the rotation matrix R and the translation vector T , and is defined less than a scale factor. Equation (7.105) is still valid by scaling the coordinates from the reference system of the cameras to those of the image planes, as follows:  p L = (x L , y L ) = T

X L YL , ZL ZL

T

 p R = (x R , y R ) = T

X R YR , ZR ZR

T

which are normalized coordinates, so we have pTR E p L = 0

(7.106)
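To make the constraint concrete, the following NumPy sketch (illustrative, not from the book) builds the antisymmetric matrix [T]_×, forms E = [T]_× R, and verifies (7.106) on a synthetic point; the rotation, translation, and point below are arbitrary choices, and sign conventions for [T]_× differ across texts without affecting the (zero-valued) constraint.

import numpy as np

def skew(t):
    # antisymmetric matrix [t]x such that skew(t) @ x equals the cross product t x x
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# hypothetical stereo geometry: small rotation about the y-axis, mostly horizontal baseline
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.2, 0.0, 0.01])

E = skew(T) @ R                        # essential matrix, Eq. (7.104), defined up to scale

P_L = np.array([0.3, -0.1, 2.0])       # 3D point in the left camera frame
P_R = R @ P_L + T                      # same point in the right camera frame, Eq. (7.100)
p_L, p_R = P_L / P_L[2], P_R / P_R[2]  # normalized image coordinates (x, y, 1)
print(p_R @ E @ p_L)                   # approximately 0: the epipolar constraint (7.106)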

This equation realizes the epipolar constraint, i.e., for a 3D point projected in the stereo image planes it relates the homologous vectors p_L and p_R. It also expresses the coplanarity between any two corresponding points p_L and p_R included in the same epipolar plane for the two cameras. In essence, (7.105) expresses the epipolar constraint between the rays coming from the two optical centers that intersect at the point P of the space, while Eq. (7.106) relates homologous points between the image planes. Moreover, for any projection in the left image plane p_L, through the essential matrix E, the epipolar line in the right

of a generic three-dimensional vector v. In this case, we use the symbolism [v]_× or S(v) to indicate the operator that transforms the vector v into an antisymmetric matrix as reported in (7.103). Often this dual form of representation between vector and antisymmetric matrix is used to write the vector (cross) product between two three-dimensional vectors, from the traditional form x × y to the simple matrix product [x]_× y or S(x) y.


image is given by the three-dimensional vector (see Fig. 7.8)^8:

l_R = E p_L    (7.107)

and Eq. (7.106) becomes

p_R^T E p_L = p_R^T l_R = 0    (7.108)

which shows that p_R, the homologue of p_L, is on the epipolar line l_R (defined by 7.107), according to Note 8. Similarly, for any projection in the right image plane p_R, through the essential matrix E, the epipolar line in the left image is given by the three-dimensional vector:

l_L^T = p_R^T E   ⟹   l_L = E^T p_R    (7.109)

and Eq. (7.106) becomes

p_R^T E p_L = l_L^T p_L = 0    (7.110)

which verifies that p_L is on the epipolar line l_L (defined by 7.109), according to Note 8. Epipolar geometry requires that the epipoles lie at the intersection between the epipolar lines and the baseline defined by the translation vector T. Another property of the essential matrix is that its product with the epipoles e_L and e_R is equal to zero:

e_R^T E = 0    E e_L = 0    (7.111)

In fact, for each point p_L, except e_L, in the left image plane, Eq. (7.107) of the right epipolar line l_R = E p_L must hold, and the epipole e_R also lies on this line. Therefore, the epipole e_R must also satisfy (7.108), thus obtaining

e_R^T E p_L = (e_R^T E) p_L = 0   ∀ p_L   ⟹   e_R^T E = 0   or   E^T e_R = 0

The epipole e_R is thus in the left null space of E. Similarly, it is shown for the left epipole e_L that E e_L = 0, i.e., it is in the right null space of E. The equations (7.111) of the epipoles can be used to calculate their position knowing E.

8 With reference to the figure, if we denote by p̃_L = (x_L, y_L, 1) the projection vector of P in the left image plane, expressed in homogeneous coordinates, and with l_L = (a, b, c) the epipolar line expressed as a 3D vector whose equation in the image plane is a x_L + b y_L + c = 0, then the constraint that the point p̃_L is on the epipolar line l_L induces l_L · p̃_L = 0, i.e., l_L^T p̃_L = 0 or p̃_L^T l_L = 0. Furthermore, we recall that the line l passing through two points p_1 and p_2 is given by the vector product l = p_1 × p_2. Finally, the point p given by the intersection of two lines l_1 and l_2 is p = l_1 × l_2.


The essential matrix has rank 2 (so it is also singular) since the antisymmetric matrix [T]_× has rank 2. It also has 5 degrees of freedom: 3 associated with the rotation angles and 2 for the vector T, defined up to a scale factor. We point out that, while the essential matrix E associates a point with a line, the homography matrix H associates a point with another point (p_L = H p_R). An essential matrix has two equal singular values and a third equal to zero. This property can be verified by decomposing it with the SVD method, E = UΣV^T, and observing that the first two elements of the main diagonal of Σ satisfy σ_1 = σ_2 ≠ 0 while σ_3 = 0. Equation (7.106), in addition to solving the correspondence problem in the context of epipolar geometry, is also used for 3D reconstruction. In this case at least 5 corresponding points are chosen in the stereo images, generating a linear system of equations based on (7.106) to determine E, from which R and T are then calculated. We will see in detail in the next paragraphs the 3D reconstruction of the scene with the triangulation procedure.

7.5.2 The Fundamental Matrix

In the previous paragraph, the coordinates of the points in relation to the epipolar lines were expressed in the reference system of the calibrated cameras, in accordance with the pinhole projection model. Let us now obtain a relationship analogous to (7.106) but with the points in the image plane expressed directly in pixels. Suppose that for the same stereo system considered above the camera calibration matrices K_L and K_R are known, with the projection matrices P_L = K_L [I | 0] and P_R = K_R [R | T] for the left and right camera, respectively. We know from Eq. (6.208) that we can obtain, for a 3D point with coordinates (X, Y, Z), the homogeneous coordinates in pixels ũ = (u, v, 1) in the stereo image planes, which for the left and right images are given by

ũ_L = K_L p̃_L    ũ_R = K_R p̃_R    (7.112)

where p̃_L and p̃_R are the homogeneous coordinates of the 3D point projected in the stereo image planes, expressed in the reference system of the cameras. These coordinates can be derived from (7.112), obtaining

p̃_L = K_L^{-1} ũ_L    p̃_R = K_R^{-1} ũ_R    (7.113)

which, substituted in Eq. (7.106) of the essential matrix, gives

(K_R^{-1} ũ_R)^T E (K_L^{-1} ũ_L) = 0   ⟹   ũ_R^T K_R^{-T} E K_L^{-1} ũ_L = 0    (7.114)


from which we can derive the following matrix:

F = K_R^{-T} E K_L^{-1}    (7.115)

where F is known as the Fundamental Matrix (proposed in [14,15]). Finally we get the equation of the epipolar constraint based on F, given by

ũ_R^T F ũ_L = 0    (7.116)

where the fundamental matrix F has size 3 × 3 and rank 2. As for the essential matrix E, Eq. (7.116) is the fundamental algebraic tool, based on the fundamental matrix F, for the 3D reconstruction of a point P of the scene observed from two views. The fundamental matrix represents the constraint of the correspondence of the homologous image points ũ_L ↔ ũ_R, being 2D projections of the same point P of 3D space. As done for the essential matrix, we can derive the epipolar lines and the epipoles from (7.116). For homologous points ũ_L ↔ ũ_R, the point ũ_R is constrained to lie on the epipolar line l_R associated with the point ũ_L, which is given by

l_R = F ũ_L    (7.117)

such that ũ_R^T F ũ_L = ũ_R^T l_R = 0. This dualism, of associating a point of one image plane with the epipolar line of the other image plane, is also valid in the opposite sense, so a point ũ_L in the left image, homologue of ũ_R, must lie on the epipolar line l_L associated with the point ũ_R, which is given by

l_L = F^T ũ_R    (7.118)

such that ũ_R^T F ũ_L = (ũ_R^T F) ũ_L = l_L^T ũ_L = 0. From the equations of the epipolar lines (7.117) and (7.118), subject to the constraint of the fundamental matrix equation (7.115), we can associate the epipolar lines with the respective epipoles ẽ_L and ẽ_R, since the latter must satisfy the following relations:

F ẽ_L = 0    F^T ẽ_R = 0    (7.119)

from which it follows that ẽ_L lies in the (right) null space of F, while ẽ_R lies in the null space of F^T (the left null space of F); equivalently, every right epipolar line F ũ_L passes through ẽ_R and every left epipolar line F^T ũ_R passes through ẽ_L. It should be noted that the position of the epipoles does not necessarily fall within the domain of the image planes (see Fig. 7.9a). A further property of the fundamental matrix concerns the reversed pair: if F is the fundamental matrix relative to the ordered stereo pair of cameras C_L → C_R, then the fundamental matrix F′ of the stereo pair ordered in reverse, C_R → C_L, is equal to F^T. In fact, applying (7.116) to the reversed pair we have

ũ_L^T F′ ũ_R = 0   ⟹   ũ_R^T F′^T ũ_L = 0   for which   F′ = F^T
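As an illustrative sketch (not from the book), given an estimated F the epipolar lines (7.117)-(7.118) are plain matrix-vector products, and the epipoles (7.119) can be recovered as the null vectors of F and F^T via the SVD; the function names below are chosen here.

import numpy as np

def epipolar_lines(F, u_L, u_R):
    # l_R = F u_L (7.117) and l_L = F^T u_R (7.118), homogeneous lines (a, b, c)
    return F @ u_L, F.T @ u_R

def epipoles(F):
    # e_L spans the null space of F (F e_L = 0); e_R spans the null space of F^T (7.119)
    _, _, Vt = np.linalg.svd(F)
    e_L = Vt[-1]
    _, _, Vt_t = np.linalg.svd(F.T)
    e_R = Vt_t[-1]
    return e_L, e_R            # homogeneous vectors, defined up to scale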



Fig. 7.9 Epipolar geometry and projection of homologous points through the homography plane. a Epipoles on the baseline but outside the image planes; b Projection of homologous points by homography plane not passing through optical centers

Finally, we analyze a further feature of the fundamental matrix F by rewriting the equation of the fundamental matrix (7.115) with the essential matrix E expressed by (7.104), thus obtaining

F = K_R^{-T} E K_L^{-1} = K_R^{-T} [T]_× R K_L^{-1}    (7.120)

We know that the determinant of the antisymmetric matrix [T]_× is zero; it follows that det(F) = 0 and the rank of F is 2. Both matrices encode the constraints of the epipolar geometry of the two cameras and simplify the correspondence problem by mapping points of one image onto the epipolar line of the other image. However, from Eqs. (7.114) and (7.120), which relate the two matrices F and E, it emerges that the essential matrix uses the camera coordinates and depends on the extrinsic parameters (R and T), while the fundamental matrix operates directly with the coordinates in pixels and abstracts from the knowledge of the intrinsic and extrinsic parameters of the cameras. Knowing the intrinsic parameters (the matrices K), from (7.120) it is observed that the fundamental matrix reduces to the essential matrix, which operates directly in camera coordinates. An important difference between the E and F matrices is the number of degrees of freedom: the essential matrix has 5, while the fundamental matrix has 7.
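As a small illustrative sketch (not from the book), the relations (7.115)/(7.120) between the two matrices can be applied directly once the calibration matrices are known; the code below assumes K_L and K_R are known 3 × 3 calibration matrices.

import numpy as np

def fundamental_from_essential(E, K_L, K_R):
    # F = K_R^-T E K_L^-1, Eq. (7.115)
    return np.linalg.inv(K_R).T @ E @ np.linalg.inv(K_L)

def essential_from_fundamental(F, K_L, K_R):
    # inverse relation: E = K_R^T F K_L
    return K_R.T @ F @ K_L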

7.5.2.1 Relationship Between Fundamental and Homography Matrix

The fundamental matrix is an algebraic representation of epipolar geometry. Let us now see a geometric interpretation of the fundamental matrix that maps homologous points in two phases [10]. In the first phase, the point p_L is mapped to a point p_R in the right image that is potentially its homologue and that we know lies on the right epipolar line l_R. In the second phase, the epipolar line l_R is calculated as the line passing through p_R and the epipole e_R, according to the epipolar geometry. With reference to Fig. 7.9b, we consider a point P of the 3D space lying in a plane π (not passing through the optical centers of the cameras) and projected in the left image plane in the point p_L with coordinates u_L. Then P is projected in the right image plane in


the point p_R with coordinates u_R. Basically, the projection of P in the left and right image planes can be considered as occurring through the plane π. From epipolar geometry we know that p_R lies on the epipolar line l_R (projection of the ray P − p_L), and that this line also passes through the right epipole e_R. Any other point of the plane π is projected in the same way in the stereo image planes, thus realizing a homographic projection H that maps each point p_Li of one image plane into the corresponding point p_Ri in the other image plane. Therefore, we can consider the homologous points between the stereo image planes as mapped by the 2D homography transformation:

u_R = H u_L

Then, imposing the constraint that the epipolar line l_R is the straight line passing through p_R and the epipole e_R, with reference to Note 8, we have

l_R = e_R × u_R = [e_R]_× u_R = [e_R]_× H u_L = F u_L    (7.121)

from which, considering also (7.117), we obtain the sought relationship between the homography matrix and the fundamental matrix, given by

F = [e_R]_× H    (7.122)

where H is the homography matrix with rank 3, F is the fundamental matrix of rank 2, and [e_R]_× is the epipole vector expressed as an antisymmetric matrix of rank 2. Equation (7.121) is valid for any ith point p_Li projected from the plane π and must satisfy the equation of epipolar geometry (7.116). In fact, replacing in (7.116) u_R given by the homography transformation and considering the constraint that the homologue of each p_Li must lie on the epipolar line l_R, given by (7.121), we can verify that the epipolar geometry constraint remains valid, as follows:

u_R^T F u_L = (H u_L)^T [e_R]_× u_R = u_R^T [e_R]_× H u_L = 0    (7.123)

thus confirming the relationship between the fundamental and homography matrix expressed by (7.122). From the geometric point of view, it has been shown that the fundamental matrix maps a 2D point of one image plane onto the epipolar line l_R passing through the homologous point and the epipole e_R of the other image plane (see Fig. 7.8), abstracting from the scene structure. The planar homography, on the other hand, is a one-to-one projective transformation (H of size 3 × 3 and rank 3) between 2D points, applied directly to the points of the scene (those belonging to the homography plane π) with the transformation u_R = H u_L, and can be considered as a special case of the fundamental matrix.


7.5.3 Estimation of the Essential and Fundamental Matrix

Both the E and F matrices can be estimated experimentally using a numerical method, knowing a set of corresponding points between the stereo images. In particular, it is possible to estimate the fundamental matrix without knowing the intrinsic and extrinsic parameters of the cameras, while for the essential matrix the attitudes of the cameras must be known.

7.5.3.1 8-Point Algorithm

A method for calculating the fundamental matrix is the one proposed in [10,13], known as the 8-point algorithm. In this approach, at least 8 correspondences u_l = (u_l, v_l, 1) ↔ u_r = (u_r, v_r, 1) between stereo images are used. The F matrix is estimated by setting up a homogeneous system of linear equations, applying the equation of the epipolar constraint (7.116) for n ≥ 8, as follows:

u_ri^T F u_li = 0    i = 1, . . . , n    (7.124)

which in matrix form is

(u_ri, v_ri, 1) [f_11, f_12, f_13;  f_21, f_22, f_23;  f_31, f_32, f_33] (u_li, v_li, 1)^T = 0    (7.125)

from which, by making it explicit, we get

u_li u_ri f_11 + u_li v_ri f_21 + u_li f_31 + v_li u_ri f_12 + v_li v_ri f_22 + v_li f_32 + u_ri f_13 + v_ri f_23 + f_33 = 0    (7.126)

If we group in the 9 × 1 vector f = (f_11, . . . , f_33) the unknown elements of F, (7.126) can be reformulated as an inner product between vectors in the form:

(u_li u_ri, u_li v_ri, u_li, v_li u_ri, v_li v_ri, v_li, u_ri, v_ri, 1) · f = 0    (7.127)

Therefore, we have an equation for every correspondence u_li ↔ u_ri, and with n correspondences we can assemble a homogeneous system of n linear equations:

A f = 0    (7.128)

where A is the n × 9 matrix derived from the n correspondences, whose ith row is the coefficient vector (u_li u_ri, u_li v_ri, u_li, v_li u_ri, v_li v_ri, v_li, u_ri, v_ri, 1) of (7.127), and f = (f_11, f_12, f_13, f_21, f_22, f_23, f_31, f_32, f_33)^T. To determine f, the correspondence matrix A of the homogeneous system must have rank 8: in this case we have a unique solution for f up to a


scale factor, which can be determined with linear methods as the null space of the system. Therefore, 8 correspondences are sufficient, from which the name of the algorithm follows. In reality, the coordinates of the homologous points in stereo images are affected by noise, and to have a more accurate estimate of f it is useful to use a number of correspondences n ≫ 8. In this case, the system is solved with the least squares method, finding a solution f that minimizes the following summation:

Σ_{i=1}^{n} (u_ri^T F u_li)²    (7.129)

subject to the additional constraint ‖f‖ = 1, since the scale of f is arbitrary. The least squares solution for f corresponds to the smallest singular value of the SVD decomposition A = UΣV^T, taking the components of the last column vector of V (which corresponds to the smallest singular value). Recalling some properties of the fundamental matrix, it is necessary to make some considerations. We know that F is a singular square matrix (det(F) = 0) of size 3 × 3 (9 elements) with rank 2. F also has 7 degrees of freedom, motivated as follows. The constraint of rank 2 implies that one column is a linear combination of the other two; for example, the third column is a linear combination of the first two, so two coefficients specify that combination and, together with the six elements of the first two columns, give eight independent parameters. Furthermore, operating in homogeneous coordinates, the elements of F are defined only up to a scale factor without violating the epipolar constraint (7.116). It follows that the degrees of freedom are reduced to 7. Another aspect to consider is the effect of the noise present in the correspondence data on the SVD decomposition of the matrix A. The smallest singular value is in practice different from zero, and therefore the estimate of F obtained is not really of rank 2. This implies a violation of the epipolar constraint when this approximate value of F is used, and therefore the epipolar lines (given by Eqs. 7.117 and 7.118) do not exactly intersect in their epipoles. It is, therefore, advisable to correct the F matrix obtained from the decomposition of A, applying a new SVD decomposition directly on this first estimate of F to obtain a new rank-2 estimate F̂ that minimizes the Frobenius norm,^9 as follows:

min_F̂ ‖F − F̂‖_F    subject to    det(F̂) = 0    (7.130)

9 The Frobenius norm is an example of a matrix norm that can be interpreted as the norm of the vector of the elements of a square matrix A, given by

‖A‖_F = sqrt( Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij² ) = sqrt( Tr(A^T A) ) = sqrt( Σ_{i=1}^{r} λ_i ) = sqrt( Σ_{i=1}^{r} σ_i² )

where A is an n × n square matrix of real elements, r ≤ n is the rank of A, λ_i = σ_i² is the ith nonzero eigenvalue of A^T A, and σ_i = sqrt(λ_i) is the ith singular value of the SVD decomposition of A. In the more general case, Tr(A* A) with the conjugate transpose A* should be considered.

where

F̂ = U Σ V^T    (7.131)

Given the SVD (7.131) of the first estimate, the rank-2 matrix closest to F is obtained by setting the third singular value to zero, σ_33 = 0. Therefore, the best approximation is obtained by recalculating F with the updated matrix Σ, as follows:

F = U Σ V^T = U [σ_11, 0, 0;  0, σ_22, 0;  0, 0, 0] V^T    (7.132)

This arrangement, which forces the approximation of F to rank 2, reduces as much as possible the error in mapping a point to the epipolar line between the stereo images and ensures that all the epipolar lines converge in the relative epipoles. In reality, this trick reduces the error as much as possible but does not eliminate it completely.

The essential matrix E is calculated when the correspondences p_l = (x_l, y_l, 1) ↔ p_r = (x_r, y_r, 1) are expressed in homogeneous image coordinates of the calibrated cameras. The calculation procedure is identical to that of the fundamental matrix using the 8-point algorithm (or more points), so that the correspondences satisfy Eq. (7.106) of the epipolar constraint p_R^T E p_L = 0. Therefore, indicating with e = (e_11, . . . , e_33) the 9-dimensional vector that groups the unknown elements of E, we can obtain a homogeneous system of linear equations analogous to the system (7.128), written in compact form:

B e = 0    (7.133)

where B is the data matrix of the correspondences, of size n × 9, the analog of the matrix A of the system (7.128) relative to the fundamental matrix. As with the fundamental matrix, the least squares solution for e corresponds to the smallest singular value of the SVD decomposition B = UΣV^T, taking the components of the last column vector of V. The same considerations on data noise remain, so the solution obtained may not be exactly of rank 2; therefore it is also convenient, for the essential matrix, to reapply the SVD decomposition directly on the first estimate of E to get a new estimate Ê = UΣV^T. The only difference in the calculation procedure concerns the different properties of the two matrices. Indeed, the essential matrix, with respect to the fundamental one, has the further constraint that its two nonzero singular values are equal. To take this into account, the diagonal matrix is modified by imposing Σ = diag(1, 1, 0), and the essential matrix is E = U diag(1, 1, 0) V^T, which is the best approximation of the normalized essential matrix that minimizes the Frobenius norm. It is also shown that, if from the SVD decomposition Ê = UΣV^T we have that Σ = diag(a, b, c)


with a ≥ b ≥ c, and we set Σ = diag((a + b)/2, (a + b)/2, 0), then the essential matrix is recomputed as E = UΣV^T, which is the closest essential matrix in the sense of the Frobenius norm.
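The following NumPy sketch (illustrative, not the book's implementation; names chosen here) summarizes the steps just described: assembling the homogeneous system, extracting the least squares solution from the SVD, and correcting the singular values of the estimated F or E. Note that the ordering of the monomials in each row must match the chosen ordering of f; here f is taken in row-major order (f11, f12, ..., f33).

import numpy as np

def eight_point(uL, uR):
    # uL, uR: (n, 2) arrays of corresponding image coordinates, n >= 8
    n = uL.shape[0]
    A = np.zeros((n, 9))
    for i in range(n):
        ul, vl = uL[i]
        ur, vr = uR[i]
        # each row encodes uR_i^T F uL_i = 0, with f in row-major order
        A[i] = [ur * ul, ur * vl, ur, vr * ul, vr * vl, vr, ul, vl, 1.0]
    _, _, Vt = np.linalg.svd(A)           # least squares solution with ||f|| = 1
    return Vt[-1].reshape(3, 3)

def enforce_rank2(F):
    # closest rank-2 matrix in the Frobenius norm: zero the smallest singular value (7.132)
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt

def enforce_essential(E):
    # an essential matrix must also have two equal nonzero singular values
    U, s, Vt = np.linalg.svd(E)
    m = (s[0] + s[1]) / 2.0
    return U @ np.diag([m, m, 0.0]) @ Vt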

7.5.3.2 7-Point Algorithm

With the same approach as the 8-point algorithm, it is possible to calculate the essential and fundamental matrix considering only 7 correspondences, since the fundamental matrix has 7 degrees of freedom as described in the previous paragraph. This means that the respective data matrices A and B are of size 7 × 9 and in general have rank 7. Solving, in this case, the homogeneous system A f = 0 with 7 correspondences, the system presents a two-dimensional set of solutions, generated by two bases f_1 and f_2 (calculated with the SVD and belonging to the null space of A), which correspond to two matrices F_1 and F_2. The two-dimensional solution of the system has the form f = α f_1 + (1 − α) f_2, with α a scalar variable. The solution expressed in matrix form is

F = α F_1 + (1 − α) F_2    (7.134)

For F we can impose the constraint det(F) = 0, for which we have det(α F_1 + (1 − α) F_2) = 0, so that F has rank 2. This constraint leads to a nonlinear cubic equation in the unknown α, with F_1 and F_2 known. This equation has either 1 or 3 real solutions for α. In the case of 3 solutions, these must be verified by replacing them in (7.134), discarding the degenerate ones. Recall that the essential matrix has 5 degrees of freedom, and a homogeneous system of linear equations B e = 0 can be set up as before with the data matrix B of size 5 × 9, built with only 5 correspondences. Compared to overdetermined systems, its implementation is more complex. In [16], an algorithm is proposed for the estimation of E from just 5 correspondences.
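As an illustrative sketch (assuming F1 and F2 are the two null-space bases of the 7 × 9 matrix A, reshaped to 3 × 3; names chosen here), the cubic constraint det(αF_1 + (1 − α)F_2) = 0 can be handled numerically by sampling the determinant at four values of α, fitting the cubic, and keeping its real roots:

import numpy as np

def seven_point_candidates(F1, F2):
    # det(a*F1 + (1 - a)*F2) is a cubic polynomial in a: recover it from 4 samples
    xs = np.array([0.0, 1.0, 2.0, 3.0])
    ys = np.array([np.linalg.det(x * F1 + (1 - x) * F2) for x in xs])
    coeffs = np.polyfit(xs, ys, 3)            # cubic coefficients
    roots = np.roots(coeffs)
    alphas = [r.real for r in roots if abs(r.imag) < 1e-9]
    # each real root gives a rank-2 candidate fundamental matrix, Eq. (7.134)
    return [a * F1 + (1 - a) * F2 for a in alphas]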

7.5.4 Normalization of the 8-Point Algorithm

The 8-point algorithm described above for the estimation of the essential and fundamental matrices uses the basic least squares approach and, if the error in experimentally determining the coordinates of the correspondences is contained, it produces acceptable results. To reduce numerical instability, due to data noise and above all, as in this case, to correspondence coordinates expressed with a large numerical range (a badly conditioned data matrix, whose SVD yields distorted singular values), it is advisable to apply a normalization process to the data before applying the 8-point algorithm [14].


This normalization process consists in applying to the coordinates a translation that establishes a new reference system with origin in the centroid of the points in the image plane. Subsequently, the coordinates are scaled so that the mean distance of the points from the centroid is of the order of 1–2 pixels. This transformation, a combination of scaling and translation of the origin of the data, is carried out through two transformation matrices T_L and T_R for the left and right stereo images, respectively. Indicating with û = (û_i, v̂_i, 1) the normalized coordinates, expressed in pixels, of the ith point in the image plane, and with T the transformation matrix that normalizes the input coordinates u = (u_i, v_i, 1), the transformation equation is given by

(û_i, v̂_i, 1)^T = ((u_i − μ_u)/μ_d, (v_i − μ_v)/μ_d, 1)^T = [1/μ_d, 0, −μ_u/μ_d;  0, 1/μ_d, −μ_v/μ_d;  0, 0, 1] (u_i, v_i, 1)^T = T u_i    (7.135)

where the centroid (μ_u, μ_v) and the average distance from the centroid μ_d are calculated for n points as follows:

μ_u = (1/n) Σ_{i=1}^{n} u_i    μ_v = (1/n) Σ_{i=1}^{n} v_i    μ_d = (1/n) Σ_{i=1}^{n} sqrt((u_i − μ_u)² + (v_i − μ_v)²)    (7.136)

According to (7.135) and (7.136), the normalization matrices T_L and T_R relative to the two stereo cameras are computed, and the correspondence coordinates are then normalized as follows:

û_L = T_L u_L    û_R = T_R u_R    (7.137)

After the normalization of the data, the fundamental matrix F_n is estimated with the approach indicated above; it then needs to be denormalized to be used with the original coordinates. The denormalized version F is obtained from the epipolar constraint equation as follows:

u_R^T F u_L = û_R^T T_R^{-T} F T_L^{-1} û_L = û_R^T F_n û_L = 0   ⟹   F = T_R^T F_n T_L    (7.138)
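A minimal sketch of the complete normalized procedure, reusing the eight_point and enforce_rank2 functions from the earlier sketch (names and structure are illustrative):

import numpy as np

def normalization_matrix(u):
    # u: (n, 2) pixel coordinates; builds the matrix T of Eq. (7.135)
    mu = u.mean(axis=0)
    md = np.mean(np.linalg.norm(u - mu, axis=1))
    return np.array([[1.0 / md, 0.0, -mu[0] / md],
                     [0.0, 1.0 / md, -mu[1] / md],
                     [0.0, 0.0, 1.0]])

def normalized_eight_point(uL, uR):
    TL, TR = normalization_matrix(uL), normalization_matrix(uR)
    # normalize the correspondences in homogeneous coordinates, Eq. (7.137)
    hL = (np.c_[uL, np.ones(len(uL))] @ TL.T)[:, :2]
    hR = (np.c_[uR, np.ones(len(uR))] @ TR.T)[:, :2]
    Fn = enforce_rank2(eight_point(hL, hR))
    return TR.T @ Fn @ TL                 # denormalization, Eq. (7.138)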

7.5.5 Decomposition of the Essential Matrix

With the 8-point algorithm (see Sect. 7.5.3), we have calculated the fundamental matrix F and, knowing the matrices K of the stereo cameras, it is possible to calculate the essential matrix E with (7.115). Alternatively, E can be calculated directly with (7.106), which we know includes the extrinsic parameters, that is, the rotation matrix R and the translation vector T. R and T are precisely the result of the decomposition of E that we want to accomplish. Recall from (7.104) that the essential matrix E can be expressed in the following form:

E = [T]_× R    (7.139)


which suggests that we can decompose E into two components: the vector T, expressed in terms of the antisymmetric matrix [T]_×, and the rotation matrix R. By virtue of the theorems demonstrated in [17,18] we have

Theorem 7.1 An essential matrix E of size 3 × 3 can be factored as the product of a rotation matrix and a nonzero antisymmetric matrix if and only if E has two equal nonzero singular values and a null singular value.

Theorem 7.2 Suppose that E can be factored into a product RS, where R is an orthogonal matrix and S is an antisymmetric matrix. Let the SVD of E be given by E = UΣV^T, where Σ = diag(k, k, 0). Then, up to a scale factor, the possible factorization is one of the following:

S = U Z U^T    R = U W V^T  or  R = U W^T V^T    E = RS    (7.140)

where W and Z are a rotation matrix and an antisymmetric matrix, respectively, defined as follows:

W = [0, 1, 0;  −1, 0, 0;  0, 0, 1]    Z = [0, −1, 0;  1, 0, 0;  0, 0, 0]    (7.141)

Since the scale of the essential matrix does not matter, it has 5 degrees of freedom. The reduction from 6 to 5 degrees of freedom produces an extra constraint on the singular values of E; moreover, we have that det(E) = 0 and, since the scale is arbitrary, we can assume both nonzero singular values equal to 1, having an SVD given by

E = U diag(1, 1, 0) V^T    (7.142)

But this decomposition is not unique. Furthermore, U and V being orthogonal matrices, det(U) = det(V^T) = 1; if we have an SVD like (7.142) with det(U) = det(V^T) = −1, then we can change the sign of the last column of V. Alternatively, we can change the sign of E and then get the SVD −E = U diag(1, 1, 0)(−V)^T with det(U) = det(−V^T) = 1. It is highlighted that the SVD of −E generates a different decomposition, since it is not unique. Now let us see, with the decomposition of E according to (7.142), the possible solutions, considering that

Z W = diag(1, 1, 0)    Z W^T = −diag(1, 1, 0)    (7.143)

and the possible solutions are E = S_1 R_1, where

S_1 = −U Z U^T    R_1 = U W^T V^T    (7.144)

and E = S_2 R_2, where

S_2 = U Z U^T    R_2 = U W V^T    (7.145)


Now let us verify that these are two possible solutions for E, by first checking that R_1 and R_2 are rotation matrices. In fact, remembering the properties of rotation matrices (see Note 1), it must result that

R_1^T R_1 = (U W^T V^T)^T (U W^T V^T) = V W U^T U W^T V^T = I    (7.146)

and therefore R_1 is orthogonal. It must also be shown that det(R_1) = 1:

det(R_1) = det(U W^T V^T) = det(U) det(W^T) det(V^T) = det(W) det(U V^T) = 1    (7.147)

To check instead that S_1 is an antisymmetric matrix, it must be S_1 = −S_1^T. Therefore, we have

−S_1^T = (U Z U^T)^T = U Z^T U^T = −U Z U^T = S_1    (7.148)

To verify that the possible decompositions are valid, that is, that the last equation of (7.140) is satisfied, we must get E = S_1 R_1 = S_2 R_2, verifying

S_1 R_1 = −U Z U^T U W^T V^T = −U Z W^T V^T = −U (−diag(1, 1, 0)) V^T = E    (7.149)

where the third equality follows from Eq. (7.143).

By virtue of (7.142), the last step of (7.149) shows that the decomposition S_1 R_1 is valid. Similarly, it is shown that the decomposition S_2 R_2 is also valid. Two possible solutions have, therefore, been obtained for each essential matrix E, and it is proved that there are only two [10]. Similarly to what has been done for the possible solutions of R, we have to examine the possible solutions for the translation vector T, which can assume different values. We know that T is encapsulated in the antisymmetric matrix S, such that S = [T]_×, obtained from the two possible decompositions. From the definition of the vector product we have

S T = [T]_× T = U Z U^T T = T × T = 0    (7.150)

Therefore, the vector T is in the null space of S, which is the same as the null space of the matrices S_1 and S_2. It follows that the sought estimate of T from this decomposition, by virtue of (7.150), corresponds to the third column of U, as follows^10:

T = U (0, 0, 1)^T = u_3    (7.151)

10 For the decomposition predicted by (7.140), it must result that the solution is T = U[0 0 1]^T, since it must satisfy (7.150), that is S T = 0, according to the property of an antisymmetric matrix. In fact, for T = u_3 the following condition is satisfied:

S u_3 = U Z U^T u_3 = U [0, −1, 0;  1, 0, 0;  0, 0, 0] U^T u_3 = [u_2  −u_1  0] [u_1 u_2 u_3]^T u_3 = u_2 u_1^T u_3 − u_1 u_2^T u_3 = 0

Fig. 7.10 The 4 possible solutions of the pair R and T in the decomposition of E. A reversal of the baseline (inverted optical centers) is observed horizontally, while vertically there is a rotation of 180° around the baseline. Only configuration (a) correctly reconstructs the 3D point of the scene, which lies in front of both cameras

Let us now observe that if T is in the null space of S, the same holds for λT; in fact, for any nonzero value of λ we have a valid solution, since

[λT]_× R = λ[T]_× R = λE    (7.152)

which is still a valid essential matrix, defined up to an unknown scale factor λ. We know that this decomposition is not unique given the ambiguity of the sign of E, and consequently the sign of T is also undetermined, considering that S = U(±Z)U^T. Summing up, for a given essential matrix there are 4 possible choices of the projection matrix P_R for the right camera, since there are two choice options for both R and T, given by the following:

P_R = [U W V^T | ±u_3]  or  [U W^T V^T | ±u_3]    (7.153)

By obtaining 4 potential pairs (R, T), there are 4 possible configurations of the stereo system, rotating the camera in one direction or in the opposite direction, with the possibility of translating it in two opposite directions, as shown in Fig. 7.10. The choice of the appropriate pair is made, for each 3D point to be reconstructed by triangulation, by selecting the configuration where the points are in front of the stereo system (in the direction of the positive z-axis).
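An illustrative sketch of the enumeration of the four candidate pairs, using the matrix W of (7.141); the function name is chosen here, and the physically valid pair is selected afterwards by checking that triangulated points have positive depth in both cameras.

import numpy as np

def decompose_essential(E):
    U, _, Vt = np.linalg.svd(E)
    # keep U and V proper (det = +1); E is defined up to sign, so this is allowed
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, 1.0, 0.0],     # rotation W as in Eq. (7.141)
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt                    # R = U W V^T
    R2 = U @ W.T @ Vt                  # R = U W^T V^T
    t = U[:, 2]                        # translation direction u3, up to sign and scale (7.151)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]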

7.5.6 Rectification of Stereo Images

With epipolar geometry, the problem of searching for homologous points is reduced to mapping a point of one image onto the corresponding epipolar line in the other image. It is possible to simplify the correspondence problem further, to a one-dimensional point-to-point search between the stereo images. For example, we can apply an appropriate geometric transformation (e.g., projective) with resampling (see Sect. 3.9 Vol. II) to the stereo images so as to make the epipolar lines parallel and thus simplify the search for homologous points as a 1D correspondence problem. This also simplifies the correlation process that evaluates the similarity of the homologous patterns (described in Chap. 1). This image alignment process is known as rectification of stereo images, and several algorithms have been proposed based on the constraints of epipolar geometry (using uncalibrated cameras, where the fundamental matrix implicitly includes the intrinsic parameters) or on the knowledge of the intrinsic and extrinsic parameters of calibrated cameras. Rectification algorithms with uncalibrated cameras [10,19] operate without explicit camera parameter information, which is implicitly included in the essential and fundamental matrices used for image rectification. The nonexplicit use of the calibration parameters makes it possible to simplify the search for homologous points by operating on the aligned homography projections of the images; for the 3D reconstruction, however, there is the problem that objects observed at different scales or from different perspectives may appear identical in the homography projections of the aligned images. In the approaches with calibrated cameras, the intrinsic and extrinsic parameters are used to perform geometric transformations that horizontally align the cameras and make the epipolar lines parallel to the x-axis. In essence, the transformed images can be thought of as reacquired with a new configuration of the stereo system, where the alignment takes place by rotating the cameras around their optical centers, taking care to minimize distortion errors in the perspective reprojections.

7.5.6.1 Uncalibrated Rectification

Consider the initial configuration of stereo vision with the cameras arranged with parallel optical axes and therefore with the image planes coplanar and vertically aligned (known as the canonical or lateral configuration, see Fig. 7.6). We assume that the calibration matrix K is the same for both cameras and that the essential matrix E is known (for example, calculated beforehand with the SVD method described in the previous paragraphs). Since the cameras are not rotated with respect to each other, we can assume R = I, where I is the identity matrix. If b is the baseline (distance between the optical centers), we have T = (b, 0, 0) and, considering (7.104), we get

E = [T]_× R = [0, 0, 0;  0, 0, −b;  0, b, 0]    (7.154)

and according to the equation of the epipolar constraint (7.106) we have

p_R^T E p_L = (x_R, y_R, 1) [0, 0, 0;  0, 0, −b;  0, b, 0] (x_L, y_L, 1)^T = (x_R, y_R, 1) (0, −b, b y_L)^T = 0   ⟹   b y_R = b y_L    (7.155)

from which it emerges that the vertical coordinate y is the same for the homologous points and the equation of the epipolar line l_R = (0, −b, b y_L) associated with the point p_L is horizontal. Similarly, we have the epipolar line l_L = E^T p_R = (0, b, −b y_R)


associated with the point p_R. Therefore, a 3D point of the scene always appears on the same row in the two stereo images. The same result is obtained by calculating the fundamental matrix F for the parallel stereo cameras. Indeed, assume for the two cameras the perspective projection matrices

P_L = K_L [I | 0]    P_R = K_R [R | T]    with  K_L = K_R = I,  R = I,  T = (b, 0, 0)

where b is the baseline. By virtue of Eq. (7.120), we get the fundamental matrix:

F = K_R^{-T} [T]_× R K_L^{-1} = [1, 0, 0; 0, 1, 0; 0, 0, 1] [0, 0, 0; 0, 0, −b; 0, b, 0] [1, 0, 0; 0, 1, 0; 0, 0, 1] = [0, 0, 0; 0, 0, −1; 0, 1, 0]  (up to the scale factor b)    (7.156)

and according to the equation of the epipolar constraint (7.116) we have

u_R^T F u_L = (u_R, v_R, 1) [0, 0, 0;  0, 0, −1;  0, 1, 0] (u_L, v_L, 1)^T = (u_R, v_R, 1) (0, −1, v_L)^T = 0   ⟹   v_R = v_L    (7.157)

We thus have that, even with F, the vertical coordinate v is the same for the homologous points and the equation of the epipolar line l_R = F u_L = (0, −1, v_L) associated with the point u_L is horizontal. Similarly, we have the epipolar line l_L = F^T u_R = (0, 1, −v_R) associated with the point u_R. Now let us see how to rectify stereo images acquired in the noncanonical configuration, with converging and non-calibrated cameras, of which we can estimate the fundamental matrix (with the normalized 8-point algorithm) and consequently calculate the epipolar lines in the two images for the correspondences considered. Knowing the fundamental matrix and the epipolar lines, it is then possible to calculate the relative epipoles.^11 At this point, knowing the epipoles e_L and e_R, we can already check whether the stereo system is in the canonical configuration or not. From the epipolar geometry we know (from Eq. 7.119) that the epipole is the vector in the null space of the fundamental matrix F, for which F · e = 0. Therefore, from (7.156) the fundamental matrix of a canonical configuration is known, and in this case we will have

F · e = [0, 0, 0;  0, 0, −1;  0, 1, 0] (1, 0, 0)^T = 0    (7.158)

11 According to epipolar geometry, we know that the epipolar lines intersect in the relative epipoles. Given the noise present in the correspondence coordinates, in reality the epipolar lines intersect not in a single point e but in a small area. Therefore, it is necessary to optimize the calculation of the position of each epipole, considering the center of gravity of this area; this is achieved with the least squares method to minimize this error. Remembering that each line is represented with a 3D vector of the type l_i = (a_i, b_i, c_i), the set of epipolar lines {l_1, l_2, . . . , l_n} can be grouped in an n × 3 matrix L to form a homogeneous linear system L · e = 0 in the unknown epipole vector e, solvable with the SVD (singular value decomposition) method.


Fig. 7.11 Rectification of stereo image planes. Stereo images, acquired from a noncanonical stereo configuration, are reprojected into image planes that are coplanar and parallel to the baseline. The epipolar lines correspond to the lines of the rectified images

for which

e = (1, 0, 0)^T    (7.159)

is the solution vector of the epipole, corresponding to the configuration with parallel cameras, parallel epipolar lines, and epipole at infinity in the horizontal direction. If the configuration is not canonical, it is necessary to carry out an appropriate homography transformation for each stereo image to make them coplanar with each other (see Fig. 7.11), so as to obtain each epipole at infinity along the horizontal axis, according to (7.159). If we indicate with H_L and H_R the homography transforms that rectify the original left and right images, respectively, and with û_L and û_R the homologous points in the rectified images, these are defined as follows:

û_L = H_L ũ_L    û_R = H_R ũ_R    (7.160)

where ũ_L and ũ_R are homologous points in the original images of the noncanonical stereo system, of which we know F. We know that the latter satisfy the epipolar geometry constraint given by (7.116), so considering Eq. (7.160) we have

ũ_R^T F ũ_L = (H_R^{-1} û_R)^T F (H_L^{-1} û_L) = û_R^T H_R^{-T} F H_L^{-1} û_L = 0    (7.161)

from which we have that the fundamental matrix F̂ for the rectified images must correspond, according to (7.156), to the following factorization:

F̂ = H_R^{-T} F H_L^{-1} = [0, 0, 0;  0, 0, −1;  0, 1, 0]    (7.162)

Therefore, once homography transforms H_L and H_R satisfying (7.162) are found, the images are rectified, obtaining the epipoles at infinity as required. The problem is


that these homography transformations are not unique, and if chosen improperly they generate distorted rectified images. One idea is to consider homography transformations close to rigid transformations, rotating and translating the image with respect to a point of the image (for example, its center). This is equivalent to carrying out the rectification with the techniques described in Chap. 3 Vol. II with linear geometric transformations and image resampling. In [19], an approach is described which minimizes the distortions of the rectified images by decomposing the homographies into elementary transformations, H = H_p H_r H_s, where H_p indicates a projective transformation, H_r a similarity transformation, and H_s a shearing transformation (i.e., one that takes into account the deformations that incline the flat shape of an object along the coordinate axes u or v, or both). In [10] a rectification method is proposed that first performs a homography transformation on the right image to obtain the combined effect of a rigid roto-translation around the center of the image (in homogeneous coordinates (0, 0, 1)), followed by a transformation that takes a point in (f, 0, 1) and maps it to infinity in (f, 0, 0) along the horizontal axis u. In particular, if we consider the image center in homogeneous coordinates (0, 0, 1) as the reference point, the pixel coordinates of the right image are translated with the matrix T given by

T = [1, 0, −L/2;  0, 1, −H/2;  0, 0, 1]    (7.163)

where H and L are the height and width of the image, respectively. After applying the translation, we apply a rotation R to position the epipole on the horizontal axis at a certain point (f, 0, 1). If the translated epipole T e_R is in position (e_Ru, e_Rv, 1), the rotation applied is

R = [α e_Ru/√(e_Ru² + e_Rv²),  α e_Rv/√(e_Ru² + e_Rv²),  0;
     −α e_Rv/√(e_Ru² + e_Rv²),  α e_Ru/√(e_Ru² + e_Rv²),  0;
     0,  0,  1]    (7.164)

where α = 1 if e_Ru ≥ 0 and α = −1 otherwise. After the roto-translation T · R, to map a point located in (f, 0, 1) to the point at infinity (f, 0, 0) along the u axis, the following transformation G needs to be applied:

G = [1, 0, 0;  0, 1, 0;  −1/f, 0, 1]    (7.165)


Therefore, the homography transformation H_R for the right image is given by the combination of the three elementary transformations as follows:

H_R = G R T    (7.166)

which represents, to first order, a rigid transformation with respect to the image center. At this point, knowing the homography H_R, we need to find an optimal solution for the homography H_L such that the images rectified with these homographies are as similar as possible, with the least possible distortion. This is achieved by searching for the homography H_L that minimizes the difference between the rectified images, setting up a function that minimizes the sum of the squared distances between homologous points of the two images:

min_{H_L} Σ_i ‖H_L u_Li − H_R u_Ri‖²    (7.167)
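An illustrative sketch of the construction (7.163)-(7.166), assuming the right epipole e_R is given as a finite point in homogeneous pixel coordinates together with the image width and height (all names are chosen here):

import numpy as np

def rectifying_homography_right(e_R, width, height):
    # translation bringing the image center to the origin, Eq. (7.163)
    T = np.array([[1.0, 0.0, -width / 2.0],
                  [0.0, 1.0, -height / 2.0],
                  [0.0, 0.0, 1.0]])
    ex, ey, _ = T @ (e_R / e_R[2])                    # translated epipole
    alpha = 1.0 if ex >= 0 else -1.0
    n = np.hypot(ex, ey)
    # rotation placing the translated epipole on the horizontal axis, Eq. (7.164)
    R = np.array([[alpha * ex / n, alpha * ey / n, 0.0],
                  [-alpha * ey / n, alpha * ex / n, 0.0],
                  [0.0, 0.0, 1.0]])
    f = (R @ np.array([ex, ey, 1.0]))[0]              # abscissa of the rotated epipole
    # map (f, 0, 1) to the point at infinity (f, 0, 0), Eq. (7.165)
    G = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [-1.0 / f, 0.0, 1.0]])
    return G @ R @ T                                  # H_R = G R T, Eq. (7.166)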

Without giving the algebraic details described in [10], it is shown that the homography H_L can be expressed in the form:

H_L = H_A H_R M    (7.168)

assuming that the fundamental matrix F of the stereo pair of input images is known, which we express as

F = [e]_× M    (7.169)

while the H_A matrix is given by

H_A = [a_1, a_2, a_3;  0, 1, 0;  0, 0, 1]    (7.170)

where the generic vector a = (a_1, a_2, a_3) will be defined later. H_A, expressed by (7.170), represents the affine transformation component included in the compound transformation of H_L in (7.168). Now let us show what the matrix M represents. First of all, we highlight that by the properties of an antisymmetric matrix A we have A = A³ up to a scale factor. Since any vector e can be represented as a 3 × 3 antisymmetric matrix [e]_× up to a scale factor, as is the matrix F, we can apply these properties to (7.169) and obtain

F = [e]_× M = [e]_× [e]_× [e]_× M = [e]_× [e]_× F    (7.171)

from which we get

M = [e]_× F    (7.172)


We observe that if a multiple of the vector e is added to the columns of M, then (7.171) remains valid, again up to a scale factor. Therefore, the most general form to define M is as follows:

M = [e]_× F + e v^T    (7.173)

where v is a generic 3D vector. Normally, v is set equal to (1, 1, 1), with good results. Now it remains to define H_A, that is, the vector a introduced in (7.170), to estimate H_L given by (7.168). This is accomplished by considering that the initial goal was to minimize the function (7.167) by adequately finding H_L and H_R. We now know H_R (the homography matrix that maps the epipole e_R to the point at infinity (1, 0, 0)) and M, so we can write the transformations for homologous points of the two images in the form

û_Ri = H_R u_Ri    û_Li = H_R M u_Li

and the minimization problem then becomes

min_{H_A} Σ_i ‖H_A û_Li − û_Ri‖²    (7.174)

If the points are expressed in homogeneous coordinates like û_Li = (û_Li, v̂_Li, 1) and û_Ri = (û_Ri, v̂_Ri, 1), then the minimization function becomes

min_a Σ_i (a_1 û_Li + a_2 v̂_Li + a_3 − û_Ri)² + (v̂_Li − v̂_Ri)²    (7.175)

It is observed that v̂_Li − v̂_Ri is a constant value, so the minimization problem reduces further to the form

min_a Σ_i (a_1 û_Li + a_2 v̂_Li + a_3 − û_Ri)²    (7.176)

and finally the minimization problem can be set up as a simple least squares problem, solving a system of linear equations where the unknowns are the components of the vector a, given by

U a = b   ⟺   [û_L1, v̂_L1, 1;  . . . ;  û_Ln, v̂_Ln, 1] (a_1, a_2, a_3)^T = (û_R1, . . . , û_Rn)^T    (7.177)
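A sketch (names are illustrative) of this last step, following (7.169)-(7.177) with e taken as the right epipole and v = (1, 1, 1), and reusing the H_R of the previous sketch:

import numpy as np

def rectifying_homography_left(F, e_R, H_R, uL_h, uR_h):
    # uL_h, uR_h: (n, 3) homogeneous pixel coordinates of matched points
    skew_e = np.array([[0.0, -e_R[2], e_R[1]],
                       [e_R[2], 0.0, -e_R[0]],
                       [-e_R[1], e_R[0], 0.0]])
    M = skew_e @ F + np.outer(e_R, np.ones(3))     # Eq. (7.173) with v = (1, 1, 1)
    uhL = (H_R @ M @ uL_h.T).T
    uhR = (H_R @ uR_h.T).T
    uhL /= uhL[:, 2:3]                             # normalize the homogeneous scale
    uhR /= uhR[:, 2:3]
    # least squares system U a = b of Eq. (7.177)
    Umat = np.c_[uhL[:, 0], uhL[:, 1], np.ones(len(uhL))]
    b = uhR[:, 0]
    a, *_ = np.linalg.lstsq(Umat, b, rcond=None)
    H_A = np.array([[a[0], a[1], a[2]],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
    return H_A @ H_R @ M                           # H_L = H_A H_R M, Eq. (7.168)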

Once we have calculated the vector a with (7.177), we can calculate H_A with (7.170), estimate H_L with (7.168) and, with the other homography matrix H_R already calculated, rectify each pair of stereo images acquired, using the n correspondences. We summarize the whole procedure of the rectification process of stereo images, based on homography transformations, applied to a pair of images acquired by a


stereo system (in the noncanonical configuration) of which we know the epipolar geometry (the fundamental matrix), so that the epipolar lines of the input images are mapped horizontally in the rectified images. The essential steps are:

1. Find n ≥ 7 initial correspondences u_L ↔ u_R in the two stereo input images. We know that for the estimate of F it is better if n > 7.
2. Estimate the fundamental matrix F and find the epipoles e_L and e_R in the two images.
3. Calculate the homography transformation H_R which maps the epipole e_R to infinity in (1, 0, 0)^T.
4. Find the transformation matrix H_L that minimizes the function (7.167), that is, the sum of the squared distances of the transformed points.
5. Having found the best transformations H_L and H_R, rectify (geometric transformation with resampling) the respective left and right stereo images.

7.5.6.2 Calibrated Rectification

We now describe the (nonphysical) rectification process of stereo images in which the intrinsic parameters (the calibration matrix K) and the extrinsic parameters (the rotation matrix R and the translation vector T) of each camera can be estimated using the methods described in Sect. 7.4. In practice, if the cameras are fixed on a mobile turret with 3 degrees of freedom, it is possible to configure the canonical stereo system directly, thus acquiring the rectified images with an accuracy that depends on the accuracy of the cameras' attitude. In the general stereo configuration, with R and T known, it is necessary to apply geometric transformations to the stereo images to make the epipolar lines collinear and parallel to the horizontal axis of the images. In essence, these transformations rectify the stereo images by simulating a virtual stereo acquisition from a canonical stereo system through the rotation of the cameras with respect to their optical centers (see Fig. 7.12). The essential steps of this method, proposed in [20], are the following:

1. Calibrate the cameras to get K, R and T and derive the calibration parameters of the stereo system.
2. Compute the rotation matrix R_rect with which to rotate the left camera to map the left epipole e_L to infinity along the x-axis, thus making the epipolar lines horizontal.
3. Apply the same rotation to the right camera.
4. Calculate for each point of the left image the corresponding point in the new canonical stereo system.
5. Repeat the previous step for the right camera.
6. Complete the rectification of the stereo images by adjusting the scale and then resampling.



Fig. 7.12 Rectification of the stereo image planes knowing the extrinsic parameters of the cameras. The left camera is rotated so that the epipole moves to infinity along the horizontal axis. The same rotation is applied to the right camera, thus obtaining image planes parallel to the baseline. The horizontal alignment of the epipolar lines is completed by rotating the right camera according to R^{-1} and possibly adjusting the scale by resampling the rectified images

Step 1 computes the calibration parameters (intrinsic and extrinsic) of the individual cameras and of the stereo system. Normally the cameras are calibrated considering known 3D points, defined with respect to a world reference system. We indicate with P_w = (X_w, Y_w, Z_w) the coordinates in the world reference system, and with R_L, T_L and R_R, T_R the extrinsic parameters of the two cameras, respectively the rotation matrices and the translation column vectors. The relationships that project the point P_w in the image plane of the two cameras (according to the pinhole model), in the respective reference systems, are the following:

P_L = R_L P_w + T_L    (7.178)
P_R = R_R P_w + T_R    (7.179)

We assume that the two cameras have been independently calibrated with one of the methods described in Sect. 7.4, and therefore their intrinsic and extrinsic parameters are known. If T is the column vector representing the translation between the two optical centers (the origins of the cameras' reference systems) and R is the rotation matrix that orients the right camera axes to those of the left camera, then the coordinates of a 3D point P = (X, Y, Z) in space, indicated with P_L = (X_Lp, Y_Lp, Z_Lp) and P_R = (X_Rp, Y_Rp, Z_Rp) in the reference systems of the two cameras, are related to each other by the following:

P_L = R^T P_R + T    (7.180)


The extrinsic parameters of the stereo system are computed with Eqs. (7.98) and (7.99) (derived in Sect. 7.4.4), which we rewrite here:

R = R_L^T R_R    (7.181)
T = T_L − R^T T_R    (7.182)

In step 2, the rotation matrix R_rect is calculated for the left camera, with the purpose of mapping the relative epipole to infinity in the horizontal direction (x-axis) and obtaining horizontal epipolar lines. From the property of a rotation matrix we know that the column vectors represent the orientation of the rotated axes (see Note 1). Now let us see how to calculate the three vectors r_i of R_rect. The new x-axis must have the direction of the translation column vector T (the baseline vector joining the optical centers), given by the following unit vector:

r_1 = T/‖T‖ = (1/√(T_x² + T_y² + T_z²)) (T_x, T_y, T_z)^T    (7.183)

The second vector r_2 (the direction of the new y-axis) is only constrained to be orthogonal to r_1. Therefore it can be calculated as the normalized vector product between r_1 and the direction vector (0, 0, 1) of the old z-axis (the direction of the old optical axis), given by

r_2 = (r_1 × (0, 0, 1)^T)/‖r_1 × (0, 0, 1)^T‖ = (1/√(T_x² + T_y²)) (−T_y, T_x, 0)^T    (7.184)

The third vector r_3 represents the new z-axis, which must be orthogonal to the baseline (vector r_1) and to the new y-axis (vector r_2), so it is obtained as the vector product of these vectors:

r_3 = r_1 × r_2 = (1/√((T_x² + T_y²)(T_x² + T_y² + T_z²))) (−T_x T_z, −T_y T_z, T_x² + T_y²)^T    (7.185)

This results in the rotation matrix

R_rect = [r_1^T;  r_2^T;  r_3^T]    (7.186)

7.5 Stereo Vision and Epipolar Geometry

655

right camera axes to those of the left camera. Applying to both members Rr ect we have Rr ect P L = Rr ect R T P R + Rr ect T

(7.187)

from which it emerges that in fact the coordinates of the points of the image of the left and the right are rectified, by obtaining P L r = Rr ect P L

P Rr = Rr ect R T P R

(7.188)

having indicated with P L r and P Rr the rectified points, respectively, in the reference system of the left and right camera. The correction of the points, according to (7.188), is obtained considering that ⎡ T ⎤ ⎡ ⎤ r1 T T  Rr ect T = ⎣ r 2T T ⎦ = ⎣ 0 ⎦ (7.189) 0 r 3T T hence replacing in (7.187) we get ⎡

P L r = P Rr

⎤ T  +⎣ 0 ⎦ 0

(7.190)

(7.190) shows that the rectified points have the same coordinates Y and Z and differ only in the horizontal translation along the X -axis. Thus the steps 2 and 3 are made. The corresponding 2D points rectified in the left and right image planes are obtained instead from the following: pLr =

f P Lr ZL

p Rr =

f P Rr ZR

(7.191)

Thus steps 4 and 5 are realized. Finally, with step 6, to avoid empty areas in the rectified images, the inverse geometric transformation (see Sect. 3.2 Vol.II) is activated to associate in the rectified images the pixel value of the stereo input images and possibly resample if in the inverse transform the pixel position is between 4 pixels in the input image.

7.5.7 3D Stereo Reconstruction by Triangulation The 3D reconstruction of the scene can be realized in different ways, in relation to the knowledge available to the stereo acquisition system. The 3D geometry of the scene can be reconstructed, without ambiguity, given the 2D projections of the homologous points of the stereo images, by triangulation, known the calibration parameters (intrinsic and extrinsic) of the stereo system. If instead only the intrinsic parame-


ters are known the 3D geometry of the scene can be reconstructed by estimating the extrinsic parameters of the system to less than a not determinable scale factor. If the calibration parameters of the stereo system are not available but only the correspondences between the stereo images are known, the 3D structure of the scene is recovered through an unknown homography transformation.

7.5.7.1 3D Reconstruction Known the Intrinsic and Extrinsic Parameters Returning to the reconstruction by triangulation with the stereo system [20], this is directed knowing the calibration parameters, the correspondences of the homologous points p L and p R in the image planes, and the linear equations of homologous rays passing, respectively, for the points p L and optical center C L related to the left camera, and for the points p R and optical center C R related to the right camera. The estimate of the coordinates of a 3D point P = (X, Y, Z ) is obtained precisely by triangulation of the two rays which in ideal conditions intersect at the point P. In reality, the errors on the estimation of the intrinsic parameters and on the determination of the position of the projections of P in the stereo image planes cause no intersection of the rays even if their minimum distance is around P as shown in Fig. 7.13a, where PˆL and PˆR represent the ends of the segment of the minimum distance between the rays, to be determined. Therefore, it is necessary to obtain an estimate Pˆ of P as the midpoint of the segment of a minimum distance between homologous rays. It can be guessed that using multiple cameras having a triangulation with more rays would improve the estimate of P by calculating the minimum distance in the sense of least squares (the sum of the squared distances is zero if the rays are incident in P). Let us denote by l L and l R the nonideal rays, respectively, of the left and right camera, passing through their optical centers C L and C R and its projections p L and (a)

(b) TpR

PR =

L

L

yl

L

Ra

Ra

cL

R

uL cR

yl

yl

R

Distances to be minimized

pL pR

Ra

yl Ra

pL cL

P

P

P

PL=

pR

uR cR

Fig. 7.13 3D reconstruction with stereovision. a Triangulation by not-exact intersection of the rays l L and l R retroprojected for the 3D reconstruction of the point P; b Triangulation through the approach that minimizes the error of the observed projections p L and p R of the points P of the scene with respect to the projections u L and u R calculated with Eqs. (7.196) and (7.197) with the pinhole projection model defined by the known projection matrices of the cameras

7.5 Stereo Vision and Epipolar Geometry

657

p R in the image plane. Furthermore, we have evidence that there is only one segment of minimum length indicated with the column vector v, which is perpendicular to both rays joining them via the intersection points indicated with PˆL (extreme of the segment obtained from the 3D intersection between ray l L and segment) and PˆR (extreme of the segment obtained from the 3D intersection between radius l R and segment) as shown in the figure. The problem is then reduced to finding the coordinates of the extreme points PˆL and PˆR of the segment. We now express in the vector form a p L and b p R where a, b ∈ R, the equations of the two rays, in the respective reference systems, passing through the optical centers C L and C R , respectively. The extremes of the segment to be found are expressed with respect to the reference system of the left camera with origin in C L whereby according to (7.94) the equation of the right ray expressed with respect to the reference system of the left camera is R T b p R + T remembering that R and T represent the extrinsic parameters of the stereo system defined from Eqs. (7.98) and (7.99), respectively. The constraint that the segment, represented by the equation cv with c ∈ R, is orthogonal to the two rays defines the vector v obtained as a vector product of the two vectors/rays given by v = pL × RT p R

(7.192)

where $v$ too is expressed in the reference system of the left camera. At this point, $a\,p_L + c\,v$ represents, as $a$ and $c$ vary, the plane passing through the ray $l_L$ and parallel to the direction $v$; for this construction to reach the ray $l_R$ it must be

$$
a\,p_L + c\,v = T + b\,R^T p_R
\tag{7.193}
$$

for certain values of the unknown scalars $a$, $b$, and $c$, which can be determined by observing that the vector equation (7.193) is a linear system of 3 equations (the vectors are three-dimensional) in 3 unknowns. In fact, replacing the vector $v$ given by (7.192), we can solve the following system:

$$
a\,p_L + c\,\left(p_L \times R^T p_R\right) - b\,R^T p_R = T
\tag{7.194}
$$

If $a_0$, $b_0$, and $c_0$ are the solution of the system, then the intersection between the ray $l_L$ and the segment gives one endpoint of the segment, $\hat{P}_L = a_0\,p_L$, while the other endpoint is obtained from the intersection of the segment with the ray $l_R$, given by $\hat{P}_R = T + b_0\,R^T p_R$; the midpoint between the two endpoints finally identifies the estimate $\hat{P}$ reconstructed in 3D, with coordinates expressed in the reference system of the left camera.
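The midpoint procedure of Eqs. (7.192)–(7.194) can be sketched in a few lines of NumPy. The function below is an illustrative implementation under the stated conventions (image points in homogeneous normalized camera coordinates, extrinsics $R$, $T$ of the stereo pair), not code from the book.

```python
import numpy as np

def midpoint_triangulation(p_L, p_R, R, T):
    """Estimate a 3D point as the midpoint of the minimum-distance segment
    between the two back-projected rays (Eqs. 7.192-7.194).

    p_L, p_R : homogeneous image points (3,) in the two camera frames
    R, T     : extrinsic parameters of the stereo pair (right w.r.t. left)
    Returns the estimated 3D point in the left-camera reference system.
    """
    p_L = np.asarray(p_L, dtype=float)
    p_R = np.asarray(p_R, dtype=float)
    RT_pR = R.T @ p_R                 # right ray direction in the left frame
    v = np.cross(p_L, RT_pR)          # direction orthogonal to both rays (7.192)
    # Linear system a*p_L - b*R^T p_R + c*v = T in the unknowns (a, b, c), Eq. (7.194)
    A = np.column_stack((p_L, -RT_pR, v))
    a0, b0, c0 = np.linalg.solve(A, np.asarray(T, dtype=float))
    P_L_hat = a0 * p_L                # endpoint on the left ray
    P_R_hat = T + b0 * RT_pR          # endpoint on the right ray
    return 0.5 * (P_L_hat + P_R_hat)  # midpoint estimate of P
```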


7.5.7.2 3D Reconstruction Known Intrinsic and Extrinsic Parameters with Linear Triangulation

An alternative method of reconstruction is based on simple linear triangulation, which resolves the problem of the nonintersecting back-projected rays by minimizing the estimated backprojection error directly in the image planes (see Fig. 7.13b). Given the projection matrices $\mathcal{P}_L = K_L[I\,|\,0]$ and $\mathcal{P}_R = K_R[R\,|\,T]$ of the two cameras, the projections $p_L = (x_L, y_L, 1)$ and $p_R = (x_R, y_R, 1)$ of a point $P$ of the 3D space with homogeneous coordinates $X = (X, Y, Z, 1)$, in the respective image planes, are

$$
p_L = \mathcal{P}_L X = \begin{bmatrix} \mathcal{P}_{L_1}^T X \\ \mathcal{P}_{L_2}^T X \\ \mathcal{P}_{L_3}^T X \end{bmatrix}
\qquad
p_R = \mathcal{P}_R X = \begin{bmatrix} \mathcal{P}_{R_1}^T X \\ \mathcal{P}_{R_2}^T X \\ \mathcal{P}_{R_3}^T X \end{bmatrix}
\tag{7.195}
$$

where $\mathcal{P}_{L_i}$ and $\mathcal{P}_{R_i}$ indicate the rows of the two perspective projection matrices, respectively. The perspective projections in Cartesian coordinates $u_L = (u_L, v_L)$ and $u_R = (u_R, v_R)$ are

$$
u_L = \frac{\mathcal{P}_{L_1}^T X}{\mathcal{P}_{L_3}^T X} \qquad v_L = \frac{\mathcal{P}_{L_2}^T X}{\mathcal{P}_{L_3}^T X}
\tag{7.196}
$$

$$
u_R = \frac{\mathcal{P}_{R_1}^T X}{\mathcal{P}_{R_3}^T X} \qquad v_R = \frac{\mathcal{P}_{R_2}^T X}{\mathcal{P}_{R_3}^T X}
\tag{7.197}
$$

From Eq. (7.196), we can derive two linear equations¹²:

$$
\left(u_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_1}^T\right) X = 0
\qquad
\left(v_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_2}^T\right) X = 0
\tag{7.198}
$$

Putting them in matrix form, we have

$$
\begin{bmatrix} u_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_1}^T \\ v_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_2}^T \end{bmatrix} X = 0_{2\times 1}
\tag{7.199}
$$

¹² The same equations can be obtained, for each camera, by considering the property of the cross product $p \times (\mathcal{P}X) = 0$, that is, by imposing that the two vectors have parallel directions. Once the cross product has been expanded, three equations are obtained, but only two of them are linearly independent.


Proceeding in the same way for the homologous point $u_R$, from (7.197) we get two other linear equations that can be assembled with (7.199), thus obtaining a homogeneous linear system of 4 equations:

$$
\begin{bmatrix}
u_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_1}^T \\
v_L\,\mathcal{P}_{L_3}^T - \mathcal{P}_{L_2}^T \\
u_R\,\mathcal{P}_{R_3}^T - \mathcal{P}_{R_1}^T \\
v_R\,\mathcal{P}_{R_3}^T - \mathcal{P}_{R_2}^T
\end{bmatrix} X = 0_{4\times 1}
\quad\Longleftrightarrow\quad
A_{4\times 4}\, X_{4\times 1} = 0_{4\times 1}
\tag{7.200}
$$

where it is observed that each pair of homologous points determines the point $P$ in 3D space with coordinates $X = (X, Y, Z, W)$, including the fourth unknown homogeneous component. Considering the noise present in the localization of homologous points, the solution of the system is found with the SVD method, which estimates the best solution in the least-squares sense. With this method, the 3D estimate of $P$ can be improved by adding further observations, that is, with $N > 2$ cameras. In this case, two equations of the type (7.198) would be added to the matrix $A$ for each camera, thus obtaining a homogeneous system of $2N$ equations, always in 4 unknowns, with the matrix $A$ of size $2N \times 4$. Recall that the reconstruction of $P$ based on this linear method minimizes an algebraic error without geometric meaning. To better filter the noise present in the correspondences and in the perspective projection matrices, the optimal estimate can be obtained by setting up a nonlinear minimization (in the sense of maximum likelihood estimation) as follows:

$$
\min_{\hat{X}} \; \| \mathcal{P}_L \hat{X} - u_L \|^2 + \| \mathcal{P}_R \hat{X} - u_R \|^2
\tag{7.201}
$$

where $\hat{X}$ represents the best estimate of the 3D coordinates of the point $P$. In essence, $\hat{X}$ is the best least-squares estimate of the backprojection error of $P$ in both images, seen as the distance in the image plane between its projection (given for the respective cameras by Eqs. 7.196 and 7.197) and the related observed measurement of $P$, also in the image plane (see Fig. 7.13b). In the function (7.201) the backprojection error for the point $P$ is accumulated for both cameras; in the case of $N$ cameras the errors are summed and the function to be minimized is

$$
\min_{\hat{X}} \; \sum_{i=1}^{N} \| \mathcal{P}_i \hat{X} - u_i \|^2
\tag{7.202}
$$

which can be solved with iterative methods (for example, Gauss–Seidel, Jacobi, ...) for nonlinear least-squares approximation.
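As an illustration, here is a minimal NumPy sketch of the linear (DLT) triangulation of Eq. (7.200), solved with the SVD as described above; the interface is illustrative, not the book's.

```python
import numpy as np

def linear_triangulation(u_L, u_R, P_L, P_R):
    """Linear triangulation of one point via Eq. (7.200).

    u_L, u_R : observed Cartesian projections (u, v) in the two images
    P_L, P_R : 3x4 perspective projection matrices of the two cameras
    Returns the 3D point (X, Y, Z) in Euclidean coordinates.
    """
    A = np.vstack([
        u_L[0] * P_L[2] - P_L[0],   # u_L * P_L3^T - P_L1^T
        u_L[1] * P_L[2] - P_L[1],   # v_L * P_L3^T - P_L2^T
        u_R[0] * P_R[2] - P_R[0],   # u_R * P_R3^T - P_R1^T
        u_R[1] * P_R[2] - P_R[1],   # v_R * P_R3^T - P_R2^T
    ])
    # Least-squares solution of A X = 0: the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]             # back to Euclidean coordinates
```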

7.5.7.3 3D Reconstruction with Only the Intrinsic Parameters Known

In this case, for the stereo system with projection matrices $\mathcal{P}_L = K_L[I\,|\,0]$ and $\mathcal{P}_R = K_R[R\,|\,T]$, we know a set of homologous points and only the intrinsic parameters $K_L$ and $K_R$ of the stereo cameras. The 3D reconstruction of the scene is obtained up to an unknown scale factor because the camera setup (the cameras' attitude) is not known. In particular, not knowing the baseline (the translation vector $T$) of the stereo system, it is not possible to reconstruct the 3D scene at the real scale: the reconstruction is unique but only up to an unknown scale factor. Given at least 8 corresponding points, it is possible to compute the fundamental matrix $F$, and once the calibration matrices $K$ are known it is possible to compute the essential matrix $E$ (alternatively, $E$ can be computed directly with 7.106), which we know encodes the extrinsic parameters, that is, the rotation matrix $R$ and the translation vector $T$. $R$ and $T$ are precisely the unknowns we want to compute in order to then perform the 3D reconstruction by triangulation. The essential steps of the 3D reconstruction process, given the intrinsic parameters of the stereo cameras and a set of homologous points, are the following:

1. Detect a set of corresponding points (at least 8).
2. Estimate the fundamental matrix $F$ with the normalized 8-point algorithm (see Sect. 7.5.3).
3. Compute the essential matrix $E$ from the fundamental matrix $F$, given the intrinsic parameter matrices $K_L$ and $K_R$.
4. Estimate the extrinsic parameters, that is, the rotation matrix $R$ and the translation vector $T$ of the stereo system, by decomposing $E$ as described in Sect. 7.5.5.
5. Reconstruct the position of the 3D points by triangulation, appropriately selecting $R$ and $T$ among the possible solutions.

In this context only step 4 is analyzed, while the others are immediate since they have already been treated previously. From Sect. 7.5.5, we know that the essential matrix $E = [T]_{\times} R$ can be factored with the SVD method, obtaining $E = U \Sigma V^T$, where by definition the essential matrix has rank 2 and must admit two equal singular values and a third equal to zero, so that $\Sigma = \mathrm{diag}(1, 1, 0)$. We also know, from Eqs. (7.142) and (7.143), the existence of the rotation matrix $W$ and the antisymmetric matrix $Z$ such that their product is $Z W = \mathrm{diag}(1,1,0) = \Sigma$, producing the following result:

$$
E = U \Sigma V^T = U (Z W) V^T = \underbrace{U Z U^T}_{[T]_{\times}} \; \underbrace{U W V^T}_{R} = [T]_{\times} R
\tag{7.203}
$$

where the penultimate step is motivated by Eq. (7.140). The orthogonality of the obtained rotation matrix and the definition of the essential matrix are thus satisfied. We know, however, that the decomposition is not unique: $E$ is defined up to a scale factor $\lambda$ and the translation vector up to the sign. In fact, the decomposition leads to 4 possible solutions for $R$ and $T$, and consequently we have 4 possible projection matrices $\mathcal{P}_R = K_R [R\,|\,T]$ of the stereo system for the right camera, given by Eq. (7.153), which we rewrite as follows:

$$
[U W V^T \,|\, \lambda u_3] \qquad [U W V^T \,|\, -\lambda u_3] \qquad [U W^T V^T \,|\, \lambda u_3] \qquad [U W^T V^T \,|\, -\lambda u_3]
\tag{7.204}
$$


where, according to Eq. (7.151), $u_3 = T$ corresponds to the third column of $U$. Having obtained 4 potential pairs $(R, T)$, there are 4 possible configurations of the stereo system, obtained by rotating the camera in one direction or in the opposite one, and by translating it in two opposite directions, as shown in Fig. 7.10. The appropriate pair is chosen by triangulating the 3D points and selecting the configuration for which the points lie in front of the stereo system (in the direction of the positive $z$-axis). In particular, each correspondence pair is back-projected to identify the 3D point and determine its depth with respect to both cameras, choosing the solution for which the depth is positive for both.
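A possible sketch of step 4 and of the depth test just described is given below. It assumes the standard form of the matrix $W$ in Eq. (7.142) and the convention $P_R = R(P_L - T)$ used in this chapter; `triangulate` stands for any triangulation routine, for example the midpoint sketch given earlier. This is an illustrative implementation, not the book's code.

```python
import numpy as np

def decompose_essential(E):
    """Enumerate the four candidate (R, t) pairs of Eq. (7.204) from an
    essential matrix; the scale of t remains unknown."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])               # assumed form of W, Eq. (7.142)
    u3 = U[:, 2]                                # t up to scale and sign
    Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
    Ra = -Ra if np.linalg.det(Ra) < 0 else Ra   # enforce proper rotations
    Rb = -Rb if np.linalg.det(Rb) < 0 else Rb
    return [(Ra, u3), (Ra, -u3), (Rb, u3), (Rb, -u3)]

def pick_solution(candidates, triangulate, p_L, p_R):
    """Depth (cheirality) test: keep the pair for which a triangulated test
    point has positive depth in both cameras (convention P_R = R (P_L - T))."""
    for R, t in candidates:
        P = triangulate(p_L, p_R, R, t)         # 3D point in the left frame
        if P[2] > 0 and (R @ (P - t))[2] > 0:
            return R, t
    return None
```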

7.5.7.4 3D Reconstruction with Known Only Intrinsic Parameters and Normalizing the Essential Matrix

As in the previous paragraph, the essential matrix $E$ is computed up to an unknown scale factor. A normalization procedure [20] of $E$ is considered in order to normalize the length of the translation vector $T$ to unity. From Eq. (7.104) of the essential matrix, $E = [T]_{\times} R = S R$, we have

$$
E^T E = (S R)^T S R = S^T R^T R S = S^T S
\tag{7.205}
$$

where with $S$ we have indicated the antisymmetric matrix associated with the translation vector $T$, defined by (7.103). Expanding the antisymmetric matrix in (7.205), we have

$$
E^T E = \begin{bmatrix}
T_y^2 + T_z^2 & -T_x T_y & -T_x T_z \\
-T_y T_x & T_z^2 + T_x^2 & -T_y T_z \\
-T_z T_x & -T_z T_y & T_x^2 + T_y^2
\end{bmatrix}
\tag{7.206}
$$

which shows that the trace of $E^T E$ is given by

$$
Tr(E^T E) = 2\,\|T\|^2
\tag{7.207}
$$

To normalize the translation vector to unit length, the essential matrix is normalized as follows:

$$
\hat{E} = \frac{E}{\sqrt{Tr(E^T E)/2}}
\tag{7.208}
$$

while the normalized translation vector is given by

$$
\hat{T} = \frac{T}{\|T\|} = \frac{[T_x\;\; T_y\;\; T_z]^T}{\sqrt{T_x^2 + T_y^2 + T_z^2}} = \left[\hat{T}_x\;\; \hat{T}_y\;\; \hat{T}_z\right]^T
\tag{7.209}
$$


According to the normalization defined by (7.208) and (7.209), the matrix (7.206) is rewritten as follows:

$$
\hat{E}^T \hat{E} = \begin{bmatrix}
1 - \hat{T}_x^2 & -\hat{T}_x \hat{T}_y & -\hat{T}_x \hat{T}_z \\
-\hat{T}_y \hat{T}_x & 1 - \hat{T}_y^2 & -\hat{T}_y \hat{T}_z \\
-\hat{T}_z \hat{T}_x & -\hat{T}_z \hat{T}_y & 1 - \hat{T}_z^2
\end{bmatrix}
\tag{7.210}
$$

At this point, the components of the vector $\hat{T}$ can be derived from any row or column of the matrix $\hat{E}^T \hat{E}$ given by (7.210). Indeed, indicating it for simplicity with $\bar{E} = \hat{E}^T \hat{E}$, the components of the translation vector $\hat{T}$ are derived from the following:

$$
\hat{T}_x = \pm\sqrt{1 - \bar{E}_{11}} \qquad \hat{T}_y = -\frac{\bar{E}_{12}}{\hat{T}_x} \qquad \hat{T}_z = -\frac{\bar{E}_{13}}{\hat{T}_x}
\tag{7.211}
$$

Due to the quadratic dependence of the entries of $\bar{E}$ on the components of $\hat{T}$, the recovered components may differ from the true ones in sign. The rotation matrix $R$ can be computed knowing the normalized essential matrix $\hat{E}$ and the normalized vector $\hat{T}$, albeit with the ambiguity in sign. For this purpose the following 3D vectors are defined:

$$
w_i = \hat{E}_i \times \hat{T}
\tag{7.212}
$$

where $\hat{E}_i$ indicates the $i$-th of the three rows of the normalized essential matrix. From these vectors $w_i$, through simple algebraic calculations, the rows of the rotation matrix are computed as

$$
R = \begin{bmatrix} R_1^T \\ R_2^T \\ R_3^T \end{bmatrix}
  = \begin{bmatrix} (w_1 + w_2 \times w_3)^T \\ (w_2 + w_3 \times w_1)^T \\ (w_3 + w_1 \times w_2)^T \end{bmatrix}
\tag{7.213}
$$

Due to the double ambiguity in the sign of $\hat{E}$ and $\hat{T}$, we have 4 different pairs of possible solutions for $(\hat{T}, R)$. In analogy with the previous paragraph, the appropriate pair is chosen through the 3D reconstruction starting from the projections, which resolves the ambiguity. In fact, for each 3D point, the third component is computed in the reference system of the left camera considering the 4 possible pairs of solutions $(\hat{T}, R)$. The relation that, for a point $P$ of 3D space, links the coordinates $P_L = (X_L, Y_L, Z_L)$ and $P_R = (X_R, Y_R, Z_R)$ between the reference systems of the stereo cameras is given by (7.93), that is, $P_R = R(P_L - T)$, referring $P$ to the left camera, from which we can derive the third component $Z_R$:

$$
Z_R = R_3^T (P_L - \hat{T})
\tag{7.214}
$$


and from the relation (6.208), which links the point $P$ and its projection in the right image, we have

$$
p_R = \frac{f_R}{Z_R}\, P_R = \frac{f_R\, R(P_L - \hat{T})}{R_3^T (P_L - \hat{T})}
\tag{7.215}
$$

from which we derive the first component of $p_R$, given by

$$
x_R = \frac{f_R\, R_1^T (P_L - \hat{T})}{R_3^T (P_L - \hat{T})}
\tag{7.216}
$$

In analogy with (7.215), we have the equation that links the coordinates of $P$ to its projection in the left image plane:

$$
p_L = \frac{f_L}{Z_L}\, P_L
\tag{7.217}
$$

Replacing (7.217) in (7.216) and solving with respect to $Z_L$, we get

$$
Z_L = f_L\, \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T p_L}
\tag{7.218}
$$

From (7.217) we get $P_L$, and considering (7.218) we finally obtain the 3D coordinates of $P$ in the reference systems of the two cameras:

$$
P_L = \frac{Z_L}{f_L}\, p_L = \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T p_L}\; p_L
\qquad
P_R = R(P_L - \hat{T})
\tag{7.219}
$$

Therefore, being able to compute, for each point to reconstruct, the depth coordinates $Z_L$ and $Z_R$ for both cameras, it is possible to choose the appropriate pair $(R, \hat{T})$, that is, the one for which both depths are positive, because the scene to be reconstructed is in front of the stereo system. Let us summarize the essential steps of the algorithm (a code sketch is given after this list):

1. Given the correspondences of homologous points, estimate the essential matrix $E$.
2. Compute the normalized translation vector $\hat{T}$ with (7.211).
3. Compute the rotation matrix $R$ with Eqs. (7.212) and (7.213).
4. Compute the depths $Z_L$ and $Z_R$ for each point $P$ with Eqs. (7.217)–(7.219).
5. Examine the sign of the depths $Z_L$ and $Z_R$ of the reconstructed points:
   a. If both are negative for some point, change the sign of $\hat{T}$ and go back to step 4.
   b. Otherwise, if one is negative and the other positive for some point, change the sign of each element of the matrix $\hat{E}$ and go back to step 3.
   c. Otherwise, if both depths are positive for all reconstructed points, terminate.

Recall that the 3D points of the scene are reconstructed up to an unknown scale factor.
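Steps 2 and 3 may be sketched as follows, assuming the positive-sign choice for $\hat{T}_x$ in (7.211) with $\bar{E}_{11} \neq 1$ (the sign checks of step 5 are left to the caller); this is an illustrative implementation, not the book's code.

```python
import numpy as np

def rotation_translation_from_E(E):
    """Recover (R, T_hat) from an essential matrix via the normalization of
    Eqs. (7.208)-(7.213); signs may still need the depth test of step 5."""
    E_hat = E / np.sqrt(np.trace(E.T @ E) / 2.0)        # Eq. (7.208)
    Ebar = E_hat.T @ E_hat                              # Eq. (7.210)
    Tx = np.sqrt(1.0 - Ebar[0, 0])                      # Eq. (7.211), assumes Tx != 0
    T_hat = np.array([Tx, -Ebar[0, 1] / Tx, -Ebar[0, 2] / Tx])
    w = [np.cross(E_hat[i], T_hat) for i in range(3)]   # Eq. (7.212)
    R = np.vstack([w[0] + np.cross(w[1], w[2]),         # Eq. (7.213)
                   w[1] + np.cross(w[2], w[0]),
                   w[2] + np.cross(w[0], w[1])])
    return R, T_hat
```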


7.5.7.5 3D Reconstruction with Known Only the Correspondences of Homologous Points

In this case, we have only $N \ge 8$ correspondences and a completely uncalibrated stereo system, without knowledge of the intrinsic and extrinsic parameters. In 1992, three groups of researchers [21–23] independently dealt with the problem of 3D reconstruction starting from uncalibrated cameras, and all three works were based on projective geometry. The proposed solutions reconstruct the scene not unambiguously but up to a projective transformation of the scene itself. The fundamental matrix $F$ can be estimated from the $N$ correspondences of the stereo system, together with the location of the epipoles $e_L$ and $e_R$. The matrix $F$ does not depend on the choice of the 3D reference system of the world, while it is known that this dependence exists for the projection matrices $\mathcal{P}_L$ and $\mathcal{P}_R$ of the stereo cameras. For example, if the world coordinates are rotated, the camera projection matrix changes while the fundamental matrix remains unchanged. In particular, if $H$ is a projective transformation matrix in 3D space, then the fundamental matrices associated with the pairs of projection matrices $(\mathcal{P}_L, \mathcal{P}_R)$ and $(\mathcal{P}_L H, \mathcal{P}_R H)$ are the same (we recall from Sect. 7.5.2.1 that the relation between fundamental and homography matrix is given by $F = [e_R]_{\times} H$). It follows that, although a pair of projection matrices $(\mathcal{P}_L, \mathcal{P}_R)$ of the cameras univocally determines a fundamental matrix $F$, the converse does not hold. Therefore, the camera matrices are defined only up to a projective transformation with respect to the fundamental matrix. This ambiguity can be controlled by choosing appropriate projection matrices consistent with the fundamental matrix $F$, such that

$$
\mathcal{P}_L = [I\,|\,0] \qquad \text{and} \qquad \mathcal{P}_R = [\,[e_R]_{\times} F \,|\, e_R\,]
$$

as described in [10]. Then, with these matrices, the 3D points are triangulated by back-projecting the corresponding projections. In summary, it is shown that in the context of uncalibrated cameras the ambiguity in the reconstruction is attributable only to an arbitrary projective transformation. In particular, given a set of correspondences for a stereo system, the fundamental matrix is uniquely determined, the camera matrices are then estimated, and the scene can be reconstructed with only these correspondences. It should be noted, however, that any two reconstructions from these correspondences are equivalent from the projective point of view, that is, the reconstruction is not unique but defined up to a projective transformation (see Fig. 7.14). The ambiguity of 3D reconstruction from uncalibrated cameras is formalized by the following projective reconstruction theorem [10]:

Theorem 7.3 Let $p_{L_i} \leftrightarrow p_{R_i}$ be the correspondences of homologous points in the stereo images and let $F$ be the uniquely determined fundamental matrix that satisfies the relation $p_{R_i}^T F\, p_{L_i} = 0 \;\; \forall i$. Let $(\mathcal{P}_L^{(1)}, \mathcal{P}_R^{(1)}, \{P_i^{(1)}\})$ and $(\mathcal{P}_L^{(2)}, \mathcal{P}_R^{(2)}, \{P_i^{(2)}\})$ be two possible reconstructions associated with the correspondences $p_{L_i} \leftrightarrow p_{R_i}$.

Fig. 7.14 Ambiguous 3D reconstruction from a non-calibrated stereo system with only the projections of the homologous points known (panels: original 3D object observed; ambiguous projective reconstruction; object reconstructed with non-calibrated stereovision). Although the structure of the scene emerges, the 3D reconstruction is recovered only up to an unknown projective transformation

Then, there exists a nonsingular matrix $H_{4\times 4}$ such that

$$
\mathcal{P}_L^{(2)} = \mathcal{P}_L^{(1)} H^{-1} \qquad \mathcal{P}_R^{(2)} = \mathcal{P}_R^{(1)} H^{-1} \qquad \text{and} \qquad P_i^{(2)} = H\, P_i^{(1)}
$$

for all $i$, except for those $i$ such that $F p_{L_i} = p_{R_i}^T F = 0$ (i.e., points coincident with the epipoles of the stereo images). In essence, the 3D points $P_i$ obtained by triangulation are reconstructed up to an unknown projective transformation $H_{4\times 4}$. In fact, if the reconstructed 3D points are transformed with a projective matrix $H$, they become

$$
P_i' = H_{4\times 4}\, P_i
\tag{7.220}
$$

with the associated projection matrices of the stereo cameras

$$
\mathcal{P}_L' = \mathcal{P}_L H^{-1} \qquad \mathcal{P}_R' = \mathcal{P}_R H^{-1}
\tag{7.221}
$$

but the original projected points $p_{L_i} \leftrightarrow p_{R_i}$ remain the same (together with $F$), as verified by

$$
p_{L_i} = \mathcal{P}_L P_i = \mathcal{P}_L H^{-1} H P_i = \mathcal{P}_L' P_i' \qquad
p_{R_i} = \mathcal{P}_R P_i = \mathcal{P}_R H^{-1} H P_i = \mathcal{P}_R' P_i'
\tag{7.222}
$$

assuming $H$ is an invertible matrix and the fundamental matrix is uniquely determined. The ambiguity can be reduced if additional information is available on the 3D scene to be reconstructed or on the stereo system. For example, with 3D fiducial points available (at least 5 points), the ambiguity introduced by the projective transformation can be eliminated, obtaining a reconstruction associated with a real metric. In [20,24], two 3D reconstruction approaches with uncalibrated cameras are presented. Both approaches exploit the fact that the reconstruction is not unique, so that a basic projective transformation can be chosen arbitrarily. This is defined by choosing only 5 of the $N$ 3D points of the scene (of which 4 must not be coplanar), used to define a basic projective transformation. The first approach [20], starting from the basic projective transformation, finds the projection matrices (given the epipoles) with algebraic methods, while the second approach [24] uses a geometric method based on epipolar geometry to select the reference points in the image planes.
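As an illustration of the canonical choice $\mathcal{P}_L = [I\,|\,0]$, $\mathcal{P}_R = [\,[e_R]_{\times}F \,|\, e_R\,]$ mentioned above, the following sketch builds a pair of camera matrices from an estimated $F$; triangulating the correspondences with these matrices (for example, with the linear method of Sect. 7.5.7.2) yields a reconstruction defined only up to a projective transformation. The function name and interface are illustrative.

```python
import numpy as np

def canonical_cameras_from_F(F):
    """Build a pair of projection matrices consistent with a fundamental
    matrix F: P_L = [I | 0], P_R = [[e_R]_x F | e_R] (projective frame only)."""
    # Right epipole: null vector of F^T (it satisfies e_R^T F = 0).
    _, _, Vt = np.linalg.svd(F.T)
    e_R = Vt[-1]
    e_cross = np.array([[0., -e_R[2], e_R[1]],
                        [e_R[2], 0., -e_R[0]],
                        [-e_R[1], e_R[0], 0.]])   # antisymmetric matrix [e_R]_x
    P_L = np.hstack([np.eye(3), np.zeros((3, 1))])
    P_R = np.hstack([e_cross @ F, e_R.reshape(3, 1)])
    return P_L, P_R
```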

References

1. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput. Vis. 8(2), 123–151 (1992)
2. B. Caprile, V. Torre, Using vanishing points for camera calibration. Int. J. Comput. Vis. 4(2), 127–140 (1990)
3. R.Y. Tsai, A versatile camera calibration technique for 3D machine vision. IEEE J. Robot. Autom. 4, 323–344 (1987)
4. J. Heikkila, O. Silvén, A four-step camera calibration procedure with implicit image correction, in IEEE Proceedings of Computer Vision and Pattern Recognition (1997), pp. 1106–1112
5. O.D. Faugeras, G. Toscani, Camera calibration for 3D computer vision, in International Workshop on Machine Vision and Machine Intelligence (1987), pp. 240–247
6. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
7. R.K. Lenz, R.Y. Tsai, Techniques for calibration of the scale factor and image center for high accuracy 3-D machine vision metrology. IEEE Trans. Pattern Anal. Mach. Intell. 10(5), 713–720 (1988)
8. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, 1996). ISBN 978-0-8018-5414-9
9. Z. Zhang, A flexible new technique for camera calibration. Technical Report MSR-TR-98-71 (Microsoft Research, 1998)
10. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge, 2003)
11. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cambridge, Massachusetts, 1996)
12. J. Vince, Matrix Transforms for Computer Games and Animation (Springer, 2012)
13. H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
14. R.I. Hartley, In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 19(6), 580–593 (1997)
15. Q.-T. Luong, O. Faugeras, The fundamental matrix: theory, algorithms, and stability analysis. Int. J. Comput. Vis. 1(17), 43–76 (1996)
16. D. Nistér, An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–777 (2004)
17. O. Faugeras, S. Maybank, Motion from point matches: multiplicity of solutions. Int. J. Comput. Vis. 4, 225–246 (1990)
18. T.S. Huang, O.D. Faugeras, Some properties of the E matrix in two-view motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 11(12), 1310–1312 (1989)
19. C. Loop, Z. Zhang, Computing rectifying homographies for stereo vision, in IEEE Conference on Computer Vision and Pattern Recognition (1999), vol. 1, pp. 125–131
20. E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision (Prentice Hall, 1998)
21. R. Mohr, L. Quan, F. Veillon, B. Boufama, Relative 3D reconstruction using multiple uncalibrated images. Technical Report RT 84-I-IMAG LIFIA 12, Lifia-Irimag (1992)
22. O.D. Faugeras, What can be seen in three dimensions with an uncalibrated stereo rig, in ECCV European Conference on Computer Vision (1992), pp. 563–578
23. R. Hartley, R. Gupta, T. Chang, Stereo from uncalibrated cameras, in IEEE CVPR Computer Vision and Pattern Recognition (1992), pp. 761–764
24. R. Mohr, L. Quan, F. Veillon, Relative 3D reconstruction using multiple uncalibrated images. Int. J. Robot. Res. 14(6), 619–632 (1995)

Index

Symbols 2.5D Sketch map, 342 3D representation object centered, 344 viewer centered, 344 3D stereo reconstruction by linear triangulation, 658 by triangulation, 656 knowing intrinsic parameters & Essential matrix, 661 knowing only correspondences of homologous points, 664 knowing only intrinsic parameters, 659 3D world coordinates, 605 A active cell, 324 Airy pattern, 466 albedo, 416 aliasing, 490, 491 alignment edge, 340 image, 533, 534, 646 pattern, 180 ambiguous 3D reconstruction, 665 angular disparity, 355 anti-aliasing, 490, 491 aperture problem, 483, 484, 498, 499, 514, 515 artificial vision, 316, 348, 393 aspect ratio, 599, 606, 608, 609 associative area, 369 associative memory, 229 autocorrelation function, 276, 282

B background modeling based on eigenspace, 564, 565 based on KDE, 563, 564 BS based on GMM, 561, 562 BS with mean/median background, 558, 559 BS with moving average background, 559, 560 BS with moving Gaussian average, 559, 560 BS-Background Subtraction, 557, 558 non-parametric, 566, 567 parametric, 565, 566 selective BS, 560, 561 backpropagation learning algorithm batch, 119 online, 118 stochastic, 118 Bayes, 30 classifier, 48 rules, 37, 39 theorem, 38 Bayesian learning, 56 bias, 51, 62, 91, 93 bilinear interpolation, 525, 526 binary coding, 455, 458, 462 binary image, 231, 496, 497 binocular fusion, 351 binocular fusion fixation point, 352 horopter, 353 Vieth-Muller ¨ circumference, 353 binocular vision angular disparity calculation, 389 computational model, 377



670 depth calculation with parallel axes, 385 Marr-Poggio algorithm I, 378 Marr-Poggio algorithm II, 380 PMF algorithm, 406 triangulation equation, 387 binocular disparity, 358, 373, 374, 391 binomial distribution, 78, 142 bipartite graph, 539, 540 blurred image, 466, 472 blurring circle, 473 filter, 470 Boltzmann machine, 236 bounding box, 557, 558 BRDF-Bidirectional Reflectance Distribution Function, 415 Brewster stereoscope, 354 brightness continuity equation, see irradiance constancy constraint equation Brodatz’s texture mosaic, 294 bundle adjustment, 590, 591 C calibration, see camera calibration calibration matrix, 615, 630, 652 calibration sphere, 436, 437 CAM-Content-Addressable Memory, 225 camera coordinates, 614, 626 camera calibration accuracy, 625 algorithms, 603 equation, 589, 590 extrinsic parameters, 589, 590 intrinsic parameters, 588, 589 matrix, 588, 589 platform, 603, 616 radial distortions, 623 stereo vision, 625 tangential distortion, 601 Tsai method, 605 Zhang method, 616 camera projection matrix, 585, 586, see also camera calibration category level classification, 3 Bayesian discriminant, 57 deterministic, 17 FCM-Fuzzy C-Means , 35 Gaussian probability density, 58 interactive, 17 ISODATA, 34

Index Mixtures of Gaussian, 66 MultiLayer Perceptrons - MLP, 110 neural network, 87 nonmetric methods, 125 statistical, 37 Cauchy, 199 CBIR-Content-Based Image Retrieval, 313 center of mass, 59, 591, 592 central limit theorem, 50, 58 centroid, 24 child node, 133 Cholesky factorization, 594, 595, 621 Chow’s rule, 46 clustering, 2 agglomerative hierarchical, 149 divisive hierarchical, 152 Hierarchical, 148 K-means, 30 clustering methods, 4 CND image, 313 CNN-Convolutional Neural Network, 240 coherence measure, 306 collineation, 616 collision time estimation, 573, 574 knowing the FOE, 579, 580 color space, 16 complex motion estimation background, 556, 557 foreground, 557, 558 motion parameters calculation by OF, 580, 581 complex conjugate operator, 447 Computational complexity, 147 confusion circle, 466, 473 contrast, 274, 309, 394 convolution filter, 241, 292, 380 mask, 241, 243, 291, 381, 511, 512 theorem, 278 convolutional layer, 242 CoP-Center of Projection, 567, 568 correlation matrix, 15, 219 correspondence structure detection local POIs, 396 point-like elementary, 394 strategy, 393 correspondence problem, 374, 390, 483, 484, 536, 537, 539, 540, 636, 646 covariance matrix, 13, 50, 66, 545, 546, 564, 565 Cover’s theorem, 194

Index cross correlation function, 397 cross validation, 86, 103 CT-Census Transform, 403 D data compression, 16, 217 decision trees algorithm, 127 C4.5, 137 CART, 143 ID3, 129 deep learning, 238 CNN architectures, 256 dropout, 251 full connected layer, 246 pooling layer, 245 stochastic gradient descent, 249 defocusing, 468, 474 delta function, 84 depth calculation before collision, 576, 577 knowing the FOE, 579, 580 depth map, 342, 374, 375, 426, 453, 472 depth of field, 465, 468 DFS-Deterministic Finite State, 169 DFT-Discrete Fourier Transform, 445 diagonalization, 13 diffuse reflectance, see Lambertian model diffuse reflection, 416 digitization, 388, 599 Dirichlet tessellation, 30, see also Voronoi diagram Discrete Cosine Transform-DCT, 471 disparity map, 380, 403 dispersion function, 466 matrix, 23 measure, 23 parameter, 467 displacement vector, 582, 583 distortion function, 623 distortion measure, 31 divide and conquer, 128, 147 DLT-Direct Linear Transformation, 604 DoG-Difference of Gaussian, 323, 537, 538 E early vision, 342 eccentricity, 345 edge extraction algorithms, 304, 310, 322, 324 ego-motion, 580, 581 eigen decomposition, 60 eigenspace, 564, 565

671 eigenvalue, 13, 24, 612 eigenvector, 13, 514, 515, 564, 565, 606, 612 EKF-Extended Kalman Filter, 556, 557 electric-chemical signal, 91 electrical signal, 89 EM-Expectation–Maximization, 32, 67, 69 epipolar constraint, 632 epipolar geometry, 394, 408, 627 epipolar line, 389, 628 epipolar plane, 389, 628 epipole, 390, 628 Essential matrix, 629 5-point algorithm, 641 7-point algorithm, 641 8-point algorithm, 638 8-point normalization, 641 decomposition, 642 Euclidean distance, 28, 59, 395, 408, 560, 561 Euler-Lagrange equation, 441 extrinsic parameter estimation from perspective projection matrix P, 614 F f# number, 473 factorization methods, 623, 648 false negative, 381 false positive, 381, 563, 564 feature extraction, 194 homologous, 497, 498 selection, 4, 218 significant, 7, 10, 241 space, 8 vector, 4, 282 filter bandpass, 296 binomial, 465 Gabor, 295, 365 Gaussian, 304 Gaussian bandpass, 297 high-pass, 469 Laplacian of Gaussian, 322 low-pass, 468 median, 400 smoothing, 322 Fisher’s linear discriminant function, 21 FOC-Focus Of Contraction, 571–574 focal length, 387, 414, 452, 587, 588, 603, 610 FOE-Focus Of Expansion, 571, 572 calculation, 577, 578 Fourier descriptors, 8

672 Fourier transform, 278 power spectrum, 278, 279 spectral domain, 279, 475 Freeman’s chain code, 155 Frobenius norm, 639 Fundamental and Homography matrix: relationship, 636 Fundamental matrix, 634 fundamental radiometry equation, 417 G Gabor filter bank, 300 Gaussian noise, 287, 554, 555 Gaussian probability density, 559, 560 Gaussian pyramid, 537, 538 GBR-Generalized Bas-Relief transform, 435 generalized cones, 345 geometric collineation, see homography transformation geometric distortion, 388, 448, 599 geometric transformation, 296, 388, 581, 582, 588, 589, 599, 645 geometric transformation in image formation, 602 Gestalt theory, 326 Gini index, 143 GLCM-Gray-Level Co-occurrence Matrix, 270, 310 gradient space, 417, 419 gradient vector, 97 graph isomorphism, 409 Green function, 203 H Hamming distance, 230 harmonic function, 296 Harris corner detector, 532, 533, 616 Helmholtz associationism, 326 Hessian matrix, 530, 531 hierarchical clustering algorithms agglomerative, 149 divisive, 151 high-speed object tracking by KF, 544, 545 histogram, 77, 267 homogeneous coordinates, 528, 529 homography matrix, 463, 617, 637 calculation by SVD decomposition, 617 homography transformation, 616 homologous structures calculation census transform, 403 correlation measures, 398

Index dissimilarity measures SSD & SAD, 399 gradient-based matching, 404 non metric RD, 401 similarity measures, 394 Hopfield network, 225 Hough transform, 578, 579 human binocular vision, 350, 351 human brain, 87, 211 human visual system, 265, 315, 348, 350, 374, 480 hyperplane equation, 62 I ill conditioned, see ill-posed problems ill-posed problems, 201 illumination incoherent, 466 Lambertian, 425 illusion, 356, 485, 486 image compression, 16 image filtering, 291 image gradient, 304, 503, 504 image irradiance Lambertian, 416 image irradiance fundamental equation, 414 image resampling, 649 impulse response, 290 incident irradiance, 416 Information gain, 131 infrared-sensitive camera, 451 inner product, 94 interpolation matrix, 198 interpolation process, 396, 426 intrinsic parameter estimation from homography matrix H, 619 from perspective projection matrix P, 612 intrinsic image, 307 inverse geometric transformation, 655 inverse geometry, 464 inverse problem, 315, 349, 413, 590, 591 irradiance constancy constraint equation, 501, 502 iso-brightness curve, 421 isomorphism, 171, 539, 540 isotropic fractals, 288 Gaussian function, 208 Laplace operator, 469 iterative numerical methods - sparse matrix, 521, 522

Index J Jacobian function, 529, 530 matrix, 531, 532 K KDE-Kernel Density Estimation, 563, 564 kernel function, 81 KF-Kalman filter, 544, 545 ball tracking example, 546, 547, 553, 554 gain, 545, 546 object tracking, 543, 544 state correction, 549, 550 state prediction, 545, 546 KLT algorithm, 536, 537 kurtosis, 269 L Lambertian model, 321, 416, 430 Laplace operator, 281, see also LOG-Laplacian of Gaussian LDA-Linear Discriminant Analysis, 21 least squares approach, 514, 515, 618 lens aperture, 466 crystalline, 387 Gaussian law, 465 line fitting, 513, 514 local operator, 308 LOG-Laplacian of Gaussian, 290, 380 LSE-Least Square Error, 509, 510 LUT-Look-Up-Table, 436 M Mahalonobis distance, 59 MAP-Maximum A Posterior, 39, 67 mapping function, 200 Marr’s paradigm algorithms and data structures level, 318 computational level, 318 implementation level, 318 Maximum Likelihood Estimation, 49 for Gaussian distribution & known mean, 50 for Gaussian with unknown µ and , 50 mean-shift, 566, 567 Micchelli’s theorem, 199 minimum risk theory, 43 MLE estimator distortion, 51 MND-Multivariate Normal Distribution, 58 MoG-Mixtures of Gaussian, 66, see also EM-Expectation–Maximization

673 moment central, 268, 309 inertia, 274 normalized spatial, 8 momentum, 121, 273 motion discretization aperture problem, 498, 499 frame rate, 487, 488 motion field, 494, 495 optical flow, 494, 495 space–time resolution, 492, 493 space-time frequency, 493, 494 time-space domain, 492, 493 visibility area, 492, 493 motion estimation by compositional alignment, 532, 533 by inverse compositional alignment, 533, 534 by Lucas–Kanade alignment, 526, 527 by OF pure rotation, 572, 573 by OF pure translation, 571, 572 by OF-Optical Flow, 570, 571 cumulative images difference, 496, 497 image difference, 496, 497 using sparse POIs, 535, 536 motion field, 485, 486, 494, 495 MRF-Markov Random Field, 286, 476 MSE-Mean Square Error, 124, 544, 545 multispectral image, 4, 13, 16, 17 N NCC-Normalized Cross-Correlation, 527, 528 needle map, 426, see also orientation map neurocomputing biological motivation mathematical model, 90 neurons structure, 88 synaptic plasticity, 89 neuron activation function, 90 ELU-Exponential Linear Units, 245 for traditional neural network, 91 Leaky ReLU, 245 Parametric ReLU, 245 properties of, 122 ReLU-Rectified Linear Units, 244 Neyman–Pearson criterion, 48 nodal point, 352 normal map, 441 normal vector, 417, 419, 430 normalized coordinate, 632 NP-complete problem, 147

674 O optical center, 385 optical flow estimation affine motion, 521, 522 BBPW method, 517, 518 brightness preservation, 502, 503 discrete least-squares, 509, 510 homogeneous motion, 522, 523 Horn-Schunck, 504, 505 Horn-Schunck algorithm, 512, 513 iterative refinement, 523, 524 large displacements, 523, 524 Lucas-Kanade, 513, 514 Lucas-Kanade variant, 516, 517 multi-resolution approach, 525, 526 rigid body, 566, 567 spatial gradient vector, 502, 503 time gradient, 502, 503 orientation map, 342, 348, 419, 426, 440 Oriented Texture Field, 303 orthocenter theorem, 610 orthogonal matrix properties, 607 orthographic projection, 418, 591, 592 orthonormal bases, 607, 608 OTF-Optical Transfer Function, 468 outer product, 608, 632 P Parseval’s theorem, 446 Parzen window, 81 PCA-Principal Component Analysis, 11, 13, 27, 301, 564, 565 principal plane, 14 Pearson’s correlation coefficient, 15 perspective projection, 449 canonical matrix of, 587, 588, 603 center, 587, 588 equations, 567, 568 matrix, 568, 569, 589, 590, 600 matrix estimation, 610 non-linear matrix estimation, 615 pinhole model, 585, 586 physical coherence, 377 pinhole model, 452, 484, 485, 653 pixel coordinates image, 588, 589 POD-proper orthogonal decomposition, see PCA POI-Point of interest, 393, 454, 495, 496, 514, 515, 536–541 POIs tracking using graphs similarity, 539, 540

Index using Kalman filter, 542, 543 using POIs probabilistic correspondence, 540, 541 polar coordinate, 280, 452 preface, vii Prewitt kernel, 310 primary visual cortex, 360, 366 color perception area, 372 columnar organization, 366 complex cells, 364 depth perception area, 373 form perception area, 373 Hubel and Wiesel, 362 hypercomplex cells, 365 interaction between cortex areas, 369 movement perception area, 373 neural pathway, 368 receptive fields, 362 simple cells, 362 visual pathway, 365 projective reconstruction theorem, 664 projective transformation, see homography transformation pseudo-inverse matrix, 209 PSF-Point Spread Function, 466 pyramidal cells, 360 pyramidal neurons, 360 Q QR decomposition, 615 quadratic form, 59, 61 quantization error, 218 R radial optical distortions barrel effect, 601 estimation, 622 pincushion effect, 601 radiance emitted, 416 radiant energy, 416 radiant flux, 416 random-dot stereograms, 356 rank distance, 402 rank transform, 401 Rayleigh theorem, 446 RBC-Reflected Binary Code, 456 RBF-Radial Basis Function, 198 RD-Rank Distance, see rank transform rectification of stereo images calibrated, 652 non calibrated, 646

Index recurrent neural architecture, 223, see also Hopfield network reflectance coefficient, 416 reflectance map, 414 regularization theory, 201 Green function, 203 parameter, 202 Tikhonov functional, 202 retinal disparity, 351 crossed, 353 uncrossed, 353, 356 rigid body transformation, 589, 590 S SAD-Sum of Absolute Difference, 400, 528, 529 SfM-Structure from Motion, 585, 586 3D reconstruction, 590, 591 SVD decomposition methods, 590, 591 Shannon-Nyquist sampling theorem, 493, 494 Shape from Shading equation, 420 Shape from X, 348 contour, 423 defocus, 465, 472 focus, 465, 468 pattern with phase modulation, 459 shading, 413, 420, 424 stereo, 348 stereo photometry, 426 structured colored patterns, 463 structured light, 451 structured light binary coding, 454 structured light gray code, 456 structured light gray level, 458 texture, 448 SIFT tracking, 539, 540 SIFT-Scale-Invariant Feature Transform, 537, 538 descriptor, 537, 538 detector, 537, 538 SIFT-Scale-Invariant Feature Transform, 539, 540 similarity function, see correlation function simulated annealing, 237 skew parameter, 588, 589, 606 skewness, 269 SML-Sum Modified Laplacian, 471 smoothness constraint, 509, 510 smoothness error, 504, 505 SOM-Self-Organizing Map competitive learning, 211

675 feature selection, 219 Hebbian learning, 213 topological ordering, 218 space-time coherence, 525, 526 space-time gradient, 517, 518 sparse elementary structures, 406 sparse map, 495, 496 spatial coherence, 501, 502 spectral band, 4, 40 specular reflection, 416 SPI-Significant Points of Interest, 495, 496 SSD-Sum of Squared Difference, 400, 528, 529 standardization procedure, 14 stereo photometry calibrated, 436 diffuse light equation, 430 uncalibrated, 433 stereopsis, 356 neurophysiological evidence, 358 string recognition Boyer–Moore algorithm, 172 edit distance, 183 supervised learning artificial neuron, 90 backpropagation, 113 Bayesian, 52, 53 dynamic, 121 gradient-descent methods, 100 Ho-Kashyap, 104 perceptron, 95 Widrow–Hoff, 103 surface reconstruction by global integration, 443 by local integration, 442 from local gradient, 441 from orientation map, 440 SVD decomposition, 622, 639 SVD-Singular Value Decomposition, 209, 433 syntactic recognition, 154 ascending analysis, 166 descending analysis, 165 formal grammar, 156 grammars types, 161 language generation, 158 T tangential optical distortions, 601 template image tracking, 528, 529 temporal derivatives, 511, 512 tensor

676

Index

matrix, 515, 516 structure, 514, 515 texture, 261 based on autocorrelation, 276 based on co-occurrence matrix, 272 based on edge metric, 281 based on fractals models, 286 based on Gabor filters, 295 based on Run Length primitives, 283 based on spatial filtering, 290 coherence, 306 Julesz’s conjecture of, 264 oriented field of, 303 perceptive features of, 308 spectral method of, 278 statistical methods of, 267 syntactic methods for, 302 texture visual perception, 261 thin lens, 465 thin lens formula, 472 tracking of POIs using SIFT POIs, 537, 538 trichromatic theory, 359 TTC-Time To Collision, see collision time estimation

U unsupervised learning brain, 89 Hebbian, 213 hierarchical clustering, 148 K-means, 30 Kohonen map, 210 unwrapped phase, 459

V vanishing point, 604, 610 Vector Quantization theory, 220 vision strategy bottom-up hierarchical control, 316 hybrid control, 317 nonhierarchical control, 317 top-down hierarchical control, 317 visual tracking, 479 Voronoi diagram, 30

W warped image, 528, 529 warping transformation, see geometric transformation wavelet transform, 266, 295, 471 white noise, 546, 547 whitening transform, 60 wrapped phase, 460

Z zero crossing, 322, 324, 380, 495, 496 ZNCC-Zero Mean Normalized Cross-Correlation, 398