Arcangelo Distante • Cosimo Distante

Handbook of Image Processing and Computer Vision
Volume 2: From Image to Pattern
Arcangelo Distante Institute of Applied Sciences and Intelligent Systems Consiglio Nazionale delle Ricerche Lecce, Italy
Cosimo Distante Institute of Applied Sciences and Intelligent Systems Consiglio Nazionale delle Ricerche Lecce, Italy
ISBN 978-3-030-42373-5
ISBN 978-3-030-42374-2 (eBook)
https://doi.org/10.1007/978-3-030-42374-2
© Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents and my family, Maria and Maria Grazia
—Arcangelo Distante

To my parents, to my wife Giovanna, and to my children Francesca and Davide
—Cosimo Distante
Preface
In the last 20 years, interdisciplinary research in the fields of physics, information technology and cybernetics, the numerical processing of signals and images, and electrical and electronic technologies has led to the development of Intelligent Systems. The so-called Intelligent Systems (or Intelligent Agents) represent the most advanced and innovative frontier of research in the electronic and computer field, able to directly influence the quality of life, the competitiveness and production methods of companies; to monitor and evaluate the environmental impact; to make public services and management activities more efficient; and to protect people's safety. The study of an intelligent system, regardless of the area of use, can be simplified into three essential components:

1. the first interacts with the environment for the acquisition of data of the domain of interest, using appropriate sensors (for the acquisition of signals and images);
2. the second analyzes and interprets the data collected by the first component, also using learning techniques to build/update adequate representations of the complex reality in which the system operates (Computational Vision);
3. the third chooses the most appropriate actions to achieve the objectives assigned to the intelligent system (choice of Optimal Decision Models), interacting with the first two components and with human operators in the case of application solutions based on man–machine cooperative paradigms (the current evolution of automation, including industrial automation).

Within this scenario of knowledge advancement for the development of Intelligent Systems is framed the content of this manuscript, which reports the authors' multi-year research and teaching experience together with the scientific insights existing in the literature. In particular, the manuscript, divided into three parts (volumes), deals with the aspects of the sensory subsystem needed to perceive the environment in which an intelligent system is immersed and able to act autonomously. The first volume describes the set of fundamental processes of artificial vision that lead to the formation of the digital image from energy. The phenomena of light propagation (Chapters 1 and 2), the theory of color perception (Chapter 3), the
impact of the optical system (Chapter 4), the aspects of transduction from luminous energy are analyzed (the optical flow) with electrical signal (of the photoreceptors), and aspects of electrical signal transduction (with continuous values) in discrete values (pixels), i.e., the conversion of the signal from analog to digital (Chapter 5). These first 5 chapters summarize the process of acquisition of the 3D scene, in symbolic form, represented numerically by the pixels of the digital image (2D projection of the 3D scene). Chapter 6 describes the geometric, topological, quality, and perceptual information of the digital image. The metrics are defined, the aggregation and correlation modalities between pixels, useful for defining symbolic structures of the scene of higher level with respect to the pixel. The organization of the data for the different processing levels is described in Chapter 7, while in Chapter 8 the representation and description of the homogeneous structures of the scene is shown. With Chapter 9 starts the description of the image processing algorithms, for the improvement of the visual qualities of the image, based on point, local, and global operators. Algorithms operating in the spatial domain and in the frequency domain are shown, highlighting with examples the significant differences between the various algorithms also from the point of view of the computational load. The second volume begins with the chapter describing the boundary extraction algorithms based on local operators in the spatial domain and on filtering techniques in the frequency domain. In Chap. 2 are presented the fundamental linear transformations that have immediate application in the field of image processing, in particular, to extract the essential characteristics contained in the images. These characteristics, which effectively summarize the global informational character of the image, are then used for the other image processing processes: classification, compression, description, etc. Linear transforms are also used, as global operators, to improve the visual qualities of the image (enhancement), to attenuate noise (restoration), or to reduce the dimensionality of the data (data reduction). In Chap. 3, the geometric transformations of the images are described, necessary in different applications of the artificial vision, both to correct any geometric distortions introduced during the acquisition (for example, images acquired while the objects or the sensors are moving, as in the case of satellite and/or aerial acquisitions), or to introduce desired visual geometric effects. In both cases, the geometrical operator must be able to reproduce as accurately as possible the image with the same initial information content through the image resampling process. In Chap. 4 Reconstruction of the degraded image (image restoration), a set of techniques are described that perform quantitative corrections on the image to compensate for the degradations introduced during the acquisition and transmission process. These degradations are represented by the fog or blurring effect caused by the optical system and by the motion of the object or the observer, by the noise caused by the opto-electronic system and by the nonlinear response of the sensors, by random noise due to atmospheric turbulence or, more generally, from the process of digitization and transmission. While the enhancement techniques tend to reduce the degradations present in the image in qualitative terms, improving their
visual quality even when there is no knowledge of the degradation model, the restoration techniques are used instead to eliminate or quantitatively attenuate the degradations present in the image, starting also from the hypothesis of knowledge of degradation models. Chapter 5, Image Segmentation describes different segmentation algorithms, which is the process of dividing the image into homogeneous regions, where all the pixels that correspond to an object in the scene are grouped together. The grouping of pixels in regions is based on a homogeneity criterion that distinguishes them from one another. Segmentation algorithms based on criteria of similarity of pixel attributes (color, texture, etc.) or based on geometric criteria of spatial proximity of pixels (Euclidean distance, etc.) are reported. These criteria are not always valid, and in different applications, it is necessary to integrate other information in relation to the a priori knowledge of the application context (application domain). In this last case, the grouping of the pixels is based on comparing the hypothesized regions with the a priori modeled regions. Chapter 6 Detectors and descriptors of points of interest describes the most used algorithms to automatically detect significant structures (known as points of interest, corners, features) present in the image corresponding to stable physical parts of the scene. The ability of such algorithms is to detect and identify physical parts of the same scene in a repeatable way, even when the images are acquired under conditions of lighting variability and change of the observation point with possible change of the scale factor. The third volume describes the artificial vision algorithms that detect objects in the scene, attempt their identification, 3D reconstruction, their arrangement and location with respect to the observer, and their eventual movement. Chapter 1 Object recognition describes the fundamental algorithms of artificial vision to automatically recognize the objects of the scene, essential characteristics of all systems of vision of living organisms. While a human observer also recognizes complex objects, apparently in an easy and timely manner, for a vision machine, the recognition process is difficult, requires considerable calculation time, and the results are not always optimal. Fundamental to the process of object recognition, become the algorithms for selecting and extracting features. In various applications, it is possible to have an a priori knowledge of all the objects to be classified because we know the sample patterns (meaningful features) from which we can extract useful information for the decision to associate (decision making) each individual of the population to a certain class. These sample patterns (training set) are used by the recognition system to learn significant information about the objects population (extraction of statistical parameters, relevant characteristics, etc.). The recognition process compares the features of the unknown objects to the model pattern features, in order to uniquely identify their class of membership. Over the years, there have been various disciplinary sectors (machine learning, image analysis, object recognition, information research, bioinformatics, biomedicine, intelligent data analysis, data mining, …) and the application sectors (robotics, remote sensing, artificial vision, …) for which different researchers have proposed different methods of recognition and developed different algorithms based on
different classification models. Although the proposed algorithms have a unique purpose, they differ in the property attributed to the classes of objects (the clusters) and the model with which these classes are defined (connectivity, statistical distribution, density, …). The diversity of disciplines, especially between automatic data extraction (data mining) and machine learning (machine learning), has led to subtle differences, especially in the use of results and in terminology, sometimes contradictory, perhaps caused by the different objectives. For example, in data mining, the dominant interest is the automatic extraction of groupings, in the automatic classification is fundamental the discriminating power of the classes of belonging of the patterns. The topics of this chapter overlap between aspects related to machine learning and those of recognition based on statistical methods. For simplicity, the algorithms described are broken down according to the methods of classifying objects in supervised methods (based on deterministic, statistical, neural, and nonmetric models such as syntactic models and decision trees) and non-supervised methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong. In Chapter 2 RBF, SOM, Hopfield and deep neural networks, four different types of neural networks are described: Radial Basis Functions—RBF, Self-Organizing Maps—SOM, the Hopfield, and the deep neural networks. RBF uses a different approach in the design of a neural network based on the hidden layer (unique in the network) composed of neurons in which radial-based functions are defined, hence the name of Radial Basis Functions, and which performs a nonlinear transformation of the input data supplied to the network. These neurons are the basis for input data (vectors). The reason why a nonlinear transformation is used in the hidden layer, followed by a linear one in the output one, allows a pattern classification problem to operate in a much larger space (in nonlinear transformation from the input in the hidden one) and is more likely to be linearly separable than a small-sized space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is greater than the cardinality of the input signal). The SOM network, on the other hand, has an unsupervised learning model and has the originality of autonomously grouping input data on the basis of their similarity without evaluating the convergence error with external information on the data. Useful when there is no exact knowledge on the data to classify them. It is inspired by the topology of the brain cortex model considering the connectivity of the neurons and in particular, the behavior of an activated neuron and the influence with neighboring neurons that reinforce the connections compared to those further away that are becoming weaker. With the Hopfield network, the learning model is supervised and with the ability to store information and retrieve it through even partial content of the original information. It presents its originality based on physical foundations that has revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution with a succession of states, until reaching a final state corresponding to the minimum of the energy function. This feature allows it to be used to solve and set up an optimization problem in
terms of the objective function to be associated with an energy function. The chapter concludes with the description of the convolutional neural networks (CNN), by now the most widespread since 2012, based on the deep learning architecture (deep learning). In Chapter 3 Texture Analysis, the algorithms that characterize the texture present in the images are shown. Texture is an important component for the recognition of objects. In the field of image processing has been consolidated with the term texture, any geometric and repetitive arrangement of the levels of gray (or color) of an image. In this context, texture becomes an additional strategic component to solve the problem of object recognition, the segmentation of images and the problems of synthesis. Some of the algorithms described are based on the mechanisms of human visual perception of texture. They are useful for the development of systems for the automatic analysis of the information content of an image obtaining a partitioning of the image in regions with different textures. In Chapter 4 3D Vision Paradigms are reported the algorithms that analyze 2D images to reconstruct a scene typically of 3D objects. A 3D vision system that has the fundamental problem typical of inverse problems, i.e., from single 2D images, which are only a two-dimensional projection of the 3D world (partial acquisition), must be able to reconstruct the 3D structure of the observed scene and eventually define a relationship between the objects. 3D reconstruction takes place starting from 2D images that contain only partial information of the 3D world (loss of information from the projection 3D!2D) and possibly using the geometric and radiometric calibration parameters of the acquisition system. The mechanisms of human vision are illustrated, based also on the a priori prediction and knowledge of the world. In the field of artificial vision, the current trend is to develop 3D systems oriented to specific domains but with characteristics that go in the direction of imitating certain functions of the human visual system. 3D reconstruction methods are described that use multiple cameras observing the scene from multiple points of view, or sequences of time-varying images acquired from a single camera. Theories of vision are described, from the Gestalt laws to the paradigm of Marr’s vision and the computational models of stereovision. In Chapter 5 Shape from Shading—(SfS) are reported the algorithms to reconstruct the shape of the visible 3D surface using only the brightness variation information (shading, that is, the level variations of gray or colored) present in the image. The inverse problem of reconstructing the shape of the surface visible from the changes in brightness in the image is known as the Shape from Shading problem. The reconstruction of the visible surface should not be strictly understood as a 3D reconstruction of the surface. In fact, from a single point of the observation of the scene, a monocular vision system cannot estimate a distance measure between observer and visible object, so with the SfS algorithms, there is a nonmetric but qualitative reconstruction of the 3D surface. It is described by the theory of the SfS based on the knowledge of the light source (direction and distribution), the model of reflectance of the scene, the observation point, and the geometry of the visible surface, which together contribute to the image formation process. The relationships between the light intensity values of the image and the geometry of the
visible surface are derived (in terms of the orientation of the surface, point by point) under some lighting conditions and the reflectance model. Other 3D surface reconstruction algorithms based on the Shape from xxx paradigm are also described, where xxx can be texture, structured light projected onto the surface to be reconstructed, or 2D images of the focused or defocused surface. In Chapter 6 Motion Analysis, the algorithms of perception of the dynamics of the scene are reported, analogous to what happens, in the vision systems of different living beings. With motion analysis algorithms, it is possible to derive the 3D motion, almost in real time, from the analysis of sequences of time-varying 2D images. Paradigms on movement analysis have shown that the perception of movement derives from the information of the objects evaluating the presence of occlusions, texture, contours, etc. The algorithms for the perception of the movement occurring in the physical reality and not the apparent movement are described. Different methods of movement analysis are analyzed from those with limited computational load such as those based on time-variant image difference to the more complex ones based on optical flow considering application contexts with different levels of motion entities and scene-environment with different complexities. In the context of rigid bodies, from the motion analysis, derived from a sequence of time-variant images, are described the algorithms that, in addition to the movement (translation and rotation), estimate the reconstruction of the 3D structure of the scene and the distance of this structure by the observer. Useful information are obtained in the case of mobile observer (robot or vehicle) to estimate the collision time. In fact, the methods for solving the problem of 3D reconstruction of the scene are acquired by acquiring a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (camera not calibrated) together with the non-knowledge of motion. The proposed methods are part of the problem of solving an inverse problem. Algorithms are described to reconstruct the 3D structure of the scene (and the motion), i.e., to calculate the coordinates of 3D points of the scene whose 2D projection is known in each image of the time-variant sequence. Finally, in Chapter 7 Camera Calibration and 3D Reconstruction, the algorithms for calibrating the image acquisition system (normally a single camera and stereovision) are fundamental for detecting metric information (detecting an object’s size or determining accurate measurements of object–observer distance) of the scene from the image. The various camera calibration methods are described that determine the relative intrinsic parameters (focal length, horizontal and vertical dimension of the single photoreceptor of the sensor, or the aspect ratio, the size of the matrix of the sensor, the coefficients of the radial distortion model, the coordinates of the main point or the optical center) and the extrinsic parameters that define the geometric transformation to pass from the reference system of the world to that of camera. The epipolar geometry introduced in Chapter 5 is described in this chapter to solve the problem of correspondence of homologous points in a stereo vision system with the two cameras calibrated and not. With the epipolar geometry is simplified the search for the homologous points between the stereo
images introducing the Essential matrix and the Fundamental matrix. The algorithms for estimating these matrices are also described, known a priori the corresponding points of a calibration platform. With epipolar geometry, the problem of searching for homologous points is reduced to mapping a point of an image on the corresponding epipolar line in the other image. It is possible to simplify the problem of correspondence through a one-dimensional point-to-point search between the stereo images. This is accomplished with the image alignment procedure, known as stereo image rectification. The different algorithms have been described some based on the constraints of the epipolar geometry (non-calibrated cameras where the fundamental matrix includes the intrinsic parameters) and on the knowledge or not of the intrinsic and extrinsic parameters of calibrated cameras. Chapter 7 ends with the section of the 3D reconstruction of the scene in relation to the knowledge available to the stereo acquisition system. The triangulation procedures for the 3D reconstruction of the geometry of the scene without ambiguity are described, given the 2D projections of the homologous points of the stereo images, known the calibration parameters of the stereo system. If only the intrinsic parameters are known, the 3D geometry of the scene is reconstructed by estimating the extrinsic parameters of the system at less than a non-determinable scale factor. If the calibration parameters of the stereo system are not available but only the correspondences between the stereo images are known, the structure of the scene is recovered through an unknown homography transformation. Francavilla Fontana, Italy January 2020
Arcangelo Distante
Cosimo Distante
Acknowledgements
We thank all the fellow researchers of the Department of Physics of Bari, of the Institute of Intelligent Systems for Automation of the CNR (National Research Council) of Bari, and of the Institute of Applied Sciences and Intelligent Systems “Eduardo Caianiello” of the Unit of Lecce, for having indicated errors and parts to be reviewed. We mention them in chronological order: Grazia Cicirelli, Marco Leo, Giorgio Maggi, Rosalia Maglietta, Annalisa Milella, Pierluigi Mazzeo, Paolo Spagnolo, Ettore Stella, and Nicola Veneziani. A thank you is addressed to Arturo Argentieri for the support on the graphic aspects of the figures and the cover. Finally, special thanks are given to Maria Grazia Distante who helped us realize the electronic composition of the volumes by verifying the accuracy of the text and the formulas.
Contents
1 Local Operations: Edging  1
  1.1 Basic Definitions  4
  1.2 Gradient Filter  5
  1.3 Approximation of the Gradient Filter  8
  1.4 Roberts Operator  12
  1.5 Gradient Image Thresholding  13
  1.6 Sobel Operator  14
  1.7 Prewitt Operator  16
  1.8 Frei & Chen Operator  16
  1.9 Comparison of LED Operators  16
  1.10 Directional Gradient Operator  22
  1.11 Gaussian Derivative Operator (DroG)  24
  1.12 Laplacian Operator  26
  1.13 Laplacian of Gaussian (LoG)  30
  1.14 Difference of Gaussians (DoG)  37
  1.15 Second-Directional Derivative Operator  38
  1.16 Canny Edge Operator  39
    1.16.1 Canny Algorithm  44
  1.17 Point Extraction  47
  1.18 Line Extraction  47
  1.19 High-Pass Filtering  48
    1.19.1 Ideal High-Pass Filter (IHPF)  49
    1.19.2 Butterworth High-Pass Filter (BHPF)  50
    1.19.3 Gaussian High-Pass Filter (GHPF)  52
  1.20 Ideal Band-Stop Filter (IBSF)  52
    1.20.1 Butterworth and Gaussian Band-Stop Filter  54
  1.21 Band-Pass Filter (BPF)  54
    1.21.1 Ideal Band-Pass Filter (IBPF)  55
    1.21.2 Butterworth and Gaussian Band-Pass Filter  55
    1.21.3 Difference of Gaussian Band-Pass Filter  56
    1.21.4 Laplacian Filter in the Frequency Domain  57
  1.22 Sharpening Filters  58
    1.22.1 Sharpening Linear Filters  58
    1.22.2 Unsharp Masking  60
    1.22.3 Sharpening High-Boost Filter  62
    1.22.4 Sharpening Filtering in the Frequency Domain  64
    1.22.5 Homomorphic Filter  64
  References  67
2 Fundamental Linear Transforms  69
  2.1 Introduction  69
  2.2 One-Dimensional Discrete Linear Transformation  70
    2.2.1 Unitary Transforms  70
    2.2.2 Orthogonal Transforms  70
    2.2.3 Orthonormal Transforms  71
    2.2.4 Example of One-Dimensional Unitary Transformation  72
  2.3 Two-Dimensional Discrete Linear Transformation  73
    2.3.1 Example of a Two-Dimensional Unitary Transformation  74
  2.4 Observations on Unitary Transformations  74
    2.4.1 Properties of Unitary Transformations  76
  2.5 Sinusoidal Transforms  77
  2.6 Discrete Cosine Transform (DCT)  78
  2.7 Discrete Sine Transform (DST)  82
  2.8 Discrete Hartley Transform (DHT)  82
  2.9 Transform with Rectangular Functions  84
    2.9.1 Discrete Transform of Hadamard—DHaT  84
    2.9.2 Discrete Transform of Walsh (DWHT)  89
    2.9.3 Slant Transform  91
    2.9.4 Haar Transform  93
  2.10 Transform Based on the Eigenvectors and Eigenvalues  95
    2.10.1 Principal Component Analysis (PCA)  98
    2.10.2 PCA/KLT for Data Compression  103
    2.10.3 Computation of the Principal Axes of a Two-Dimensional Object  109
    2.10.4 Dimensionality Reduction  109
    2.10.5 Calculation of Significant Components in Multispectral Images  111
    2.10.6 Eigenface—Face Recognition  114
  2.11 Singular Value Decomposition SVD Transform  120
  2.12 Wavelet Transform  123
    2.12.1 Continuous Wavelet Transforms—CWT  125
    2.12.2 Continuous Wavelet Transform—2D CWT  127
    2.12.3 Wavelet Transform as Band-pass Filtering  128
    2.12.4 Discrete Wavelet Transform—DWT  129
    2.12.5 Fast Wavelet Transform—FWT  131
    2.12.6 Discrete Wavelet Transform 2D—DWT2  134
    2.12.7 Biorthogonal Wavelet Transform  139
    2.12.8 Applications of the Discrete Wavelet Transform  140
  2.13 Summary of the Chapter  145
  References  146
3 Geometric Transformations  149
  3.1 Introduction  149
  3.2 Homogeneous Coordinates  149
  3.3 Geometric Operator  151
    3.3.1 Translation  154
    3.3.2 Magnification or Reduction  154
    3.3.3 Rotation  155
    3.3.4 Skew or Shear  155
    3.3.5 Specular  156
    3.3.6 Transposed  156
    3.3.7 Coordinate Systems and Homogeneous Coordinates  156
    3.3.8 Elementary Homogeneous Geometric Transformations  157
  3.4 Geometric Affine Transformations  160
    3.4.1 Affine Transformation of Similarity  161
    3.4.2 Generalized Affine Transformation  161
    3.4.3 Elementary Affine Transformations  163
  3.5 Separability of Transformations  166
  3.6 Homography Transformation  167
    3.6.1 Applications of the Homography Transformation  172
  3.7 Perspective Transformation  174
  3.8 Geometric Transformations for Image Registration  177
  3.9 Nonlinear Geometric Transformations  178
  3.10 Geometric Transformation and Resampling  181
    3.10.1 Ideal Interpolation  184
    3.10.2 Zero-Order Interpolation (Nearest-Neighbor)  189
    3.10.3 Linear Interpolation of the First Order  190
    3.10.4 Biquadratic Interpolation  192
    3.10.5 Bicubic Interpolation  194
    3.10.6 B-Spline Interpolation  197
    3.10.7 Interpolation by Least Squares Approximation  200
    3.10.8 Non-polynomial Interpolation  201
    3.10.9 Comparing Interpolation Operators  202
  References  208
4 Reconstruction of the Degraded Image: Restoration  209
  4.1 Introduction  209
  4.2 Noise Model  210
    4.2.1 Gaussian Additive Noise  212
    4.2.2 Other Statistical Models of Noise  214
    4.2.3 Bipolar Impulse Noise  217
    4.2.4 Periodic and Multiplicative Noise  217
    4.2.5 Estimation of the Noise Parameters  218
  4.3 Spatial Filtering for Noise Removal  219
    4.3.1 Geometric Mean Filter  219
    4.3.2 Harmonic Mean Filter  220
    4.3.3 Contraharmonic Mean Filter  221
    4.3.4 Order-Statistics Filters  221
    4.3.5 Application of Spatial Mean Filters and on Order Statistics for Image Restoration  222
  4.4 Adaptive Filters  225
    4.4.1 Adaptive Median Filter  226
  4.5 Periodic Noise Reduction with Filtering in the Frequency Domain  228
    4.5.1 Notch Filters  230
    4.5.2 Optimum Notch Filtering  232
  4.6 Estimating the Degradation Function  235
    4.6.1 Derivation of HD by Observation of the Degraded Image  235
    4.6.2 Derivation of HD by Experimentation  236
    4.6.3 Derivation of HD by Physical–Mathematical Modeling: Motion Blurring  236
    4.6.4 Derivation of HD by Physical–Mathematical Modeling: Blurring by Atmospheric Turbulence  238
  4.7 Inverse Filtering—Deconvolution  239
    4.7.1 Application of the Inverse Filter: Example 1  242
    4.7.2 Application of the Inverse Filter: Example 2  243
    4.7.3 Application of the Inverse Filter: Example 3  244
  4.8 Optimal Filter  244
    4.8.1 Wiener Filter  244
    4.8.2 Analysis of the Wiener Filter  253
    4.8.3 Application of the Wiener Filter: One-Dimensional Case  254
    4.8.4 Application of the Wiener Filter: Two-Dimensional Case  255
  4.9 Power Spectrum Equalization—PSE Filter  257
  4.10 Constrained Least Squares Filtering  259
  4.11 Geometric Mean Filtering  260
  4.12 Nonlinear Iterative Deconvolution Filter  262
  4.13 Blind Deconvolution  264
  4.14 Nonlinear Diffusion Filter  265
  4.15 Bilateral Filter  267
  4.16 Dehazing  268
  References  269
5 Image Segmentation  271
  5.1 Introduction  271
  5.2 Regions and Contours  272
  5.3 The Segmentation Process  272
    5.3.1 Segmentation by Global Threshold  273
  5.4 Segmentation Methods by Local Threshold  275
    5.4.1 Method Based on the Objects/Background Ratio  275
    5.4.2 Method Based on Histogram Analysis  276
    5.4.3 Method Based on the Gradient and Laplacian  277
    5.4.4 Method Based on Iterative Threshold Selection  278
    5.4.5 Method Based on Inter-class Maximum Variance - Otsu  280
    5.4.6 Method Based on Adaptive Threshold  284
    5.4.7 Method Based on Multi-band Threshold for Color and Multi-spectral Images  287
  5.5 Segmentation Based on Contour Extraction  289
    5.5.1 Edge Following  290
    5.5.2 Connection of Broken Contour Sections  292
    5.5.3 Connected Components Labeling  295
    5.5.4 Filling Algorithm for Complex Regions  297
    5.5.5 Contour Extraction Using the Hough Transform  298
  5.6 Region Based Segmentation  309
    5.6.1 Region-Growing Segmentation  309
    5.6.2 Region-Splitting Segmentation  312
    5.6.3 Split-and-Merge Image Segmentation  313
  5.7 Segmentation by Watershed Transform  315
    5.7.1 Watershed Algorithm Based on Flooding Simulation  315
    5.7.2 Watershed Algorithm Using Markers  317
  5.8 Segmentation Using Clustering Algorithms  319
    5.8.1 Segmentation Using K-Means Algorithm  321
    5.8.2 Segmentation Using Mean-Shift Algorithm  321
  References  331
6 Detectors and Descriptors of Interest Points  333
  6.1 Introduction  333
  6.2 Point of Interest Detector—Moravec  335
    6.2.1 Limitations of the Moravec Operator  338
  6.3 Point of Interest Detector—Harris–Stephens  340
    6.3.1 Limits and Properties of the Harris Algorithm  346
  6.4 Variations of the Harris–Stephens Algorithm  348
  6.5 Point of Interest Detector—Hessian  349
  6.6 Scale-Invariant Interest Points  350
    6.6.1 Scale-Space Representation  353
  6.7 Scale-Invariant Interest Point Detectors and Descriptors  357
    6.7.1 SIFT Detector and Descriptor  357
    6.7.2 SIFT Detector Component  357
    6.7.3 SIFT Descriptor Component  363
    6.7.4 GLOH Descriptor  368
  6.8 SURF Detector and Descriptor  369
    6.8.1 SURF Detector Component  369
    6.8.2 SURF Descriptor Component  376
    6.8.3 Harris–Laplace Detector  381
    6.8.4 Hessian–Laplace Detector  383
  6.9 Affine-Invariant Interest Point Detectors  386
    6.9.1 Harris-Affine Detector  387
    6.9.2 Hessian-Affine Detector  394
  6.10 Corner Fast Detectors  395
    6.10.1 SUSAN—Smallest Univalue Segment Assimilating Nucleus Detector  395
    6.10.2 Trajkovic–Hedley Segment Test Detector  397
    6.10.3 FAST—Features from Accelerated Segment Test Detector  399
  6.11 Regions Detectors and Descriptors  403
    6.11.1 MSER—Maximally Stable Extremal Regions Detector  404
    6.11.2 IBR—Intensity Extrema-Based Regions Detector  406
    6.11.3 Affine Salient Regions Detector  408
    6.11.4 EBR—Edge-Based Region Detector  410
    6.11.5 PCBR—Principal Curvature Based Region Detector  412
    6.11.6 SISF—Scale-Invariant Shape Features Detector  416
  6.12 Summary and Conclusions  418
  References  422
Index  425
1 Local Operations: Edging
In the early stages of vision, some intrinsic characteristics that are relevant for identifying the objects of the scene are extracted from the image. These elementary characteristics, called low level, are used by the visual system to isolate the objects of the scene, perceived as relatively homogeneous regions with respect to one or more of the following attributes: gray level (intensity), color, texture, distance, motion, nature of the material, surface condition of the objects, etc. To an artificial vision system, analogously to visual perception, the acquired image of the scene presents itself as a group of regions separated by abrupt discontinuities that can be assessed on the basis of some of the attributes indicated above. In this chapter, we are interested in the analysis and detection of the image pixels where these discontinuities occur. They are typically associated with the edges of the objects, that is, with boundary zones between regions of the image with different attributes or, more generally, with discontinuities of reflectance and with different illumination of the objects. All these causes, together with the intrinsic noise of the images, make edge extraction a non-trivial operation. Extracting the edges of an image is still today a fundamental task of image analysis. In this chapter, the most common algorithms for determining the edges with local operators, known in the literature as Local Edge Detectors (LED), will be described. These algorithms assign to each pixel a value that is automatically evaluated as an edge element, but no information is generated to link the various edge pixels together (edge linking) to form edge segments. Other algorithms, described later, link together the edge pixels belonging to the same contour. The LED algorithms are local operators that determine local variations through a direct analysis of the gray-level values of the image, or through local variations of the first derivative of the intensity function. These local discontinuities near the edges can be of various types. A brief description of these discontinuities and a graphical representation of the edge profiles is given below.
Fig. 1.1 Discontinuity: a step, b line, c linear (ramp)
Fig. 1.2 Linear discontinuity with ascent and descent
Step discontinuity (Step Edge): this is the ideal case in which the intensity values change abruptly from a given value (associated with the pixels of a region) to a different value (associated with the pixels of an adjacent region), as shown in Fig. 1.1a.

Line discontinuity: another ideal case, in which the intensity values change abruptly for a few adjacent pixels and then immediately return to the initial values (Fig. 1.1b). Due to the instability of the sensing devices of the acquisition systems, which produces unwanted high-frequency components, real images include abrupt step and line discontinuities. In real images these discontinuities appear with different profiles, which take the following geometric shapes.

Linear discontinuity (Ramp Edge): step discontinuities are transformed into a linear discontinuity with a less abrupt, more gradual transition that affects several pixels (Fig. 1.1c).

Linear discontinuity with ascent and descent (Roof Edge): a line discontinuity is transformed into a double linear discontinuity, with an ascent followed by a descent, affecting several pixels; it normally represents a region of separation between homogeneous adjacent areas (Fig. 1.2).

Fig. 1.3 Combination of step and line discontinuities

Fig. 1.4 Profile of gray levels for an image line

In real images, more complex discontinuities can occur, generated by the combination of step and line discontinuities (Fig. 1.3). The contours of a region or object are normally generated by step discontinuities, because the gray-level variations between background and object are prominent. Variations in local intensity (discontinuities) can be assessed quantitatively through three measurements: (a) the ramp height (change in intensity level), (b) the slope of the ramp, and (c) the horizontal position of the center pixel of the ramp. The peculiarity of the LED algorithms, in addition to determining the edge pixels, is to estimate their position in the image with good approximation. In real images, this is made difficult by the presence of noise, and local intensity variations are not always caused by local geometric variations in the scene or by the inhomogeneity of adjacent regions, but may be due to components of specularly reflected light and/or multiple light reflections. Figure 1.4 shows the gray-level profile of a real image which, although not complex, reveals that locally the discontinuities (in this case due to the transition between the background and the coins) are only approximations of the ideal step and line discontinuities. The LED algorithms, in relation to their mode of operation, can be grouped into two categories: (a) differentiation-based operators; (b) operators based on edge model approximation. The operators of the first category have two phases: spatial analysis and edge computation. With spatial analysis, the image is processed to produce a gradient image, with the objective of enhancing all the significant discontinuities present. In the second phase, edge computation, the locations of the edges associated with the discontinuities found in the differentiation process are computed from the gradient image. The operators of the second category approximate homogeneous regions of the image with predefined models of step or line discontinuity, operating in the geometric space.
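The two-phase scheme of the differentiation-based operators can be sketched in a few lines of code. The following is a minimal illustration, not one of the specific LED operators described later in this chapter: it assumes NumPy, approximates the gradient image with central differences, and uses a simple user-chosen threshold for the edge-computation phase; the function names and parameter values are hypothetical.

```python
import numpy as np

def gradient_image(f):
    """Phase 1 (spatial analysis): approximate the gradient magnitude
    of a gray-level image f using central differences."""
    f = f.astype(float)
    fy, fx = np.gradient(f)      # derivatives along rows (y) and columns (x)
    return np.hypot(fx, fy)      # gradient magnitude at every pixel

def edge_map(f, threshold):
    """Phase 2 (edge computation): keep only the pixels whose gradient
    magnitude exceeds a user-chosen threshold."""
    return gradient_image(f) >= threshold

# Tiny synthetic example: a dark square on a bright background.
img = np.full((8, 8), 200.0)
img[2:6, 2:6] = 50.0
print(edge_map(img, threshold=40.0).astype(int))  # ring of 1s around the square
```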
The LED algorithms of the first category, based on differential geometry, can be further grouped into algorithms that make use of first-order and of second-order derivatives, both in the spatial domain. In particular, the first derivative must be: (a) null in areas with constant intensity levels; (b) non-null in areas of discontinuity (e.g., a step); (c) non-null in areas of increasing or decreasing intensity (ramps).
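As a quick numerical check of these three conditions, the sketch below (assuming NumPy; the profile values are invented for illustration) applies a finite-difference approximation of the first derivative to a synthetic 1D profile containing constant plateaus, a ramp, and a step.

```python
import numpy as np

# Synthetic 1D gray-level profile: constant plateau, an ascending ramp,
# a higher plateau, and finally a step discontinuity downwards.
profile = np.array([10, 10, 10, 12, 14, 16, 18, 20, 20, 20, 5, 5], dtype=float)

# First derivative approximated by finite differences.
d1 = np.diff(profile)
print(d1)
# [  0.   0.   2.   2.   2.   2.   2.   0.   0. -15.   0.]
# -> null on the constant plateaus (a), large at the step (b),
#    constant and non-null along the ramp (c).
```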
1.1 Basic Definitions

Before describing the LED algorithms, it is useful to introduce some definitions and terms that will be essential in the following.

Edge element. An edge element is a point of the image, located by the coordinates (i, j), where a discontinuity in the intensity value occurs.

Edge tract. Consists of a set of edge points aligned in a rectilinear way; it is characterized by the coordinates (i, j, θ) that locate the tract (for example, one of its ends) in the image plane at position (i, j), oriented by an angle θ with respect to a reference axis (for example, the horizontal axis).

LED algorithm. A Local Edge Detector is an algorithm that generates the set of edge components (elements and tracts) present in an image.

Contour. A list of edges (elements and tracts) that normally represents (delimits or encloses) a homogeneous region or an object present in the image. A contour can be represented by a curve, described mathematically by a function that models (approximates) the list of edges.

Connecting the edges. Edge chaining is a process that aims to connect edge fragments belonging to the same contour. Such a contour is interrupted by missing edge fragments due, for example, to the noise present in the image or to overlapping objects. The final result of this process is an ordered list of such edges.

Tracking the edges. The process of tracking the edges, starting from an element or edge fragment, aims to find all the edges associated with the same contour. In other words, it produces an ordered list of edges that may belong to a region or to a certain user-defined notion of contour.

In the following, the word edge will be used interchangeably for both the edge element and the edge fragment. The edges identify pixels of the image at the boundary between regions that are relatively homogeneous with respect to one or more attributes: gray values, color, texture, motion, shape changes, etc. In the consolidated model of visual perception, based on the paradigm from signals to symbols, the edges extracted in the first levels of perception are the first information (raw data), fundamental to the intermediate and final processes of vision, which transform this information into a more useful and explicit representation of the scene. It is the set of edges, extracted by the LED algorithms, that is used to partition the image into homogeneous regions that identify particular attributes of the objects of the scene. The algorithms of the intermediate and final levels of vision must, however, take into account that, due to noise, the extracted edges include artifact edges, missing edges, and false edges. Several local operators for edge extraction are described below. These operators, in analogy with the smoothing operators, will be defined as filters, and the coefficients of the convolution mask (the impulse response of the filter) must be determined appropriately when working in the spatial domain. Alternatively, operating in the frequency domain, an appropriate transfer function is defined. In this latter domain, some filters for edge extraction will be described based on the high-pass approach, i.e., passing the high-frequency components (of the Fourier transform), as opposed to the smoothing algorithms based on low-pass filters (described in Chap. 9, Vol. I).
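Before moving on to the gradient filter, here is a minimal, hypothetical sketch of how the entities defined at the beginning of this section (edge element, edge tract, contour, edge chaining) might be represented in code; the class and function names are not from the book, and the greedy chaining routine only illustrates the idea of connecting edge fragments, not an actual linking algorithm.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EdgeElement:
    """An edge element: image coordinates (i, j) and the local orientation
    theta (radians, measured from the horizontal axis), as in (i, j, theta)."""
    i: int
    j: int
    theta: float

# A contour is simply an ordered list of edge elements (or tracts).
Contour = List[EdgeElement]

def chain_edges(elements: List[EdgeElement], max_gap: int = 1) -> List[Contour]:
    """Naive edge chaining: greedily append each element to the first contour
    whose last point lies within a small neighborhood; otherwise start a new
    contour. Real linking algorithms are considerably more elaborate."""
    contours: List[Contour] = []
    for e in elements:
        for c in contours:
            last = c[-1]
            if abs(e.i - last.i) <= max_gap and abs(e.j - last.j) <= max_gap:
                c.append(e)
                break
        else:
            contours.append([e])
    return contours
```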
1.2 Gradient Filter

Smoothing filters eliminate the high-frequency structures present in the image. On the contrary, edging filters must enhance the high-frequency structures, i.e., the discontinuities of the intensity values, and eliminate the low frequencies, that is, the small variations of the intensity values. A mathematical operator that best highlights the discontinuity of a function is the first-derivative operator. In one dimension, as can be seen from Fig. 1.5a, given the function f(x), the point of maximum variation of the function is obtained at the maximum of the first derivative f_x(x). It can be guessed that this point corresponds to an edge point, around which the derivative f_x assumes large values of maximum variability. For a two-dimensional image f(x, y), the first partial derivatives ∂f/∂x and ∂f/∂y detect the discontinuity of the image in the directions of the x and y coordinate axes, respectively. The orientation of the edges in the image does not necessarily coincide with the coordinate axes. It follows the need to calculate the directional derivative of the image f at each point (x, y) in the direction r of maximum discontinuity of the gray levels (local maximum of the derivative of f), as shown in Fig. 1.5b by the straight line r passing through the point A. The operator that presents these characteristics is the gradient. The gradient of f(x, y) at the point A in the generic direction r, with angle θ (referred to the x axis), is given by¹

    ∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) = (∂f/∂x) cos θ + (∂f/∂y) sin θ = f_x(x, y) cos θ + f_y(x, y) sin θ     (1.1)

¹ We will see later that the gradient ∇f does not depend on how the coordinate axes are oriented.
Fig. 1.5 Definition of an edge through the concept of gradient, analyzing the pixel intensity discontinuity and assuming a continuous image function f(x, y). a Graphs of f(x) and of its derivative f_x(x), representing the 1D gray-level profile of a segment of the 2D image or of an image line; in this case the point of intensity discontinuity coincides with the maximum of the derivative. b Sketch of a 2D image with contours and a point A where we want to verify whether it is an edge point. As explained in the text, this condition occurs if the gradient of f reaches its maximum value at A, with the edge direction orthogonal to the direction of maximum variation of f(x, y), i.e., to the direction of the maximum value of the gradient ∇f
We are interested in the maximum value of the gradient ∂f/∂r, oriented in the direction of maximum variation of the function f(x, y). This is achieved by differentiating the gradient function with respect to θ and setting the result equal to zero, as follows:

$$\frac{\partial}{\partial\theta}\left(\frac{\partial f}{\partial r}\right) = -\frac{\partial f}{\partial x}\sin\theta + \frac{\partial f}{\partial y}\cos\theta = 0 \quad (1.2)$$

from which

$$\theta_{gmx}(x, y) = \arctan(f_y / f_x) \quad (1.3)$$
where θ_gmx(x, y) is the direction of maximum variation of f (i.e., the direction perpendicular to the edge) with respect to the x axis, with f_x and f_y the partial derivatives of f with respect to the x axis and to the y axis, respectively. From now on, we call gradient vector the maximum value of the derivative (∂f/∂r)_gmx in the direction of maximum variation of f, which in differential notation is defined by

$$\mathrm{Grad}\, f = \nabla f = \frac{\partial f}{\partial x}\mathbf{i} + \frac{\partial f}{\partial y}\mathbf{j} \quad (1.4)$$
where i and j are, in this case, the unit vectors of the x and y coordinate axes, respectively. A quantitative measure of the maximum variation of f in the direction defined by the gradient ∇f is given by the gradient vector magnitude, defined as follows:

$$|\nabla f| = \sqrt{f_x^2 + f_y^2} \quad (1.5)$$

The magnitude and direction of the gradient are independent of the coordinates (x, y) of the reference system. This can be demonstrated by considering that the image f is a scalar function whose value depends only on the pixel value at A and not on its
coordinates (x, y). Moreover, if one considers in A a unit vector r coincident with the direction of the straight line r, it is also independent of the choice of the coordinate axes. Therefore, the directional derivative D_r f of f with respect to the r direction, written as an inner product between vectors, is

$$D_r f = \frac{\partial f}{\partial r} = \mathbf{r} \cdot \nabla f = |\mathbf{r}|\,|\nabla f|\cos\alpha = |\nabla f|\cos\alpha \quad (1.6)$$

where α is the angle between the unit vector r and the gradient vector ∇f. The value of the directional derivative D_r f is maximum when cos α = 1, which occurs for α = 0° (i.e., when the vectors r and ∇f are parallel, see Fig. 1.5b), and consequently we have D_r f = |∇f|. This demonstrates the assertion that the magnitude and the direction of the gradient of f are independent of the choice of reference coordinates (properties of the gradient). We summarize the results achieved so far in order to apply the gradient function for the calculation of edges.

1. The directional derivative, or directional gradient, of the image function f expresses a measure of the variations in intensity of f with respect to a direction r:
$$D_r f = \frac{\partial f}{\partial r} = f_x\cos\theta + f_y\sin\theta$$
2. The gradient vector ∇f has components coinciding with the partial derivatives of the function f with respect to the x and y coordinate axes:
$$\nabla f = f_x\mathbf{i} + f_y\mathbf{j}$$
where i and j are the versors² of the coordinate system (x-y).
3. The gradient vector ∇f has the direction corresponding to the derivative D_r f coinciding with the maximum value of the intensity variations of the image. It is perpendicular to the edge and, relative to the x axis, has an angle given by
$$\theta_{gmx} = \arctan(f_y / f_x) \quad (1.7)$$
4. The magnitude of the gradient is defined as
$$|\nabla f| = \sqrt{f_x^2 + f_y^2}$$
To reduce the computational complexity, the magnitude of the gradient can be approximated by considering the absolute values:
$$|\nabla f| \cong |f_x| + |f_y| \quad \text{or} \quad |\nabla f| \cong \max\left(|f_x|, |f_y|\right)$$
5. The magnitude |∇f| and the direction θ_gmx of the maximum gradient are independent of the choice of the reference system, i.e., they are invariant with the rotation of the coordinate system.
6. This last property also shows that the magnitude of the gradient is independent of the local orientation of the edge. From this it follows that the gradient filter is an isotropic local operator.

² In geometry and physics, the versor of an axis or of a vector is a unit vector indicating its direction.
1.3 Approximation of the Gradient Filter

Now let us see how the gradient filter can be calculated through differentiation, which in turn is approximated as a linear filter. Let f(x, y) be a 2D image; the partial derivative with respect to the x axis at a point (x, y) is given as follows:

$$f_x(x, y) = \frac{\partial f(x, y)}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x} \approx f(x + \Delta x, y) - f(x, y) \quad (1.8)$$

which is defined as the partial derivative approximated by the asymmetric difference. Similarly, the approximation by the symmetric difference is determined, given by

$$f_x(x, y) = \frac{\partial f(x, y)}{\partial x} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x, y) - f(x - \Delta x, y)}{2\Delta x} \approx \frac{1}{2}\left[f(x + \Delta x, y) - f(x - \Delta x, y)\right] \quad (1.9)$$
Similarly, the partial derivative with respect to the y axis can be obtained. With respect to both axes, it is possible to derive an approximation equivalent to Eq. (1.8) for the asymmetric backward difference. For a digital image of dimension M × N, the gradient vector (magnitude and orientation), by virtue of Eqs. (1.8) and (1.9), can therefore be approximated by replacing the differentiation in two orthogonal directions along the respective coordinate axes with the symmetric and asymmetric differences in the respective x and y directions. The following approximations of f_x are given for the rows:

$$f_x(i, j) = f(i, j+1) - f(i, j) \qquad \text{Asymmetric forward difference} \quad (1.10)$$
$$f_x(i, j) = f(i, j) - f(i, j-1) \qquad \text{Asymmetric backward difference} \quad (1.11)$$
$$f_x(i, j) = \frac{1}{2}\left[f(i, j+1) - f(i, j-1)\right] \qquad \text{Symmetric difference} \quad (1.12)$$
Similarly, for the differences on the columns, the approximations of f_y are given:

$$f_y(i, j) = f(i+1, j) - f(i, j) \qquad \text{Asymmetric forward difference} \quad (1.13)$$
$$f_y(i, j) = f(i, j) - f(i-1, j) \qquad \text{Asymmetric backward difference} \quad (1.14)$$
$$f_y(i, j) = \frac{1}{2}\left[f(i+1, j) - f(i-1, j)\right] \qquad \text{Symmetric difference} \quad (1.15)$$
These approximations lead to an implementation as linear filters through the following convolution masks h:

$${}^{+}h_R = \begin{bmatrix} -1^{\bullet} & 1 \end{bmatrix} \qquad {}^{+}h_C = \begin{bmatrix} 1^{\bullet} \\ -1 \end{bmatrix}$$
$${}^{-}h_R = \begin{bmatrix} -1 & 1^{\bullet} \end{bmatrix} \qquad {}^{-}h_C = \begin{bmatrix} 1 \\ -1^{\bullet} \end{bmatrix}$$
$${}^{s}h_R = \frac{1}{2}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix} \qquad {}^{s}h_C = \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}$$

where the indices R and C in h identify the convolution masks used to calculate the gradient along the rows and columns, respectively. The symbol "•" indicates the pixel
Fig. 1.6 Evaluation of the displacement of the edges on the test image, introduced by the difference filtering with the convolution masks +h_R, +h_C
being processed in the case of asymmetric masks (with two elements in the previous case). The symbols (+, −, and s) indicate the convolution masks calculated with the forward, backward, and symmetric difference, respectively. With the approximations calculated in this way, the gradient components f_x and f_y are localized not at the same coordinates as the pixel being processed (left, right, or center pixels) but at an intermediate position between the two pixels, i.e., at the point (i, j + 1/2) for f_x and at the point (i + 1/2, j) for f_y. This corresponds to a shift of the edge position in the gradient image by half a pixel. The effects of the gradient approximation can be better highlighted and quantified by operating in the frequency domain. In the latter domain, the nth differentiation of a function f corresponds to the multiplication of the spectrum value by a factor proportional to the frequency, and this leads to the exaltation of the high frequencies. The transfer functions in the case of approximation by backward differences are given by

$${}^{-}H_R(u, v) = 2i\sin\!\left(\frac{\pi u}{M}\right) \quad \text{for the horizontal component}, \qquad {}^{-}H_C(u, v) = 2i\sin\!\left(\frac{\pi v}{N}\right) \quad \text{for the vertical component}$$
$${}^{s}H_R(u, v) = i\sin\!\left(\frac{2\pi u}{M}\right) \quad \text{and} \quad {}^{s}H_C(u, v) = i\sin\!\left(\frac{2\pi v}{N}\right) \quad \text{for the symmetric difference}$$

With this approximation, the difference gradient filter deviates very much from the ideal derivative filter, whose transfer function is i2πu/M · F(u, v) and i2πv/N · F(u, v) for the row and column component, respectively. Compared to the transfer functions of smoothing filters, the first-order derivative filters, approximated by differences, have the disadvantage of altering the original position of the edges (see Fig. 1.6). This is due to the antisymmetry of the above-mentioned convolution masks, whose transfer functions present only the sine
Fig. 1.7 Transfer functions of the gradient approximation filters to the forward, backward, symmetric differences corresponding, respectively, to the + h R , + h C , s h R , and s h C impulse responses
components, thus introducing a phase shift of 90°. Figure 1.7 shows the transfer functions for the convolution masks +h_R, +h_C, s_h_R, and s_h_C. It is also noted that a derivative filter must not respond in homogeneous areas of the image; it follows that the sum of the mask coefficients must be zero, together with the transfer function at zero frequency (H(0, 0) = 0). The symmetric difference filter ^s h incorporates a low-pass (smoothing) behavior and can be thought of as the combination of a smoothing filter and an edge enhancement filter:

$${}^{s}h_R = {}^{-}h_R \circledast B_R = \begin{bmatrix} -1^{\bullet} & 1 \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} 1^{\bullet} & 1 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$$
$${}^{s}h_C = {}^{-}h_C \circledast B_C = \begin{bmatrix} 1 \\ -1^{\bullet} \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}$$

where B_R = 1/2 [1 1] represents the two-pixel elementary average smoothing filter. The horizontal and vertical convolution masks found are used only for the extraction of the dominant edges in a direction perpendicular to the operators themselves. For this purpose, the following convolution operators can be applied:

$$g_R(i, j) = f(i, j) \circledast h_R(i, j) = \sum_{l=-1}^{+1} f(i, j+l)\, h_R(l) \quad (1.16)$$
$$g_C(i, j) = f(i, j) \circledast h_C(i, j) = \sum_{l=-1}^{+1} f(i+l, j)\, h_C(l) \quad (1.17)$$
Fig. 1.8 Edge detection with the gradient filter approximated by finite differences: a original image; b horizontal differences; and c vertical differences
The results of these two filters (horizontal and vertical) are shown in Fig. 1.8. Recall that the initial goal was to calculate edges of any orientation in the image plane. This is achieved by calculating the magnitude of the gradient expressed by one of the following equations:

$$|\nabla f| = \sqrt{g_R \cdot g_R + g_C \cdot g_C} \quad (1.18)$$
$$|\nabla f| = |g_R| + |g_C| \quad (1.19)$$

As highlighted by the flowchart, the extraction of the edges involves several steps (a code sketch of these steps is given after the discussion of Fig. 1.9 below):

1. Filter the image (Fig. 1.9a) by independently applying the convolution masks h_R and h_C.
2. Compute the square of each pixel of the two images obtained in step 1.
3. Sum the homologous pixels of the two images of step 2.
4. Compute the gradient magnitude for each pixel of the step 3 image with one of Eqs. (1.18) or (1.19). The resulting image, Fig. 1.9b or Fig. 1.9c, is the edge image.
5. Compute the edge orientation with respect to the x axis (through Eq. 1.3), approximated by the arctangent of g_C/g_R (i.e., the components of the horizontal and vertical gradient), quantities calculated in step 1. The resulting image is the edge orientation map, also called the phase map (see Fig. 1.9f).

Figure 1.9 shows an example of a gradient and phase image calculated with the approximation by symmetric differences. Note the complexity of the orientation
Fig. 1.9 Example of an image gradient calculated with symmetric difference filters: a Original image; b Image of the horizontal gradient calculated with Eq. (1.12); c Image of the vertical gradient calculated with Eq. (1.15); d Gradient magnitude calculated with Eq. (1.18); e Gradient magnitude calculated with Eq. (1.19); f Gradient direction θgmx calculated with Eq. (1.3); g Gradient magnitude of image d after applying an appropriate threshold to highlight the significant edges
map. From these considerations, it emerges that the LED algorithms that will be presented in the following share a common characteristic: they approximate, as well as possible in the discrete domain, the derivative operator appropriately combined with a noise attenuation operator.
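As a minimal sketch (not taken from the book's material), the steps above can be expressed with NumPy; np.gradient computes the central (symmetric) differences of Eqs. (1.12) and (1.15), and the function name is only illustrative.

```python
import numpy as np

def gradient_magnitude_orientation(f):
    """f: 2D gray-level image. Returns magnitude (Eqs. 1.18/1.19) and orientation (Eq. 1.3)."""
    # Central (symmetric) differences along rows (i) and columns (j)
    g_c, g_r = np.gradient(f.astype(float))
    mag = np.hypot(g_r, g_c)              # Euclidean magnitude, Eq. (1.18)
    mag_l1 = np.abs(g_r) + np.abs(g_c)    # cheaper approximation, Eq. (1.19)
    theta = np.arctan2(g_c, g_r)          # orientation (phase) map, Eq. (1.3)
    return mag, mag_l1, theta
```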
1.4 Roberts Operator

An alternative way to approximate the differentiation is the one suggested by Roberts, which actually implements a diagonal version of the asymmetric difference filters described above:

$$f_x(i, j) = f(i, j) - f(i+1, j+1)$$
$$f_y(i, j) = f(i+1, j) - f(i, j+1)$$

as shown in Fig. 1.10. The orientation of the edge with respect to the x axis results in

$$\theta(i, j) = \frac{\pi}{4} + \arctan\left(\frac{f_y}{f_x}\right) \quad (1.20)$$
Fig. 1.10 Approximation of the gradient filter by the Roberts operator
Fig. 1.11 Edge detection with the Roberts filter: a Original image; b result of applying the h_R mask; c result of applying the h_C convolution mask to extract the horizontal and vertical components; d final result obtained by applying the gradient magnitude (Eq. 1.18); e possible edge points extracted from the gradient image d where the magnitude value exceeds a threshold
The convolution masks in the two directions, horizontal and vertical, with Roberts approximation are
$$h_R = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \qquad h_C = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \quad (1.21)$$

The horizontal and vertical components g_R(i, j) and g_C(i, j) of the gradient are calculated at the interpolation point (i + 1/2, j + 1/2) with the usual convolution formulas g_R = f ⊛ h_R and g_C = f ⊛ h_C. Figure 1.11 shows the results of the Roberts filter obtained by applying the previous h_R and h_C masks and then calculating the gradient magnitude with Eq. (1.18). From a qualitative evaluation, one can notice, for the same image, a slight improvement compared to the filtering based on the symmetric differences of Fig. 1.8.
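The Roberts masks of Eq. (1.21) translate directly into a short routine; the sketch below assumes a nearest-border extension is acceptable and uses correlation so that the masks are applied exactly as printed (no kernel flipping).

```python
import numpy as np
from scipy.ndimage import correlate

def roberts(f, threshold=None):
    """Roberts cross operator: gradient magnitude, optionally thresholded (cf. Fig. 1.11e)."""
    f = f.astype(float)
    h_r = np.array([[1.0, 0.0], [0.0, -1.0]])
    h_c = np.array([[0.0, 1.0], [-1.0, 0.0]])
    g_r = correlate(f, h_r, mode='nearest')
    g_c = correlate(f, h_c, mode='nearest')
    mag = np.hypot(g_r, g_c)              # Eq. (1.18)
    if threshold is not None:
        return mag >= threshold           # candidate edge points
    return mag
```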
1.5 Gradient Image Thresholding

The gradient image obtained with any of the approximation methods described above can be modified to produce gradient images with different types of edges. Given the gradient image g(i, j), some modalities can be defined to check whether a pixel g(i, j) is an edge element or not. Let us analyze some of them:
(a) Modality 1:
$$g(i, j) = \begin{cases} \nabla f(i, j) & \text{if } |\nabla f(i, j)| \ge T \\ f(i, j) & \text{otherwise} \end{cases}$$
where T is a nonnegative threshold value. The chosen value of T determines whether small variations in intensity are to be classified as edges or to be considered homogeneous zones affected by noise.
(b) Modality 2:
$$g(i, j) = \begin{cases} I_T & \text{if } |\nabla f(i, j)| \ge T \\ f(i, j) & \text{otherwise} \end{cases}$$
where I_T is a gray value to be assigned to the pixels classified as edges by the threshold T as before.
(c) Modality 3:
$$g(i, j) = \begin{cases} \nabla f(i, j) & \text{if } |\nabla f(i, j)| \ge T \\ I_B & \text{otherwise} \end{cases}$$
where I_B is a gray value to be assigned to pixels not classified as edges by the threshold T (separation of edges from the background).
(d) Modality 4:
$$g(i, j) = \begin{cases} I_T & \text{if } |\nabla f(i, j)| \ge T \\ I_B & \text{otherwise} \end{cases}$$
obtaining in this case a binary image, with pixels of value I_T corresponding to the edges and all the others with value I_B representing the background.
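The four modalities can be expressed compactly as element-wise selections; the sketch below is illustrative, with grad the precomputed gradient magnitude image and T, I_T, I_B user-chosen values.

```python
import numpy as np

def threshold_gradient(grad, f, T, I_T=255, I_B=0, modality=4):
    """Apply one of the four thresholding modalities to a gradient magnitude image."""
    if modality == 1:
        return np.where(grad >= T, grad, f)    # gradient on edges, original image elsewhere
    if modality == 2:
        return np.where(grad >= T, I_T, f)     # fixed gray level on edges
    if modality == 3:
        return np.where(grad >= T, grad, I_B)  # gradient on edges, fixed background
    return np.where(grad >= T, I_T, I_B)       # modality 4: binary edge map
```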
1.6 Sobel Operator

The edge extraction operators described so far based on the gradient (gradient filter, Roberts filter) generate borders with a thickness of at least two pixels. Moreover, these operators are sensitive to noise due to small fluctuations of the intensity, and also to the approximate values of the gradient components estimated on a limited number of pixels. To mitigate these drawbacks, a more complex gradient operator is used, which simultaneously performs the differentiation with respect to a coordinate axis and calculates a local average in the orthogonal direction. The components of the gradient with the Sobel operator are calculated in the direction of the coordinate axes x and y, involving, for each pixel (i, j) being processed, the pixels in its neighborhood included in the 3 × 3 window, as shown in the following scheme:

A   B   C
H (i,j) D
G   F   E
The differentiation in the x and y directions is approximated as follows:

$$f_x(i, j) = (C + K D + E) - (A + K H + G) \quad (1.22)$$
$$f_y(i, j) = (A + K B + C) - (G + K F + E) \quad (1.23)$$

where the constant K is chosen equal to 2 and the pixels close to the one being processed (i, j) have an influence proportional to the corresponding weights indicated with A, B, ..., E. By choosing the larger 3 × 3 window to estimate the gradient, the smoothing effect increases compared to the operator with the 2 × 2 window, thus decreasing the operator's sensitivity to the fluctuations of the intensity values present in the image. The convolution masks for the Sobel operator are

$$h_R = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad h_C = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

It can be noticed that in the masks the sum of the weights is equal to zero. This results in no edge response in the presence of pixels with constant gray values. The weights in the masks are appropriate for modeling an ideal edge present in the direction of the coordinate axes. It is noted that the operator assigns a greater weight to the pixels that are closer to the pixel (i, j) being processed and is symmetric with respect to the latter. Finally, the row and column gradients can be normalized to ensure unitary weighted averages. The discrete convolutions for the horizontal and vertical components of the Sobel operator are

$$g_R(i, j) = f(i, j) \circledast h_R(i, j) = \sum_{l=-1}^{+1}\sum_{k=-1}^{+1} f(i+l, j+k)\, h_R(l, k)$$
$$g_C(i, j) = f(i, j) \circledast h_C(i, j) = \sum_{l=-1}^{+1}\sum_{k=-1}^{+1} f(i+l, j+k)\, h_C(l, k)$$
Recall that the 3 × 3 convolution mask overlaps its central element (0, 0) with the pixel (i, j) of the image being processed. The gradient magnitude can be computed using Eq. (1.18) or (1.19). Figure 1.12 shows the results of the Sobel filter for a test image that has curved dominant edges, more complex than the one used for the Roberts filter, which is characterized by straight dominant edges.
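A possible implementation of the Sobel components and of the magnitude and orientation computation is sketched below; the masks are those given above, and the border handling (nearest replication) is an assumption, not a prescription of the text.

```python
import numpy as np
from scipy.ndimage import correlate

def sobel_gradient(f):
    """Sobel operator: returns gradient magnitude (Eq. 1.18) and orientation (Eq. 1.3)."""
    f = f.astype(float)
    h_r = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
    h_c = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]], dtype=float)
    g_r = correlate(f, h_r, mode='nearest')
    g_c = correlate(f, h_c, mode='nearest')
    return np.hypot(g_r, g_c), np.arctan2(g_c, g_r)
```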
Fig. 1.12 Edge detection with the Sobel filter: a Original image; b result of applying the mask h_R; c result of applying the mask h_C to extract the horizontal and vertical components; d final result obtained by applying the gradient magnitude (Eq. 1.18); e edge points extracted from the gradient image d where the magnitude value exceeds a threshold
1.7 Prewitt Operator

It differs from the Sobel operator only by the value of the constant K, which is halved, i.e., set equal to 1. In this way, the convolution masks become

$$h_R = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \qquad h_C = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$$

Unlike the Sobel operator, the nearest pixels (north, south, east, west) to the pixel being processed do not have a greater weight but contribute to the estimation of the spatial gradient with the same weight as the others.
1.8 Frei & Chen Operator

This operator assigns the value √2 to the constant K, weighing the neighboring pixels homogeneously so that the value of the gradient is the same in the presence of vertical, horizontal, and oblique edges. This operator is a first approximation of the isotropic filter, that is, it determines the intensity changes without any preferential orientation (isotropic property).

$$h_R = \begin{bmatrix} -1 & 0 & 1 \\ -\sqrt{2} & 0 & \sqrt{2} \\ -1 & 0 & 1 \end{bmatrix} \qquad h_C = \begin{bmatrix} 1 & \sqrt{2} & 1 \\ 0 & 0 & 0 \\ -1 & -\sqrt{2} & -1 \end{bmatrix}$$
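Since the Prewitt, Sobel, and Frei & Chen operators differ only in the constant K of Eqs. (1.22)-(1.23), their masks can be generated by a single hypothetical helper, as sketched below.

```python
import numpy as np

def gradient_masks(K):
    """Build the K-parameterized 3x3 gradient masks (K=1 Prewitt, K=2 Sobel, K=sqrt(2) Frei & Chen)."""
    h_r = np.array([[-1, 0, 1],
                    [-K, 0, K],
                    [-1, 0, 1]], dtype=float)
    h_c = np.array([[ 1,  K,  1],
                    [ 0,  0,  0],
                    [-1, -K, -1]], dtype=float)
    return h_r, h_c

prewitt_masks = gradient_masks(1.0)
sobel_masks = gradient_masks(2.0)
frei_chen_masks = gradient_masks(np.sqrt(2.0))
```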
1.9 Comparison of LED Operators

As for all image processing algorithms, also for the evaluation of an operator specialized in extracting the edges, the following aspects should be considered: (a) how reliable the chosen physical-mathematical model is; (b) how efficient the computational model is; (c) how accurate the achieved result is. With regard to reliability, the choice of the gradient function to determine the local variability of the gray levels proves to be adequate. The ideal gradient function can be calculated with the differentiation of the input function. For digital images, the gradient function is approximated using gray-level differences involving a limited number of pixels. The two quantities that best give a quantitative measure of the local discontinuities are the magnitude and the direction of the gradient. Discontinuities in the image are not always associated with intrinsic image information, as they are often accidentally introduced by the digitization process and the instability of the image acquisition sensors.
Discontinuities associated with noise can be attenuated, before the edge enhancement process, with a smoothing process, to prevent the formation of spurious edges, point artifacts, and stretches of curved lines. The computational model for the extraction of the edges, therefore, includes two contrasting activities: on one side, the attenuation of the noise and, on the other, the exaltation of the edges. The effectiveness of the operators described above, based on asymmetric and symmetric differences, Roberts, Sobel, Prewitt, and Frei & Chen, depends on how these operators balance the contrasting effects of smoothing and edge enhancement. A way to control this balance is to choose an adequate threshold value T for the estimated gradient value. The accuracy of the results mainly relates to the level of approximation of the gradient and also depends on the size of the convolution mask. Most of the operators treated are not able to define and locate an edge with one-pixel resolution. Not all real edges are extracted; probably, many are filtered out by the smoothing component (necessary to attenuate the noise) of the edge extraction operator. Conversely, not all extracted edges are real, since many are caused by the sharpening effect, which mistakes intensity variations due to noise for edges present in the image. All the operators described are implemented with a single computational model, namely the discrete convolution; except for the Roberts filter, masks of size 3 × 3 and larger are used. In the examples shown, the smoothing operation is performed with a Gaussian filter using a 7 × 7 mask. The values of the gradient thresholds are reported from time to time. Figure 1.13 shows the comparison of the results of the Roberts, Sobel, Prewitt, and Frei & Chen operators applied to two types of test images, one with dominant straight horizontal and vertical edges and the other with curved edges. The second and fourth rows show the results of filtering on the same test images with 0.01% additive Gaussian noise. The binary image of the edges obtained from the gradient image of each filter is displayed by applying a threshold T = 0.5, having normalized the gradient between 0 and 1. It is possible to evaluate certain limits of the operators on the image with dominant curved edges and, in particular, the sensitivity to noise, which causes the production of false edges. Figure 1.14 shows the comparison between the same operators on the same test images, but in this case the added noise is of the salt and pepper type at 0.1%. For this type of noise, the operators are inadequate, since they do not possess sufficient intrinsic capacity to attenuate it. This problem can be mitigated by pre-filtering the image with a smoothing filter. The operators of Sobel, Prewitt, and Frei & Chen provide better results for determining the edges than difference-based operators, such as the Roberts operator, which calculates the average gray levels on a smaller number of pixels. The limits of the gradient operators described above can be partially overcome by increasing the number of pixels involved in the gradient estimation and thus improving the smoothing function.
Fig. 1.13 Application of the operators of Roberts, Sobel, Prewitt, and Frei & Chen on two types of test images, one with horizontal and vertical straight dominant edges and the other with curved edges. The second and fourth rows show the filtering results on the same test images with Gaussian noise at 0.01%
As a first example, the Prewitt operator can be adapted with a 7 × 7 convolution mask, modified as follows:

$$h_R = \frac{1}{21}\begin{bmatrix} -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \end{bmatrix}$$

As a second example, the truncated-pyramid operator can be considered, which weighs the importance of pixels close to the pixel being processed in a linearly decreasing way. The convolution mask for calculating the horizontal gradient component
Fig. 1.14 Application of the operators of Roberts, Sobel, Prewitt, and Frei & Chen on two types of test images, one with horizontal and vertical straight dominant edges and the other with curved edges. In the second and fourth rows, the results of the filtering are shown on the same test images with salt and pepper noise at 0.1%
is given by

$$h_R = \frac{1}{34}\begin{bmatrix} -1 & -1 & -1 & 0 & 1 & 1 & 1 \\ -1 & -2 & -2 & 0 & 2 & 2 & 1 \\ -1 & -2 & -3 & 0 & 3 & 2 & 1 \\ -1 & -2 & -3 & 0 & 3 & 2 & 1 \\ -1 & -2 & -3 & 0 & 3 & 2 & 1 \\ -1 & -2 & -2 & 0 & 2 & 2 & 1 \\ -1 & -1 & -1 & 0 & 1 & 1 & 1 \end{bmatrix}$$
To obtain, more generally, a larger convolution mask for any gradient operator, the 3 × 3 masks of the operators described above can be convolved with the mask of a smoothing operator. The combined convolution mask results as follows:

$$h(i, j) = h_{GRAD}(i, j) \circledast h_{SMOOTH}(i, j) \quad (1.24)$$

where h_GRAD indicates one of the N × N convolution masks of the gradient operator and h_SMOOTH indicates the convolution mask of a smoothing operator. If we consider the h_R(i, j) mask of the Prewitt operator and h_SMOOTH(i, j) as the
3 × 3 average smoothing filter, we obtain the combined convolution mask of the 5 × 5 gradient operator in the following form:

$$h_R = \frac{1}{3}\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \circledast \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} = \frac{1}{18}\begin{bmatrix} -1 & -1 & 0 & 1 & 1 \\ -2 & -2 & 0 & 2 & 2 \\ -3 & -3 & 0 & 3 & 3 \\ -2 & -2 & 0 & 2 & 2 \\ -1 & -1 & 0 & 1 & 1 \end{bmatrix}$$

Similarly, the mask of the vertical gradient component h_C is obtained (a code sketch of this combination is given after the discussion of Fig. 1.15 below). The Sobel and Prewitt filters described above can be obtained by applying Eq. (1.24), combining the 1D average smoothing filter and the 1D symmetric gradient filters (Eqs. 1.12 and 1.15). The Prewitt filters are obtained by applying Eq. (1.24) as follows:

$$h_R = \frac{1}{3}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \qquad h_C = \frac{1}{3}\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$$

The Sobel filters are obtained by combining the approximate 1D Gaussian smoothing filter with the 1D symmetric gradient filter:

$$h_R = \frac{1}{4}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix} = \frac{1}{8}\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad h_C = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \end{bmatrix} \circledast \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} = \frac{1}{8}\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

In Fig. 1.15, the results of the Sobel and Prewitt filters (in the second and third columns, respectively) are summarized for comparison with the results obtained by applying the 7 × 7 truncated-pyramid filter (the fourth column shows the gradient images and the fifth column the binary images of the edges) and the Sobel filter applied after a Gaussian smoothing filter of size 5 × 5. As expected, increasing the size of the filter improves its ability to attenuate noise and detects edges more efficiently, decreasing in particular the edge artifacts. The Sobel filter with Gaussian pre-filtering (sixth column) also improves, in the sense that the edge artifacts diminish. Compared to the classic Sobel and Prewitt filters, there is an increase in the discontinuity of the detected edges due to the more effective action of the Gaussian smoothing. In the fourth column are shown the gradient images for the 7 × 7 filter to highlight
Fig. 1.15 Application of the truncated pyramid filter of 7 × 7 and of the 3 × 3 size Sobel filter with Gaussian pre-filtering of 5 × 5 on the same images of previous tests and with the same type of noise. The second and third columns show, by comparison, the same results as in Figs. 1.13 and 1.14 relating to the normal Sobel and Prewitt filters. In the fourth and fifth columns are the images of the gradient and the binary images of the edges detected with the truncated pyramid filter. The sixth column shows the binary images of the edges related to the Sobel filter with Gaussian pre-filtering to further attenuate the noise
the edge artifacts generated especially for images with salt and pepper noise, which is not attenuated significantly by the Gaussian pre-filtering. This explains the poor results achieved (see the fourth and sixth rows) with the filters used, which, with their different potentialities (best results in the fifth and sixth columns with the 7 × 7 and 5 × 5 filters), have attenuated only the Gaussian noise. It is known that salt and pepper noise can instead be attenuated with the median filter.
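The combination of Eq. (1.24) can be reproduced numerically; the sketch below convolves the (normalized) 3 × 3 Prewitt mask with the 3 × 3 box average, yielding a 5 × 5 mask with the same structure as the one shown above (up to the normalization convention adopted in the text). Variable names are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 gradient mask (Prewitt, horizontal component) and 3x3 average smoothing mask
h_grad = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float) / 3.0
h_smooth = np.ones((3, 3)) / 9.0

# Eq. (1.24): full 2D convolution of the two masks gives the combined 5x5 gradient mask
h_combined = convolve2d(h_grad, h_smooth)   # shape (5, 5)
```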
The filters used so far operate in the dominant directions along rows and columns. In the following paragraphs, directional filters, also based on the Sobel and Prewitt filters, will be derived for the detection of edges in preferential directions (diagonals with different angles). For the reasons mentioned above, other compound operators will be introduced to better integrate the two contrasting functions of smoothing and edge enhancement.
1.10 Directional Gradient Operator

All the operators described above determine the edges by first calculating the orthogonal components of the gradient in the horizontal and vertical directions; the gradients are then estimated by appropriately adding these components. In several applications, it is convenient to calculate the edges for a defined number of directions. This can be accomplished by the convolution of the input image f(i, j) with different masks h_k whose weights model the impulse response of the directional gradient. This type of edge operator, based on convolution with a set of masks each sensitive to a given edge orientation, is known as a compass edge detector. The directional gradient is given by the following convolution:

$$g_k(i, j) = f(i, j) \circledast h_k \quad (1.25)$$

where h_k is the impulse response of the gradient in the direction θ_k = π/2 + kπ/4, with k that can assume values from 0 (to indicate the north direction, θ_0 = π/2) to 7. The directional gradient operator for each pixel is defined as

$$g(i, j) = \max_k |g_k(i, j)| \qquad k = 0, \ldots, 7 \quad (1.26)$$
which indicates the inclination of the edge in the direction of the maximum value of the directional gradients calculated in the eight directions. The convolution masks corresponding to the k directions are obtained starting with k = 0 and circularly rotating by π/4 the external elements of the 3 × 3 mask. Each of the resulting masks is sensitive to an edge orientation ranging from 0° to 315° in steps of 45°. In Fig. 1.16 are represented all the masks for the compass Prewitt operator, Kirsch, a derivation of Sobel (Robinson level-3), and Robinson level-5 (derived from the Prewitt operator). The sum of the weights of each mask is zero. From this it follows that in the homogeneous areas of the image, the directional gradient operator will always give zero. The directional gradients will have their maximum value when the gray-level configuration of the pixels best agrees with the directional models represented by the masks. According to Eq. (1.26), for each pixel the maximum response |g(i, j)| is the value of the corresponding pixel in the output magnitude image (that is, the one corresponding to the convolution, among the 8 masks, that gave the highest value). In other words, the edge magnitude and orientation of each pixel are determined by the template that best matches the local area of the pixel. The compass edge detector is an alternative method to characterize an edge through the magnitude
Fig. 1.16 Masks 3 × 3 for the extraction of directional gradients
and orientation. For gradient-based operators, it is necessary to first calculate the horizontal and vertical components of the gradient (2 masks are required) and then estimate the magnitude and the edge direction, while with compass edge operators, the edge magnitude and orientation are obtained directly from the mask with the maximum response (at least 8 masks with an angular resolution of 45°), which corresponds to the best match of the edge mask model.
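A sketch of the compass detector follows, assuming a user-supplied 3 × 3 prototype mask h0 (for example, the north-direction mask of one of the operators in Fig. 1.16); the outer ring is rotated in 45° steps and the maximum absolute response is kept, as in Eqs. (1.25)-(1.26). The ring ordering and helper names are assumptions.

```python
import numpy as np
from scipy.ndimage import correlate

# Positions of the outer ring of a 3x3 mask, listed in circular order
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def compass_masks(h0):
    """Build the 8 directional masks by rotating the outer ring of h0 by 45-degree steps."""
    ring_values = [h0[r, c] for r, c in RING]
    masks = []
    for k in range(8):
        hk = h0.astype(float).copy()
        for (r, c), v in zip(RING, np.roll(ring_values, k)):
            hk[r, c] = v
        masks.append(hk)
    return masks

def compass_gradient(f, h0):
    responses = np.stack([correlate(f.astype(float), hk, mode='nearest')
                          for hk in compass_masks(h0)])
    g = np.max(np.abs(responses), axis=0)          # Eq. (1.26): edge magnitude
    k_best = np.argmax(np.abs(responses), axis=0)  # index k of the winning mask (theta_k = 90 + k*45 deg)
    return g, k_best
```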
Other compass edge operators can be defined with rotations, for example, of π/6 (equal to 30° of angular resolution) with masks of size 5 × 5. With even bigger masks, we can calculate the directions of the edges more precisely and with a good reduction of the noise, but at the cost of considerable calculation time.
1.11 Gaussian Derivative Operator (DroG)

In the preceding paragraphs, we have already highlighted the problem of differentiation, which degrades the signal-to-noise ratio. With the smoothing operators described (see Sect. 9.12.6 Vol. I) based on Gaussian filtering, the noise is attenuated. The DroG operator has the function of combining the noise reduction activity in the image with the execution of the derivative with respect to the coordinate axes for edge detection. Remembering the associative property of the convolution operator, the gradient operator g_DroG, based on the Gaussian derivative, is obtained by first applying the Gaussian smoothing operator h_G and then differentiating with the h_D filter. The DroG operator in the horizontal component (indicated with r) is thus obtained:

$$g_r^{DroG} = [f(i, j) \circledast h_G^r(i, j)] \circledast h_D^r(i, j) = f(i, j) \circledast [h_G^r(i, j) \circledast h_D^r(i, j)] \quad (1.27)$$

The convolution indicated in square brackets in the last expression is the impulse response of the DroG operator, which combines the smoothing and derivative filters. By substituting the Gaussian function and differentiating, the DroG filter along the horizontal (r) and vertical (c) axes is

$$h_r^{DroG}(i, j) = \frac{\partial}{\partial i}\, e^{-\frac{i^2+j^2}{2\sigma^2}} = -\frac{i}{\sigma^2}\, e^{-\frac{i^2+j^2}{2\sigma^2}} \qquad h_c^{DroG}(i, j) = \frac{\partial}{\partial j}\, e^{-\frac{i^2+j^2}{2\sigma^2}} = -\frac{j}{\sigma^2}\, e^{-\frac{i^2+j^2}{2\sigma^2}} \quad (1.28)$$

with σ the standard deviation; h_r^DroG and h_c^DroG indicate the combined impulse responses (smoothing and gradient) of the DroG filter. In addition, the DroG operator provides the best compromise between noise attenuation (smoothing filter) and edge enhancement (derivative filter); a discrete sketch is given at the end of this section. In Sect. 9.12.6 Vol. I
Fig. 1.17 3D graphical representation of the DroG filter relative to both the x and y axes, and display of the 1D profile relative to the x axis
Fig. 1.18 Application of the DroG filter of size 3 × 3 and 5 × 5 and comparison with the results of the 3 × 3 Sobel and Prewitt filter on the same images of previous tests and with the same type of noise. In the second and third columns, the same results of Figs. 1.13 and 1.14 relative to the normal Sobel and Prewitt filters are shown for comparison. The fourth column shows the results of the 7 × 7 pyramid-truncated filter (fifth column of Fig. 1.15). The fifth and sixth columns show binary images of the edges relative to the DroG filter with dimensions of 3 × 3 and 5 × 5, respectively
the Gaussian smoothing filter was described in detail, the characteristics of the Gaussian function (circular symmetry, separability) were analyzed, along with how to design a discrete impulse response based on the Gaussian function. Similarly, a discrete convolution mask can be designed for the DroG filter, given by Eq. (1.28), whose size is characterized by the standard deviation σ. Figure 1.17 shows the 3D impulse response of the DroG filter for both axes and a 1D profile for the x axis. Figure 1.18 compares the results obtained by the DroG filter with the Sobel, Prewitt, and truncated-pyramid filters. As expected, almost equivalent results are obtained between the 5 × 5 DroG filter and the 7 × 7 filter (fifth column of Fig. 1.15). Overall, the DroG filter (even
in 3 × 3 dimensions) better solves the problem of both the Gaussian and salt and pepper noise.
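A discrete sketch of the DroG filter of Eq. (1.28) is given below; the kernel half-width of about 3σ and the border handling are assumptions, not prescriptions of the text.

```python
import numpy as np
from scipy.ndimage import convolve

def drog_kernels(sigma):
    """Sample the derivative-of-Gaussian kernels of Eq. (1.28) on a (2*3*sigma+1)^2 grid."""
    half = int(np.ceil(3 * sigma))
    j, i = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    g = np.exp(-(i**2 + j**2) / (2 * sigma**2))
    h_r = -(i / sigma**2) * g     # derivative along the row index i
    h_c = -(j / sigma**2) * g     # derivative along the column index j
    return h_r, h_c

def drog_magnitude(f, sigma=1.0):
    h_r, h_c = drog_kernels(sigma)
    g_r = convolve(f.astype(float), h_r, mode='nearest')
    g_c = convolve(f.astype(float), h_c, mode='nearest')
    return np.hypot(g_r, g_c)
```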
1.12 Laplacian Operator

We now describe an alternative approach to the gradient, based on the second derivative, to automatically model and detect the edges in the image. The linear edge detection operators described so far are based on an approximation of the first derivative to model a zone of discontinuity (ramp) of the intensity levels of an image. With this approach, in the continuous case, this discontinuity is localized at the maximum value of the first derivative (see Fig. 1.19) for increasing local variation of the intensity levels and, conversely, at the minimum for decreasing values. With the second derivative, as shown in Fig. 1.19, the latter assumes zero value, crossing the x axis at the discontinuity. Therefore, while for the first derivative the edge is localized at its maximum or minimum value, for the second derivative it is localized at the zero, or rather at the passage through zero, with a positive sign in the increasing part of the ramp and a negative sign in the descending part. It also highlights that a linear operator based on second-order differentiation further accentuates the edges and turns out to be isotropic. For this purpose, the Laplacian operator is used, which for a continuous two-dimensional function f(x, y) is defined as

$$\nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \quad (1.29)$$
which represents the sum of the second derivatives of a function calculated in the directions of the coordinate axes x and y. Figure 1.19 schematizes a real edge situation compared with an ideal step edge. It is also noted that, while in the gradient image an edge can be represented by multiple pixels (at the peak of the gradient) and a threshold is normally used to locate it, the thickness of the edges found in the Laplacian image is typically 1 pixel. In fact, in the Laplacian image, the edge is well localized and coincident with the point where the second derivative passes through zero (not to be confused with the constant trend equal to zero that occurs when f is constant or when f varies linearly). This suggests the idea of applying the Laplacian operator to an image f(i, j) to easily locate the edge pixels, i.e., the pixels where the rising or falling side of ∇²f passes through zero (from positive to negative and vice versa). This transition of ∇²f through zero is also called zero crossing and corresponds to the edge pixel. The Laplacian operator, being based on the second derivative, is more sensitive to noise than the other operators; it also does not provide edge direction information and can generate double and tendentially closed edges. In the discrete spatial domain, the Laplacian operator can be approximated by considering the differences of the first derivatives in the directions of the x and y
Fig. 1.19 1D representation of the relation between the function f(x), which represents the intensity of the image, its first derivative f'(x), and its second derivative f''(x); two gray-level transitions are schematized, a positive ramp (transition from dark to light area) and a negative ramp (reverse passage from light to dark). The point of maximum (or minimum) in the first derivative coincides with the point of zero crossing in the second derivative, that is, the passage through zero, where the edge point is located
axes. The horizontal component of the Laplacian operator can be approximated as follows:

$$\frac{\partial^2 f}{\partial x^2} = \frac{\partial f_x}{\partial x} = \frac{\partial}{\partial x}\big(f(x, y+1) - f(x, y)\big) = \frac{\partial f(x, y+1)}{\partial x} - \frac{\partial f(x, y)}{\partial x} \approx \big[f(x, y+2) - f(x, y+1)\big] - \big[f(x, y+1) - f(x, y)\big] = f(x, y+2) - 2f(x, y+1) + f(x, y) \quad (1.30)$$

using the forward difference. With a translation of one pixel horizontally, obtained by substituting y with y−1, we calculate the approximate value of the Laplacian at the point (x, y), defined by

$$\frac{\partial^2 f}{\partial x^2} \approx f(x, y+1) - 2f(x, y) + f(x, y-1) \quad (1.31)$$

Operating in a similar way for the vertical component, we obtain:

$$\frac{\partial^2 f}{\partial y^2} \approx f(x+1, y) - 2f(x, y) + f(x-1, y) \quad (1.32)$$

It follows that Eq. (1.29) of the Laplacian is thus approximated:

$$\nabla^2 f(x, y) \approx f(x+1, y) + f(x-1, y) - 4f(x, y) + f(x, y+1) + f(x, y-1) \quad (1.33)$$
These approximations lead to the corresponding horizontal and vertical convolution masks:

$$h_R = \begin{bmatrix} 1 & -2 & 1 \end{bmatrix} \qquad h_C = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix} \quad (1.34)$$

From the combination of the horizontal and vertical components, we obtain the 2D convolution mask of the Laplacian, known as the Laplace operator:

$$h(i, j) = \begin{bmatrix} 0 & 0 & 0 \\ 1 & -2 & 1 \\ 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 0 \\ 0 & -2 & 0 \\ 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad (1.35)$$

which can be implemented as a simple linear filter:

$$\nabla^2 f(x, y) = f(x, y) \circledast h(x, y) \quad (1.36)$$

or with two separate convolutions with respect to the horizontal and vertical axes using the 1D masks (see Eq. 1.34):

$$\nabla^2 f(x, y) = f(x, y) \circledast h_R + f(x, y) \circledast h_C \quad (1.37)$$
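The Laplace mask of Eq. (1.35) can be applied as a single linear filter, as in the following sketch; subtracting the result from the original image gives the sharpened image of Eq. (1.40) discussed below.

```python
import numpy as np
from scipy.ndimage import convolve

laplace_mask = np.array([[0,  1, 0],
                         [1, -4, 1],
                         [0,  1, 0]], dtype=float)

def laplacian(f):
    """Discrete Laplacian of Eq. (1.36); the mask is symmetric, so no kernel flipping issues arise."""
    return convolve(f.astype(float), laplace_mask, mode='nearest')

# Example: sharpened image as in Eq. (1.40): g = f - laplacian(f)
```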
The effects of the Laplacian approximation with the forward differences can be better analyzed in the frequency domain. In this domain, the Laplacian transfer function is given by

$$\mathcal{F}\{\nabla^2 f(x, y)\} = -4\pi^2(u^2 + v^2)F(u, v) \quad (1.38)$$

which amounts to the multiplication of the spectrum F(u, v) by a factor proportional to the frequencies (u² + v²). This leads to an accentuation of the high spatial frequencies and cancels out any constant component of zero frequency. The approximation of the Laplacian by differences leads to a slight deviation of the transfer function from the ideal conditions of second-order partial derivatives, with a consequent deviation from the isotropic characteristics, especially at high frequencies (see Fig. 1.20). To mitigate this drawback, variants of the Laplacian filter with a better-approximated impulse response were introduced, using the 8 pixels adjacent to the pixel being processed to calculate the differences along the rows and columns. Some convolution masks reported in the literature are given below:

$$h_1 = \frac{1}{12}\begin{bmatrix} 1 & 2 & 1 \\ 2 & -12 & 2 \\ 1 & 2 & 1 \end{bmatrix} \quad h_2 = \frac{1}{20}\begin{bmatrix} 1 & 4 & 1 \\ 4 & -20 & 4 \\ 1 & 4 & 1 \end{bmatrix} \quad h_3 = \frac{1}{8}\begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad h_4 = \frac{1}{8}\begin{bmatrix} -2 & 1 & -2 \\ 1 & 4 & 1 \\ -2 & 1 & -2 \end{bmatrix} \quad h_5 = \frac{1}{4}\begin{bmatrix} -1 & 0 & -1 \\ 0 & 4 & 0 \\ -1 & 0 & -1 \end{bmatrix} \quad (1.39)$$
The Laplacian filter has been applied to the same test images used for the gradient-based edge detection filters (see Fig. 1.21). In particular, the images were filtered through Eq. (1.36) with the mask h(i, j) given by (1.35) and with the filters using
Fig. 1.20 2D representation of the Laplacian filter transfer function approximated to the differences (filter mask h(i, j), (1.35))
the 8-neighbor masks h_3, h_4, and h_5. The masks can also be used with inverted signs without modifying the results. It is noted that, compared to the gradient-based filters, with Laplacian filters the filtered image contains only the edges and discontinuities, losing the background of the original image. The Laplacian image can be subtracted from the original image to obtain a final image g(x, y) that better visualizes the background of the original image overlaid with the image of the edges (see Fig. 1.21l):

$$g(x, y) = f(x, y) - \nabla^2 f(x, y) \quad (1.40)$$
The localization of the edges is influenced by the noise and by the local variation of the intensity values of the image. In the presence of a step change, the following occurs:

f:  4 4 4 4 9 9 9 9        ∇²f:  0 0 0 5 −5 0 0 0
f:  4 4 4 4 9 9 9 9   ⇒    ∇²f:  0 0 0 5 −5 0 0 0
f:  4 4 4 4 9 9 9 9        ∇²f:  0 0 0 5 −5 0 0 0

which highlights how the edge lies halfway between the two adjacent pixels [· · · 5 | −5 · · · ]. In this case, the edge location can be calculated with sub-pixel resolution by interpolating between the left and right pixels of the zero crossing. In the presence of a ramp change (as shown in Fig. 1.19), the following occurs:

f:  1 1 1 3 5 5 5        ∇²f:  0 0 2 0 −2 0 0
f:  1 1 1 3 5 5 5   ⇒    ∇²f:  0 0 2 0 −2 0 0
f:  1 1 1 3 5 5 5        ∇²f:  0 0 2 0 −2 0 0

In this case, the edge is located at the zero crossing pixel. From the examples, we can deduce that the zero crossings of the Laplacian operator are always located at the pixel level but are often identified as a transition between positive and negative pixels and vice versa.
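A simple zero-crossing test consistent with the examples above can be sketched as follows; the threshold T on the jump amplitude is an assumption used to discard weak transitions.

```python
import numpy as np

def zero_crossings(lap, T=0.0):
    """Mark pixels where a horizontally or vertically adjacent pair of the Laplacian image changes sign."""
    zc = np.zeros(lap.shape, dtype=bool)
    # sign change between (i, j) and (i, j+1)
    h = (np.sign(lap[:, :-1]) * np.sign(lap[:, 1:]) < 0) & (np.abs(lap[:, :-1] - lap[:, 1:]) > T)
    # sign change between (i, j) and (i+1, j)
    v = (np.sign(lap[:-1, :]) * np.sign(lap[1:, :]) < 0) & (np.abs(lap[:-1, :] - lap[1:, :]) > T)
    zc[:, :-1] |= h
    zc[:-1, :] |= v
    return zc
```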
Fig. 1.21 Edge detection with the Laplacian filter: a and f test images; the Laplacian filters h (see mask 1.35), h_3, h_4, and h_5 (see masks 1.39) are applied to the test images; l image obtained by subtracting from the original image a the Laplacian image b, using Eq. 1.40
1.13 Laplacian of Gaussian (LoG)

The detection of the edges with the Laplacian filter is inefficient considering the strong sensitivity of this filter to noise. Marr and Hildreth [7,8,11] have defined an operator for the extraction of the edges called the Laplacian of Gaussian (LoG), combining the effect of the Gaussian filter (to attenuate the noise) with that of the Laplacian operator (to enhance the edges). In the continuous case, if f(x, y) indicates the input image and h(x, y) the Gaussian function, the Laplacian of Gaussian operator g(x, y) is obtained as follows:

$$g(x, y) = \nabla^2\{h(x, y) \circledast f(x, y)\} \quad (1.41)$$

which points out the combined operation of the Laplacian applied to the result of the convolution between the image f and the impulse response of the Gaussian filter h, where the Gaussian filter is given by

$$h(x, y) = e^{-\frac{x^2+y^2}{2\sigma^2}} \quad (1.42)$$

with the standard deviation σ which controls the smoothing level. To simplify, the normalization factor 1/(σ√(2π)) has been omitted. Applying the derivative rules for the convolution (linear operator), Eq. (1.41) can be written in the form:

$$g(x, y) = \{\nabla^2 h(x, y)\} \circledast f(x, y) = h_{LoG} \circledast f(x, y) \quad (1.43)$$

where h_LoG is the impulse response of the Laplacian of Gaussian. Equation (1.43) is motivated by the derivative property of the convolution as follows:

$$\frac{d}{dx}[h(x) \circledast f(x)] = \frac{d}{dx}\int f(\tau)h(x-\tau)\,d\tau = \int f(\tau)\frac{d}{dx}h(x-\tau)\,d\tau = f(x) \circledast \frac{d}{dx}h(x) \quad (1.44)$$
This justifies calculating first the h_LoG filter, the Laplacian of the Gaussian, and then performing the convolution with the image f(x, y) according to Eq. (1.43). The h_LoG filter is obtained by differentiating the Gaussian function (Eq. 1.42) as follows:

$$\frac{\partial}{\partial x} h_\sigma(x, y) = \frac{\partial}{\partial x}\, e^{-\frac{x^2+y^2}{2\sigma^2}} = -\frac{x}{\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \quad (1.45)$$

and the second partial derivative is

$$\frac{\partial^2}{\partial x^2} h_\sigma(x, y) = \frac{x^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}} - \frac{1}{\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}} = \frac{x^2 - \sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \quad (1.46)$$

Similarly, for the y component we get:

$$\frac{\partial^2}{\partial y^2} h_\sigma(x, y) = \frac{y^2 - \sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \quad (1.47)$$

This means that the impulse response of the LoG operator is defined as follows:

$$h_{LoG}(x, y) = \nabla^2 h_\sigma(x, y) = \frac{\partial^2}{\partial x^2} h_\sigma(x, y) + \frac{\partial^2}{\partial y^2} h_\sigma(x, y) = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \quad (1.48)$$
whose graphical representation, in the shape of an inverted Mexican hat, is shown together with the transfer function in Fig. 1.22. From the 2D transfer function of the LoG filter, a deviation from the isotropy characteristic toward the high frequencies is noticed. The LoG filter, for the smoothing action performed by the Gaussian component (central lobe), behaves like a low-pass filter, followed by the action of the second-order gradient component (Laplacian, lateral lobes), which behaves like a high-pass filter. Therefore, the two combined actions of the LoG filter correspond to those of a band-pass filter, as evidenced by the graph of its transfer function (Fig. 1.22). The essential characteristics of the LoG filter (Eq. 1.43) are summarized as follows:

1. The Gaussian filter, being a low-pass filter with a Gaussian trend, attenuates the noise present in the image and produces an impulse response still of the Gaussian type (see Sect. 9.13 Vol. I). It is known that this operator is realized by carrying out the convolution of the image f with the Gaussian filter h, with the effects controlled by the standard deviation σ.
2. The Laplacian operator, being a filter that accentuates the high frequencies, has the effect of improving the sharpness of the edges.
3. The overall effect of the h_LoG filter is to combine the effects of the Gaussian and Laplacian filters by approximating, in the discrete, the weights of the LoG convolution mask. Figure 1.23 shows the 1D profiles of the Gaussian smoothing filter, of the filter of the first derivative of the Gaussian, and of the LoG filter, all obtained with σ = 1. In particular, applying the
Fig. 1.22 2D and 1D LoG filter: a in the spatial domain and b in the frequency domain
LoG operator in regions of the image with a homogeneous gradient will give zero for any value of the gradient, assuming that the sum of the weights of the LoG mask is zero. In the presence of a discontinuity, the operator produces increasing values when the h_LoG mask reaches the first end of the discontinuity of f. Subsequently, the operator reverses the trend and begins to decrease until it becomes zero when the mask is centered on the point of greatest discontinuity, which corresponds to the peak of the first derivative of f. Continuing the convolution process, the operator reproduces symmetrical results with reversed sign (see Fig. 1.19). The edge point is determined by locating the zero crossing of the operator's response within a zone of the image (not to be confused with the zero-value responses due to homogeneous image zones). 4. The localization of an edge can be determined with sub-pixel resolution by linear interpolation. The edge position is not affected by the variance σ². For images with little noise, the convolution mask can affect only the central pixel and a few pixels in its vicinity. In the presence of noise, the component of the smoothing filter (graphically represented by the central lobe of the filter) must have a more extensive influence. This is obtained by considering a larger σ, and consequently the area of interest close to the zero crossing becomes
Fig. 1.23 1D impulse responses of the Gaussian smoothing filter, of the first derivative of the Gaussian (DroG), and of the Laplacian of Gaussian (LoG) in the spatial domain for σ = 1; the first zero crossing of h_LoG occurs at r_zero = σ√2
more extended, without overlapping the area of influence of another zero crossing, that is, of another edge. With an extended convolution mask, the operator's response produces leveled values but with an attenuation of the amplitude and gradient in the region where the zero crossing will be sought. 5. The determination of the zero crossing points in the g_LoG image is done by analyzing each pixel with its neighbors included in a 3 × 3 window, checking whether two neighboring pixels in opposite directions have opposite signs and whether their difference is above a certain threshold (to be defined in relation to the application context). A positive value of the LoG implies a possible edge point on the dark side of the contour; a negative value concerns the light side of the contour. This was the approach suggested by the authors of the LoG filter. A compromise must be reached in choosing the size of the mask, remaining within the limits of the size of the edge transition zone. Normally, this operator is applied with different mask size values. From the various responses of the operator, for different values of σ, it is possible to extract the edges that are of real interest in the image. A technique that is normally used consists in localizing with good approximation the edges with very small filtering masks, to detect the contours of small geometric details present in the image, but with the risk of highlighting spurious details due to noise (small convolution masks mean that the value of σ is small, limiting the smoothing). Alternatively, we can use filters with larger masks (with a larger scale factor), obtaining in this way more reliable results on the authenticity of the edges but with a poorer approximation in the localization of the edges (for very large σ, the edges are shifted and, in some cases, nearby edges merge into a single edge). In relation to the type of application, a good strategy is to apply the LoG filter with different values of σ (see
Fig. 1.24 Edge detection with the LoG filter applied to the same image with different values of σ, thus obtaining a sequence of filtered images with details at various scales: σ = 0.5, 1, 3, 5
Fig. 1.24) and analyze the images obtained, starting from those with a higher value of σ (with edges more robust to noise although with poor spatial resolution) and then continuing the analysis on the other filtered images with a smaller value of σ (i.e., with edges with better spatial resolution). This approach is also known as coarse to fine. Returning to the Marr–Hildreth filter h_LoG, introducing the variable r² = x² + y², where r measures the distance from the origin (circular symmetry characteristic of the filter), the expression (1.48) of the filter becomes

$$h_{LoG}(r) = \frac{r^2 - 2\sigma^2}{\sigma^4}\, e^{-\frac{r^2}{2\sigma^2}} = c\left(\frac{r^2}{2\sigma^2} - 1\right) e^{-\frac{r^2}{2\sigma^2}} \quad (1.49)$$

where c = 2/σ² is a normalization constant. As in the Gaussian filter, also for h_LoG the parameter σ controls the dimensions of the filter, and the points closest to the origin of this filter where the curve crosses zero (for the first time) correspond to

$$\frac{r^2}{2\sigma^2} = 1 \quad (1.50)$$

that is, when

$$r_{zero} = \pm\sigma\sqrt{2} \quad (1.51)$$

From the graph of the filter in Fig. 1.22a, it is observed that the curve assumes its minimum value −c = −2/σ² at the origin, i.e., at r = 0. For r = ±σ, we have that the
Fig. 1.25 Edge detection with the LoG filter. Two convolution masks, respectively, of 5 × 5 dimensions were used with σ = 0.5 and dimensions 9 × 9 with σ = 1; these filters were applied to the original image with Gaussian noise (σ = 0.01) and 1% salt and pepper noise
function assumes the value −½ c · e^(−1/2). Basically, σ controls the influence of the pixels that are at a distance kσ from the pixel being processed. With good approximation, it can be considered that the filter has no influence on pixels farther than 3 r_zero from the origin, where h_LoG is assumed to become definitively null. To also limit the effects of truncation, the h_LoG convolution mask can be discretized in an L × L window with the minimum L value defined by

$$L = 2 \times 3\, r_{zero} = 2 \times 3\sqrt{2}\,\sigma$$

A 5 × 5 LoG mask is shown below:

$$h_{log5} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 2 & 1 & 0 \\ 1 & 2 & -16 & 2 & 1 \\ 0 & 1 & 2 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \quad (1.52)$$
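A sampled LoG mask can also be generated directly from Eq. (1.49), as in the following sketch; the window half-width of 3√2 σ follows the L value given above, and the subtraction of the mean is an assumption used to enforce a zero-sum mask after truncation. The resulting mask can be convolved with the image and the result passed to a zero-crossing test such as the one sketched earlier.

```python
import numpy as np

def log_kernel(sigma):
    """Sample h_LoG of Eq. (1.49) on an L x L grid with L ~ 2 * 3*sqrt(2)*sigma."""
    half = int(np.ceil(3 * np.sqrt(2) * sigma))
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    r2 = x**2 + y**2
    h = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return h - h.mean()    # force the coefficients to sum to zero
```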
Also for the comparative test of the LoG filter, it was applied to the same test images used for the filters based on the gradient and on the Laplacian (see Figs. 1.25 and 1.26). In particular, the original images were filtered first, and then the same images with simulated salt and pepper and Gaussian noise, similar to that of the tests with the other filters based on the gradient and the Laplacian.
Fig. 1.26 Edge detection with the LoG filter. The filtering characteristics are identical to those of Fig. 1.25; only the original image changes
From a qualitative analysis, it can be said that the LoG operator gives better results than the Laplacian (which tends to amplify the noise), even if both are based on the second-order gradient. Compared to operators based on the first-order gradient, it extracts more directly the candidate pixels as contour elements and attenuates the noise, modulated adequately through the value of σ, thanks to the smoothing action carried out by the Gaussian central lobe of the filter. Like the Laplacian filter, it is isotropic up to fairly high spatial frequencies. Normally, the LoG operator requires, for its implementation, a single convolution, that is, the use of the h_LoG mask of size L with weights calculated analytically from the impulse response function of the filter (Eq. 1.49). The computational complexity of the LoG filter with a convolution mask of size L × L is of L² multiplications and additions for each pixel. Implemented as a separable convolution (with four 1D convolution masks), the calculation is reduced to 4 · L multiplications and additions per pixel. This difference in computation grows as the mask gets larger.
In conclusion, the Laplacian of Gaussian is an operator for the extraction of the edges that combines a smoothing activity (Gaussian filter) and an enhancement activity (Laplacian filter), realized through the zero crossings, which are essential to locate the edges themselves. Some models developed by Marr on the functioning mechanisms of the human visual system have shown a functional similarity to the Laplacian of Gaussian operator. Low-level vision functions (retinal operations) are modeled by the convolution of the retinal image with the LoG filter with variable extension.
1.14 Difference of Gaussians (DoG)

A good approximation of the LoG filter is obtained with the DoG operator, based on the difference of two Gaussians with different values of σ, as shown by Marr and Hildreth themselves [8,11]:
$$h_{DoG}(r, \sigma_1, \sigma_2) = h_G(r, \sigma_1) - h_G(r, \sigma_2) \qquad (1.53)$$
where
$$h_G(r, \sigma_1) = A e^{-\frac{r^2}{2\sigma_1^2}} \qquad \text{and} \qquad h_G(r, \sigma_2) = B e^{-\frac{r^2}{2\sigma_2^2}} \qquad (1.54)$$
with $A > B$ and $\sigma_1 < \sigma_2$. Given the image $f(x,y)$ and the filter $h_{DoG}$, the filtered image $g(x,y)$ is obtained with the convolution operator:
$$g(x,y) = [h_G(r, \sigma_1) - h_G(r, \sigma_2)] \ast f(x,y) = h_{DoG} \ast f(x,y) \qquad (1.55)$$
Marr and Hildreth have found that a better approximation of the LoG filter occurs if the ratio between the standard deviations is
$$\frac{\sigma_2}{\sigma_1} = 1.6 \qquad (1.56)$$
By virtue of the properties of separability and cascadability of Gaussians, which also apply to the DoG filter, an efficient implementation of the LoG operator can be achieved. Figure 1.27 shows the impulse response of the DoG filter calculated according to Eq. (1.56). Note the equivalence of the 3D shape with the LoG filter. In Fig. 1.27b, the 1D graphs of the two Gaussians (Eq. 1.54) that generated the impulse response of the DoG filter (Eq. 1.53) are represented, while in Fig. 1.27c, the corresponding transfer functions of the Gaussians and of the DoG filter in the frequency domain are displayed, confirming its characteristic band-pass shape, as already highlighted in Fig. 1.22 for the LoG filter. Biological evidence [2,10] has shown that the functioning mechanism of the visual system, in particular the low-level vision function carried out by the photoreceptors on the retina, is modeled with filtering at various scales, in analogy with the DoG filter.
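A minimal sketch of the DoG approximation under these assumptions (SciPy's `gaussian_filter` as the Gaussian smoother, the 1.6 ratio of Eq. 1.56, and A = B = 1 by default):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_filter(img, sigma1, ratio=1.6, A=1.0, B=1.0):
    """Difference of Gaussians (Eqs. 1.53-1.56): narrow Gaussian minus wide Gaussian,
    with sigma2 = ratio * sigma1."""
    img = img.astype(float)
    return A * gaussian_filter(img, sigma1) - B * gaussian_filter(img, ratio * sigma1)
```

The zero crossings of the response are then located exactly as for the LoG filter.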
Fig. 1.27 Difference of Gaussians filter DoG: a 3D representation of the DoG filter; b 1D representation of the Gaussians that generated the DoG filter; c 1D representation in the frequency domain of the Gaussian and DoG transfer functions
1.15 Second-Directional Derivative Operator

We have already considered the first directional derivative operator (gradient operator) to calculate the gradient of f along a direction r at an angle θ from the positive direction of the x axis, which was given by
$$\frac{\partial f}{\partial r} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial r} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial r} = f_x\cos\theta + f_y\sin\theta \qquad (1.57)$$
Similarly, the orientation of the edges can be calculated by searching for the zero crossings of the second derivative along r for each value of θ. Differentiating Eq. (1.57), one obtains:
$$\frac{\partial^2 f}{\partial r^2} = \frac{\partial}{\partial r}\left[f_x\cos\theta + f_y\sin\theta\right] = \frac{\partial^2 f}{\partial x^2}\cos^2\theta + 2\frac{\partial^2 f}{\partial x \partial y}\sin\theta\cos\theta + \frac{\partial^2 f}{\partial y^2}\sin^2\theta \qquad (1.58)$$
By varying θ, all the possible zero crossings are calculated. Unlike the Laplacian operator, the operator based on the second directional derivative is a nonlinear operator. It is also recalled that the Laplacian is an operator based on the second derivative but spatially isotropic with respect to rotation, which determines only the existence of the edge. To make the detection of the edge orientation more effective, it is useful to evaluate the direction in the discrete domain with the first derivative and then to approximate with Eq. (1.58), again in the discrete domain. The edge direction can be evaluated during the zero-crossing localization phase. In artificial vision, the Laplacian operator and those based on the second directional derivative are often not used, because the double differentiation (first and second derivative) is significantly influenced by the noise present in real images. In practice, even slight peaks in the first derivative generate (unreliable) zero crossings in the second derivative. We have previously examined the LoG operator, which partially mitigates this problem thanks to the Gaussian filtering component. More generally, the effect of the noise can be attenuated using more robust filtering techniques (as proposed by Haralick), approximating the image f(x, y) with two-dimensional polynomials, from which one directly derives the second derivative
analytically. This method is called the facet model [4]. These methods of approximating the gray levels of the image with an analytical function have a high computational cost and do not offer remarkable results.
1.16 Canny Edge Operator

Previously, we looked at different operators for edge extraction. These operators differ in their performance in terms of the accuracy of the edge location and the ability to filter noise. Almost all the operators presented process a sampled image to estimate the maximum value of the gradient locally. The majority of the operators described are derived in a heuristic way. This is because of the difficulty of modeling in a general way the discontinuities present in real images, which normally have already undergone a smoothing process in the imaging phase by the optical system and the digitizing system. The discontinuities in the image caused by the fluctuation of the sensors and the associated capture electronics of the camera produce local variations of intensity even in areas of the image that should be homogeneous (scattered edges and artifacts). These discontinuities due to noise cannot be easily filtered because they are often not modeled with respect to the functional characteristics of the filter. An operator that approximates the gradient measurement to extract the edges must balance two opposing requirements: (a) attenuate the noise; (b) localize edges as accurately as possible. If you exaggerate with the first requirement, you have problems locating the edges, because the smoothing process can add uncertainty about the position of the maximum discontinuity value. If you exaggerate with the second requirement, you have the drawback of increasing the operator's sensitivity and extracting edges that are actually artifacts to be attributed to the noise. A linear operator based on the first derivative of the Gaussian, DroG, is a good compromise between locating the edges and not depending strongly on noise. This operator, described in Sect. 1.11, combines Gaussian filtering and gradient estimation (based on finite differences), and is spatially non-isotropic. The Canny operator [1] proved to be optimal for solving the compromise between accurate edge localization and noise influence. This operator, from the functional point of view, is of the DroG type, and achieves the optimal conditions of signal-to-noise ratio and edge location. The characteristics of the Canny operator are summarized as follows:
(a) Good edge detection: the signal-to-noise ratio (SNR) of the gradient is maximized to obtain the least probability of error in determining a real edge and a minimum probability of considering a false edge as good.
(b) Good localization: points identified as edges are as close as possible to the center of the actual edge.
(c) Single response: the operator should produce a single answer for the same edge.
Let's now see how the optimal characteristics proposed by Canny can be described mathematically. The first and second characteristics can be satisfied by considering the DroG operator, the derivative of the Gaussian, as the operator that best approximates the optimum of signal-to-noise ratio and localization. The first step of the Canny operator is to convolve the image f(i, j) with a Gaussian smoothing filter h(i, j; σ), producing an image g(i, j) given by g(i, j) = f(i, j) ∗ h(i, j; σ)
(1.59)
where σ controls the smoothing level. Subsequently, for each pixel (i, j) of the image g, the following quantities are estimated: the partial derivatives $g_x$ and $g_y$, and the magnitude M(i, j) and orientation θ(i, j) of the gradient. The partial derivatives are approximated for every pixel (i, j) using 2 × 2 windows and first differences:
$$g_x(i,j) \cong \left[(g(i,j+1) - g(i,j)) + (g(i+1,j+1) - g(i+1,j))\right]/2 \qquad (1.60)$$
$$g_y(i,j) \cong \left[(g(i,j) - g(i+1,j)) + (g(i,j+1) - g(i+1,j+1))\right]/2 \qquad (1.61)$$
The magnitude and orientation of the gradient are calculated from the known formulas (see Sect. 1.2):
$$M(i,j) = \sqrt{g_x^2 + g_y^2} \qquad (1.62)$$
$$\theta(i,j) = \arctan\frac{g_y(i,j)}{g_x(i,j)} \qquad (1.63)$$
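A minimal NumPy sketch of Eqs. (1.60)-(1.63), under the assumption that the smoothed image g is a 2D float array (the last row and column are simply dropped from the output):

```python
import numpy as np

def canny_gradient(g):
    """Finite-difference gradient on 2x2 neighborhoods (Eqs. 1.60-1.63)."""
    g = g.astype(float)
    gx = ((g[:-1, 1:] - g[:-1, :-1]) + (g[1:, 1:] - g[1:, :-1])) / 2.0   # Eq. (1.60)
    gy = ((g[:-1, :-1] - g[1:, :-1]) + (g[:-1, 1:] - g[1:, 1:])) / 2.0   # Eq. (1.61)
    M = np.hypot(gx, gy)                                                  # Eq. (1.62)
    theta = np.arctan2(gy, gx)                                            # Eq. (1.63), with quadrant handling
    return M, theta
```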
The values of M(i, j) and θ(i, j) can be pre-calculated and tabulated in a look-up table to optimize the processing time of the operator. In addition, integer arithmetic can be used to approximate the orientation angle of the gradient without penalizing the operator's results. We are now interested in calculating the position of the edge, its orientation, and possibly considering it as a potential edge in relation to the gradient magnitude M(i, j). One way to locate the edge points would be to consider the points of the gradient image M(i, j) with high values. This approach would not be robust for identifying edges: the gradient image accentuates the areas with discontinuities of the gray levels, and consequently, in the gradient image M(i, j), the localization problem would be reduced to finding the points of local maximum. What strategies can be used? The Marr–Hildreth operator [8] locates the edge at the zeros of the second derivative (remember that it does not provide the edge direction because it is based on the
Laplacian operator with circular symmetry). The Canny operator instead sets to zero the second directional derivative to find the orientation of the maximum gradient value and then the direction orthogonal to the edge. This is expressed by the equation:
$$\frac{\partial^2 g}{\partial r^2} = \frac{\partial g_x}{\partial r}\cos\theta + \frac{\partial g_y}{\partial r}\sin\theta = 0 \qquad (1.64)$$
where r is the direction with respect to which the second derivative of the smoothed image g is calculated, and θ is the angle between the direction r and the horizontal x axis. We want to emphasize that the localization of the edges is not a very simple task, even in the presence of well-defined edges, because the second derivative of the smoothed image g does not always take exactly the value zero at a pixel position. A good strategy can be to consider the local trend of the pixels to look for the zero values of the second derivative. This is feasible because at each pixel the direction $r_n$ normal to the edge is known. The search for the edge location is then realized by considering a window of adequate size, centered on the pixel being processed, to analyze the neighboring pixels along the normal to the edge (see Fig. 1.28a). In this way, when the sign of the second derivative changes along the pixels of the normal, the edge is located at the pixel where the second derivative assumes the smaller absolute value. If the adjacent pixels along the gradient direction do not fall on image pixels but are localized at the sub-pixel level, a location estimate can be computed by linear interpolation or by approximating a limited number of neighboring pixels (see Fig. 1.28a). In the latter case, one could estimate the location of the edge with sub-pixel accuracy. The idea of using the second derivative of the smoothed image g still involves problems in the case of very noisy images and images with remarkable texture. A valid alternative is to repeat the previous procedure but considering only the information associated with the gradient (i.e., partial first derivatives and their manipulations). Returning to the calculation of the edge location, the problem is reduced to finding the local maxima in the gradient magnitude image in the direction perpendicular to the edge, where the second directional derivative is zero. To identify the edges, the peaks of the gradient magnitude should be thinned out so that there is only one point where the local maximum occurs. This process is called Non-Maximum Suppression (NMS); it thins the peaks in M(i, j) by placing zero in all points of M(i, j) along the gradient direction θ(i, j) that are not peak values (see Fig. 1.28b). At the end of this process, all the zones in M(i, j) with local maxima have been thinned to peaks of only one pixel, producing in output the image: N(i, j) = nms{M(i, j); θ(i, j)}. Nonzero pixels in N(i, j) correspond, in the source image, to zones with discontinuities of the gray levels. However, the non-maximum suppression image N(i, j) contains several false edges due to the noise and strong texture that could be present in the original image f, although initially smoothed with a Gaussian filter h(i, j; σ). Normally, these unwanted structures have low gradient values. The original Canny operator
Fig. 1.28 Non-maximum suppression and hysteresis thresholding: a Search for local maxima in the gradient image M(i, j) for each pixel (i, j) in the direction of maximum gradient given by θ(i, j), discretized with 8-neighbors; b NMS image with the pixel-edge elements identified by the non-maximum suppression process; the arrows indicate the direction of the gradient and the numbers the magnitude; c Strong-edges image $N_{T_2}$ generated from the NMS image with hysteresis thresholding; interrupted edge traits are observed; d Weak-edges image $N_{T_1}$ in which the search for interrupted edges is continued. Black edge traits are strong edges included only to highlight the correspondence to weak edges (drawn in red and dashed)
calculated the direction of the gradient in a different way. Remembering that g is the smoothed image, the direction $r_n$ normal to the edge is estimated as follows:
$$r_n = \frac{\nabla g(i,j)}{M(i,j)}$$
where ∇ represents the standard gradient of the function g, which has the two vector components $g_x$ and $g_y$ corresponding to the first partial derivatives with respect to the x and y coordinate axes; M(i, j) is the gradient magnitude as defined above by Eq. (1.62). A measure of edge strength can be expressed by associating each edge with the gradient magnitude M(i, j) of the smoothed image g. A method to reduce the false edge fragments present in N(i, j) is to apply an appropriate threshold: all values of N(i, j) below a certain threshold value are set to zero. The new image N(i, j) thus obtained represents the edges of the source image f(i, j), with intensity values coinciding with the gradient magnitude. This image can be considered the final result of the Canny operator (algorithm), which has extracted the edges present in the image; the intensity values in this case give an estimate of the strength of the identified edges. Depending on the considered threshold value T, the final image may still present fragments of false edges (threshold too low, which keeps edges with a low signal-to-noise ratio) or fragments of missing edges (threshold too high, which filters out real edges with a high signal-to-noise ratio). This problem can never be completely eliminated due, for example, to possible shadows present in the original image that attenuate the intensity levels in some areas of the object contour. A heuristic approach can be used to mitigate this problem; it consists in producing several NMS images $N_1(i, j), N_2(i, j), \ldots, N_m(i, j)$ from the image N(i, j),
applying different thresholds $T_1, T_2, \ldots, T_m$. The Canny algorithm uses a strategy with two thresholds, also called thresholding with hysteresis. In particular, it defines a high threshold $T_2$ and a low threshold $T_1$ ($T_1 < T_2$), generating from the image N(i, j) two new images:
$$N_{T_2}(i,j) = N(i,j) \quad \text{with} \quad N(i,j) \ge T_2 \qquad (1.65)$$
$$N_{T_1}(i,j) = N(i,j) \quad \text{with} \quad T_1 \le N(i,j) \le T_2 \qquad (1.66)$$
where $N_{T_1}$ and $N_{T_2}$ are the images that include, respectively, the edges marked as weak (values between the two thresholds) and strong (edge pixels with values equal to or greater than the high threshold). In essence, the pixels below the low threshold are suppressed. Pixels marked strong in the $N_{T_2}$ image are assumed to be certain edge elements and can already be included in the final image. Pixels marked weak in the $N_{T_1}$ image will be considered edge elements and included in the final image only if they are connected to strong pixels of $N_{T_2}$. The procedure analyzes the strong pixels in the $N_{T_2}$ image and, as soon as it finds an interruption on the contour being processed (see Fig. 1.28c), searches in $N_{T_1}$, using 8-adjacency and the edge direction information previously calculated, for edges that can be connected to the interrupted contour of the $N_{T_2}$ image (see Fig. 1.28d). The algorithm continues in this way (contour tracking), trying to identify in $N_{T_1}$ only the edges missing in $N_{T_2}$, with the goal of filtering out the false ones, which are more frequent in $N_{T_1}$. Once the contour tracking process is completed, attempting to complete the broken contours in $N_{T_2}$ with the weak but compatible traits contiguous with those of $N_{T_1}$, we have the final binary image consisting of pixel-edge elements and non-edge pixels. An additional contour connection process could be applied to the final binary image, based on heuristic approaches that take into account the application context. Returning to the final result of the Canny algorithm, we observe that it is conditioned by several parameters: the choice of σ of the Gaussian filter and of the hysteresis thresholds. The parameter σ controls the smoothing level of the original image for noise suppression, but it also eliminates some details of the image, which then cause contour interruptions (see Fig. 1.29). Hysteresis thresholding allows greater flexibility than a single threshold, but the thresholds must be chosen appropriately. A value of $T_2$ that is too high could exclude significant edge elements; conversely, with a value that is too low, false edges are included. There is, therefore, no solution that automatically calculates the optimal parameters; they must be found experimentally, through various attempts, on the basis of the application context. In fact, the results of any operator are intrinsically dependent on the type of image under consideration and the desired type of result. The compromise between the highlighted edges, their localization, and the undesired spurious structures is controlled by the user with the available parameters of the algorithm, chosen in relation to the application context.
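As a minimal sketch of the hysteresis step only (Eqs. 1.65-1.66), one common shortcut replaces the explicit contour tracking with connected-component labeling of the weak-or-strong mask; the SciPy-based implementation below is an illustration under that assumption, not the authors' procedure.

```python
import numpy as np
from scipy import ndimage

def hysteresis(N, T1, T2):
    """Keep strong pixels (N >= T2) and weak pixels (T1 <= N < T2) only when the weak
    pixels are 8-connected to at least one strong pixel."""
    strong = N >= T2
    weak_or_strong = N >= T1
    structure = np.ones((3, 3), dtype=bool)               # 8-adjacency
    labels, num = ndimage.label(weak_or_strong, structure=structure)
    if num == 0:
        return np.zeros_like(N, dtype=bool)
    # a connected component survives if it contains at least one strong pixel
    keep = np.asarray(ndimage.maximum(strong.astype(np.uint8), labels,
                                      index=np.arange(1, num + 1))) > 0
    edges = np.zeros_like(N, dtype=bool)
    edges[weak_or_strong] = keep[labels[weak_or_strong] - 1]
    return edges
```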
Fig. 1.29 Detection of edges with the Canny algorithm applied at different scales, in analogy with the application of the LoG filter (Fig. 1.24), using the same test image with different values of σ, thus obtaining a sequence of filtered images with details at various scales: σ = 0.5, 1.41, 2, 3
1.16.1 Canny Algorithm

The essential steps of the Canny algorithm can be summarized as follows:
1. Apply the convolution (Eq. 1.59) to the original image f(i, j) with the Gaussian smoothing filter h(i, j; σ), obtaining the smoothed image g(i, j), whose attenuated noise is controlled by the standard deviation σ.
2. Calculate the following quantities (images) for each pixel of the smoothed image g(i, j):
(a) $g_x(i, j)$, first partial derivative (horizontal component), using Eq. (1.60);
(b) $g_y(i, j)$, first partial derivative (vertical component), using Eq. (1.61);
(c) M(i, j), gradient magnitude, using Eq. (1.62);
(d) θ(i, j), gradient direction, using Eq. (1.63).
3. Apply the Non-Maximum Suppression (NMS) procedure to the gradient magnitude M(i, j), resulting in the image N(i, j) containing the edges located along the gradient; the pixel values represent the strength of the edges present in the original image f(i, j), expressed in terms of gradient magnitude (a minimal sketch of this step is given below).
4. Activate a heuristic procedure that uses appropriate thresholds on the image N(i, j) to mitigate the presence of false edges and to link broken contours.
5. Repeat the previous 4 steps with increasing values of $\sigma_k$, obtaining the edge images N(i, j; $\sigma_k$), to have different results of the Canny operator.
6. Integrate, if necessary, the results obtained with N(i, j; $\sigma_k$) at different scales in relation to the application context.
The formal steps of the Canny operator are the first 4. Step 3 can be replaced by locating the edges at the zeros of the second directional derivative. The Canny operator is the most complex of the edge extraction operators considered and seems to offer the best performance, even if it does not completely solve all the problems. The algorithm was applied to the same test images used for the other edge operators. A qualitative comparison shows a good performance of the Canny algorithm compared to that of Sobel and LoG, as shown in Fig. 1.30. The Canny algorithm is more robust in detecting details of the image, better localizing the edges and minimizing the artifacts (it filters noise better, not being based on the second-order gradient). It does not need thinning of the detected edges (through an additional thinning process), even if it requires several calibration steps to find the optimum thresholds. For space reasons, we do not report the different results obtained with the Canny filter for very small values of σ (for example, 0.01), especially for the peppers image (without noise), which would highlight real details due to the curvature of the objects and the reflections of light. A single value of σ = 1.4 was used in order to compare the results with the other operators. The algorithm has also been tested on images with noise, under the same conditions as the other edge detectors (see Fig. 1.31), showing good results, in particular on images with 1% Gaussian noise.
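Referring to step 3 above, the following is a minimal sketch of non-maximum suppression with the gradient direction quantized to four sectors; the neighbor offsets and the inclusive comparison are illustrative conventions, not the authors' exact procedure.

```python
import numpy as np

def non_maximum_suppression(M, theta):
    """Keep M(i,j) only where it is a local maximum along the quantized gradient direction."""
    N = np.zeros_like(M)
    angle = (np.rad2deg(theta) + 180.0) % 180.0          # fold directions into [0, 180)
    offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}   # row axis pointing down
    rows, cols = M.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            sector = int(((angle[i, j] + 22.5) // 45) % 4) * 45
            di, dj = offsets[sector]
            if M[i, j] >= M[i + di, j + dj] and M[i, j] >= M[i - di, j - dj]:
                N[i, j] = M[i, j]                         # keep the ridge pixel, zero elsewhere
    return N
```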
Fig. 1.30 Application of the Canny algorithm and comparison with the results of the Sobel and LoG filters. First row: Sobel with threshold 0.132, LoG with σ = 1.41 and threshold 0.010, Canny with σ = 1.41 and thresholds 0.05-0.14. Second row: Sobel with threshold 0.2511, LoG with σ = 1.41 and threshold 0.022, Canny with σ = 1.41 and thresholds 0.11-0.28
Fig. 1.31 Application of the Canny algorithm with σ = 1.41 and σ = 3 on the test images with Gaussian noise (0.01) and salt-and-pepper noise (0.1); hysteresis thresholds 0.06-0.16 and 0.11-0.28
The characteristic parameters of the algorithm, the thresholds and the standard deviation σ, must be adequately adapted to the application context. In the tests performed, for example, the best result for the houses image is obtained with σ = 1.4 and for the peppers image with σ = 3. For all the operators used, the salt-and-pepper noise was, as expected, the most resistant to filtering, while the Canny algorithm with σ = 3 gave good results. In general, the operators examined, regardless of their complexity, all suffer more or less from the same problems. They have a tendency to create edges with broken and missing traits instead of continuous ones, and to merge edge traits that should instead be separated. Edge extraction algorithms that are variants [9] of Canny's have been studied with sub-pixel resolution based on the differential geometry approach, formulating the computation of the NMS image in terms of derivatives of order higher than the first at different scales (Lindeberg [6]). Another approach based on variational geometry was proposed by Haralick [3,5]. The search for new operators will certainly mitigate these problems, but it will be difficult to completely remove the errors caused by the noise, which generates intensity discontinuities and artifacts that cannot be associated with the geometrical nature of the objects. Many of these discontinuities, we recall, depend on other phenomena such as shadows, stacking of objects, fluctuations of the sensors, and the electronics of the acquisition system. This will lead to the development of additional algorithms, to be applied after the edge extraction operators, to obtain an image with edges that have a geometric consistency that best describes the contours of the objects. These algorithms will require a priori knowledge of the scene and will also use techniques based on heuristics (post-edge detection processes).
1.17 Point Extraction

In several applications, it is necessary to highlight and locate point-like discontinuities present in an image. For this purpose, it is useful to calculate the gradient image with different convolution masks, with different impulse responses h(i, j), which better accentuate the transitions at point-like areas. The 3×3 Laplacian convolution masks already described above can be considered:
$$h_1 = \frac{1}{4}\begin{bmatrix} 0 & -1 & 0\\ -1 & 4 & -1\\ 0 & -1 & 0 \end{bmatrix} \qquad h_2 = \frac{1}{8}\begin{bmatrix} -1 & -1 & -1\\ -1 & 8 & -1\\ -1 & -1 & -1 \end{bmatrix} \qquad h_3 = \frac{1}{8}\begin{bmatrix} -2 & 1 & -2\\ 1 & 4 & 1\\ -2 & 1 & -2 \end{bmatrix}$$
to calculate the gradient image g(i, j). The well-known convolution operator can be used:
$$g(i,j) = f(i,j) \ast h(i,j)$$
where f(i, j) is the input image. In homogeneous zones of the image, the result of the convolution is zero, because the sum of the coefficients of the masks is zero. If the pixel being processed, where the mask is centered, is an isolated point with a higher intensity than the neighboring pixels, the value of the gradient is greater than zero. Normally, the point-like areas are well highlighted in the gradient image g(i, j) with the condition:
$$|g(i,j)| > S$$
where S is a gradient threshold value, selected appropriately to highlight only the point-like areas of interest with respect to the type of application. Some disadvantages can occur when selecting the threshold S, if we remember that the Laplacian locates the discontinuities not in relation to particular values of the gradient, but through the change of sign (zero crossing) of the second derivative. This problem can be mitigated, as suggested by Prewitt, with the following mask:
$$h_4 = \frac{1}{8}\begin{bmatrix} 1 & -2 & 1\\ -2 & 4 & -2\\ 1 & -2 & 1 \end{bmatrix}$$
Furthermore, before applying the Laplacian operator, a low-pass filter can be used to smooth the image. Finally, the point-like zones can also be highlighted with correlative techniques, where sample windows of dimensions L × L containing the point-like zones of interest are searched for in the image. A more general solution will be considered with the matched filter approach, which will be described below.
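A minimal sketch of this point detector, assuming SciPy's `convolve`, the mask $h_2$, and an application-dependent threshold S:

```python
import numpy as np
from scipy.ndimage import convolve

def detect_points(f, S):
    """Point-like discontinuities: Laplacian response thresholded in absolute value, |g| > S."""
    h2 = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float) / 8.0
    g = convolve(f.astype(float), h2, mode='reflect')
    return np.abs(g) > S
```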
1.18 Line Extraction

In several image analysis applications, in addition to the extraction of elementary structures such as edges, it is interesting to highlight more complex structures such as line segments, or more generally to accentuate the contours of an object with
respect to the background. Some of the described edge extraction operators can be modified to specialize them in line detection (for example, the Canny, Laplacian, and gradient operators). In this section, some operators will be considered for the extraction of line segments of limited length. Other algorithms will be considered later to connect broken lines together or to disconnect incorrectly linked line segments. In the current context, we will consider specialized gradient-based operators to enhance short traits of vertical, horizontal, and diagonal lines. The gradient image is calculated by convolution of the image f(i, j) with different 3×3 masks $h_k(i, j)$, whose coefficients are selected to guarantee the impulse response of an operator that extracts lines in the different orientations:
$$g(i,j) = \max_{1 \le k \le 4} \left| f(i,j) \ast h_k(i,j) \right|$$
The masks for the extraction of lines in the 4 possible directions, with scale factors of 1/6, are
$$h_1 = \begin{bmatrix} -1 & 2 & -1\\ -1 & 2 & -1\\ -1 & 2 & -1 \end{bmatrix}~(vertical) \qquad h_2 = \begin{bmatrix} -1 & -1 & -1\\ 2 & 2 & 2\\ -1 & -1 & -1 \end{bmatrix}~(horizontal) \qquad h_3 = \begin{bmatrix} -1 & -1 & 2\\ -1 & 2 & -1\\ 2 & -1 & -1 \end{bmatrix}~(+45°) \qquad h_4 = \begin{bmatrix} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2 \end{bmatrix}~(-45°)$$
In the homogeneous zones of the image, the gradient is zero while in the presence of discontinuity, the pixel value is accentuated. In some cases, the line-extraction operator can be reapplied to the same gradient image to further enhance the lines extracted in the first operation. The noise effect can be attenuated by smoothing with a low-pass filter to be applied before the line-extraction operator. Furthermore, convolution masks with a size of 5×5 can be used to increase the number of line orientations (for the latter aspect, see also Sect. 1.10). Finally, also for the extraction of the lines, correlative techniques can be used to compare the original image with different windows of sample images containing segments of lines in different directions (template matching).
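A minimal sketch of this line detector (SciPy `convolve`, the four masks above, maximum absolute response over the orientations):

```python
import numpy as np
from scipy.ndimage import convolve

# the four 3x3 line masks above (vertical, horizontal, +45, -45), scale factor 1/6
LINE_MASKS = [np.array(m, dtype=float) / 6.0 for m in (
    [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]],    # vertical
    [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]],    # horizontal
    [[-1, -1, 2], [-1, 2, -1], [2, -1, -1]],    # +45 degrees
    [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]])]   # -45 degrees

def detect_lines(f):
    """g(i,j) = max_k |f * h_k|: strongest response over the four orientations."""
    f = f.astype(float)
    responses = [np.abs(convolve(f, h, mode='reflect')) for h in LINE_MASKS]
    return np.max(responses, axis=0)
```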
1.19 High-Pass Filtering

While in Chap. 9 Vol. I we used low-pass filters to level the pixel intensities and attenuate the noise, we will now apply high-pass filters for the opposite purpose, i.e., to detect and/or enhance the discontinuities of pixel intensity. All edge extraction operators seen above can be analyzed in the frequency domain by considering the transfer function associated with each operator. In fact, all the discontinuities
present in the image (edges and noise) are associated with the high-frequency components. It follows that the edge extraction operators are high-pass filters and can also be defined directly in the frequency domain. The convolution masks designed in the spatial domain to extract the edges had the central pixel with a greater weight than any of the surrounding ones. This was done precisely to accentuate the discontinuities present in the image, which in the frequency domain is equivalent to an accentuation of the high frequencies. If there is no discontinuity in the image, the operator does not produce any alteration, analogously to the way a low-pass filter does not alter the low-frequency components. A high-pass filter, in addition to enhancing the edges, can also be used to improve the visual quality of the image (enhancement) by bringing out structures obscured by haze or by blurring. High-pass filters produce the opposite effect of low-pass filters: they accentuate the high-frequency components while attenuating the low-frequency components. A high-pass filter $H_H(u,v)$ can be derived from a low-pass filter $H_L(u,v)$ and vice versa as follows:
$$H_H(u,v) = 1 - H_L(u,v) \qquad (1.67)$$
From the convolution theorem in Sect. 9.11.3 Vol. I, we know that convolution in the spatial domain corresponds, in the frequency domain, to the simple product of the input image and the transfer function, both transformed into the Fourier domain and indicated with F(u, v) and H(u, v), respectively. Therefore, the filtered image G(u, v) in the Fourier domain is obtained by multiplying the image with the filter pixel by pixel as follows:
$$G(u,v) = F(u,v) \cdot H(u,v) \qquad (1.68)$$
where H(u, v) indicates the transfer function, that is, the generic filter to be applied to the input image F(u, v). The effects of the filtering can be analyzed in the spatial domain by applying the inverse Fourier transform to the filtered image G(u, v). By virtue of the convolution theorem, filtering in the frequency domain and in the spatial domain are equivalent; therefore, once a frequency filter is designed, it can also be implemented as a spatial filter, even if only approximated, with an appropriate convolution mask. Now let's analyze a variety of filters in the frequency domain for edge extraction and, in general, to improve the visual quality of the structures (edges, textures, ...) present in the image and associated with particular frequencies.
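A minimal sketch of frequency-domain filtering per Eq. (1.68), assuming NumPy's FFT with the spectrum centered by `fftshift`; the helper `radius_grid` computes the distance l(u, v) used by the filters that follow:

```python
import numpy as np

def radius_grid(shape):
    """Distance l(u,v) of each frequency sample from the center of the (shifted) spectrum."""
    rows, cols = shape
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    V, U = np.meshgrid(v, u)
    return np.sqrt(U**2 + V**2)

def apply_frequency_filter(f, H):
    """G(u,v) = F(u,v) * H(u,v) (Eq. 1.68), applied on the centered spectrum."""
    F = np.fft.fftshift(np.fft.fft2(f.astype(float)))
    G = F * H
    return np.real(np.fft.ifft2(np.fft.ifftshift(G)))

# example: ideal high-pass mask with cut-off l0 = 40 px (see Eq. 1.69 below)
# l = radius_grid(img.shape); H = (l > 40).astype(float); g = apply_frequency_filter(img, H)
```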
1.19.1 Ideal High-Pass Filter (IHPF)

The transfer function of this two-dimensional filter is defined as follows:
$$H(u,v) = \begin{cases} 0 & l(u,v) \le l_0\\ 1 & l(u,v) > l_0 \end{cases} \qquad (1.69)$$
$$l(u,v) = \sqrt{u^2 + v^2} \qquad (1.70)$$
where l0 is the cut-off frequency and l(u, v) is the generic distance of a point (u, v) from the origin in the frequency domain.
Fig. 1.32 Ideal high-pass filter: it passes only the high frequencies that fall outside the circle of radius $l_0$ centered at the origin of the frequency domain: a 3D representation of the transfer function, b its transverse 1D profile, and c the 1D impulse response of the IHPF filter
As can be seen from Fig. 1.32, this filter is the opposite of the ideal low-pass filter, because it completely eliminates all low frequencies up to the $l_0$ threshold, while leaving unchanged all the frequencies outside the circle of radius $l_0$ centered at the origin of the spectrum (see Fig. 1.32a). In practice, this filter is not physically realizable. Figure 1.32 shows in (b) the transfer function of the IHPF filter and in (c) the corresponding impulse response. As expected, since the high frequencies are associated with the edges, the filter generates an exaltation of the contours. The filter has been applied to the image of Fig. 1.33a, and the effects of the filter are highlighted as the cut-off frequency varies. In essence, the discontinuities are more and more accentuated as the cut-off frequency $l_0$ moves toward the high frequencies. The IHPF filter results are obtained with the cut-off frequencies $l_0$ = 20, 80, 120 px (radius expressed in pixels). They show the unwanted ringing effect, more accentuated at low values of the cut-off frequency. The ringing effect, already observed with low-pass filtering (Sect. 9.13 Vol. I), is caused by the sharp frequency cut-off and is confirmed by the profile of the impulse response of the filter (see Fig. 1.32c), with its external lobes. The darkening of the image is typical of high-pass filters, because the background of the image (normally represented by the low frequencies) is canceled, as happened with the filters based on differentiation in the spatial domain.
1.19.2 Butterworth High-Pass Filter (BHPF)

This filter has the transfer function defined as follows:
$$H(u,v) = \frac{1}{1 + [l_0/l(u,v)]^{2n}} \qquad (1.71)$$
where $l_0$ (cut-off distance) and l(u, v) have the same meaning as above, and n is the order of the filter. Figure 1.34 shows the 3D graph of the BHPF filter of order 2 and its 1D transversal profile up to order 4. When $l(u,v) = l_0$, the transfer function H(u, v) has value 0.5, that is, it has reached half of its maximum value. In practice, it is common to select the cut-off frequency distance at the points for which H(u, v)
Fig. 1.33 Results of the ideal, Butterworth (order 2), and Gaussian high-pass filters. First row, IHPF: a original image, b filtering with cut-off frequency l0 = 20 px, c l0 = 80 px, and d l0 = 120 px. Second row: Butterworth filter, with one additional result at l0 = 40 px. Third row: Gaussian filter, with the rebuilt image at the same cut-off frequencies as the Butterworth filter
is less than $1/\sqrt{2}$ of its maximum value. The transfer function becomes
$$H(u,v) = \frac{1}{1 + (\sqrt{2} - 1)[l_0/l(u,v)]^{2n}} \qquad (1.72)$$
In Fig. 1.33, second row, the results of the BHPF filter are reported for different cut-off frequencies, from which emerges a remarkable attenuation of the ringing effect with almost no artifacts (in this case, the contours are less distorted) already starting from low cut-off frequencies. The darkening of the image background, typical of high-pass filters, remains.
Fig. 1.34 Butterworth high-pass filter: a 3D representation of the transfer function of order 2; b its transverse 1D profile of order n = 1, 2, 3, 4
1.19.3 Gaussian High-Pass Filter (GHPF)

This filter is functionally identical to the IHPF filter but differs, similarly to the BHPF filter, by the more attenuated transitions at the cut-off frequencies. It has the same form as the Gaussian low-pass filter (described in Chap. 9 Vol. I), with the transfer function (derived from the GLPF filter) given by
$$H(u,v) = 1 - e^{-l^2(u,v)/2l_0^2} \qquad (1.73)$$
where $l_0$ (cut-off distance) and l(u, v) have the same meaning as above. In Fig. 1.35, the transfer function is displayed for different cut-off frequencies ($l_0$ = 20, 40, 80, 120 px). The results of the GHPF filter are shown in the third row of Fig. 1.33. Compared to the two previous high-pass filters, the action of the GHPF filter is less pronounced, as can be seen from the 1D transversal profiles of the transfer functions. Ultimately, the BHPF and GHPF filters are also preferred because they almost eliminate the ringing effect.
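As a minimal sketch, the two transfer functions (Eqs. 1.71 and 1.73) can be generated on a frequency grid and applied with `apply_frequency_filter`; `radius_grid` is the helper from the sketch after Eq. (1.68):

```python
import numpy as np

def butterworth_highpass(shape, l0, n=2):
    """H(u,v) = 1 / (1 + (l0 / l)^(2n))  (Eq. 1.71)."""
    l = radius_grid(shape)             # helper defined in the Sect. 1.19 sketch
    l[l == 0] = 1e-6                   # avoid division by zero at the DC component
    return 1.0 / (1.0 + (l0 / l) ** (2 * n))

def gaussian_highpass(shape, l0):
    """H(u,v) = 1 - exp(-l^2 / (2 l0^2))  (Eq. 1.73)."""
    l = radius_grid(shape)
    return 1.0 - np.exp(-(l ** 2) / (2.0 * l0 ** 2))
```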
1.20 Ideal Band-Stop Filter (IBSF)

This filter (also called band-reject filter) operates in the opposite way to the ideal band-pass filter, removing all the frequencies included in a band and leaving unchanged all the other frequencies outside the band. The transfer function is defined by
$$H(u,v) = \begin{cases} 1 & l(u,v) < l_0 - \frac{\Delta l}{2}\\ 0 & l_0 - \frac{\Delta l}{2} \le l(u,v) \le l_0 + \frac{\Delta l}{2}\\ 1 & l(u,v) > l_0 + \frac{\Delta l}{2} \end{cases} \qquad (1.74)$$
where $\Delta l$ is the bandwidth and $l_0$ is the center radius of the band-stop filter. The ideal band-stop filter falls into the category of selective filters, which can be efficacious
Fig. 1.35 Gaussian High-Pass Filter: a 3D representation of the transfer function for l0 = 120; b its 1D transverse profiles for cut-off frequencies l0 = 20, 40, 80, 120 px Butterworth Band Stop Filter l =57; Delta=20px
Band Stop l =57; Delta=20px
Band Stop l =57; Delta=20px
u
1 0.9
1
0.8 0.7
H(u,v)
H(u,v)
0.8 0.6 0.4
0.4
0.2
0.3
0 40 30
40
v
20
10
u
10 0
0
Image with periodic noise
d)
0.2
30
20
a)
0.6 0.5
c)
b)
0 50
100
150
200
250
300
350
400
450
500
u
Band Stop l =57; Delta=20px
e)
0.1
v
Band Stop l =57; Delta=20px
v
f)
u Band Stop l =57; Delta=20px
g)
Fig. 1.36 Ideal and Butterworth band-stop filtering application. The results of the two filters for the periodic noise attenuation present in the image are shown. a 3D graphic representation of the transfer function of the Butterworth band-stop filter and in c 1D representation of the transversal profile of the same filter; d Original image with periodic noise; e Image reconstructed with the periodic noise attenuated through the Butterworth filter and in g the image reconstructed with the ideal filter; b and f image spectrum with the superposed filter of Butterworth and ideal, respectively
when we know the frequencies associated with a repetitive and periodic image noise that is localized in some areas of the frequency domain. In Sect. 9.11 Vol. I, we have already described how it is possible, in general, to build filter masks in the frequency domain using the properties of the Fourier transform.
Figure 1.36 shows the image spectrum with the band (circular ring) which includes the frequencies to be eliminated associated with the periodic noise present in the test image. Obviously, in a complex image, it is difficult to selectively find the frequencies to be removed that are exclusively attributable to noise. In fact, some of these removed frequencies can be intrinsic to the image itself. This explains the incomplete reconstruction of the original image even if the filtering result is acceptable.
1.20.1 Butterworth and Gaussian Band-Stop Filter

Selective frequency removal with the ideal band-stop filter is rarely used in practice. To improve the results of selective frequency removal, instead of filtering with step frequency cuts, filters with transfer functions that allow more gradual cuts of the frequencies can be used. This is possible using transfer functions with polynomial and exponential approximations, based on the following Butterworth $H_{Br}(u,v)$ and Gaussian $H_{Gr}(u,v)$ filters:
$$H_{Br}(u,v) = \frac{1}{1 + \left[\dfrac{l(u,v)\,\Delta l}{l^2(u,v) - l_0^2}\right]^{2n}} \qquad (1.75)$$
$$H_{Gr}(u,v) = 1 - e^{-\frac{1}{2}\left[\dfrac{l^2(u,v) - l_0^2}{l(u,v)\,\Delta l}\right]^2} \qquad (1.76)$$
where $l_0$ and $\Delta l$ are the parameters that control, respectively, the location and the width of the frequency band to be eliminated. Recall that for the Butterworth filter, the exponent n controls the steepness of the transfer function profile at the cut-off frequencies at the edges of the band. Both filters help to minimize the ringing effect in the reconstructed image. Figure 1.36 shows the transfer function and the results obtained with the Butterworth band-stop filter for the same noisy image to which the ideal band-stop filter has also been applied. An attenuation of the ringing phenomenon is highlighted.
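A minimal sketch of Eq. (1.75) for attenuating periodic noise, reusing `radius_grid` and `apply_frequency_filter` from the sketch in Sect. 1.19; the ring parameters below follow Fig. 1.36 and are only illustrative:

```python
import numpy as np

def butterworth_bandstop(shape, l0, dl, n=2):
    """H_Br(u,v) of Eq. (1.75): notch out a ring of width dl centered at radius l0."""
    l = radius_grid(shape)             # helper defined in the Sect. 1.19 sketch
    num = l * dl
    den = l ** 2 - l0 ** 2
    den[den == 0] = 1e-6               # avoid division by zero exactly on the ring
    return 1.0 / (1.0 + (num / den) ** (2 * n))

# usage against periodic noise (values as in Fig. 1.36):
# H = butterworth_bandstop(img.shape, l0=57, dl=20)
# clean = apply_frequency_filter(img, H)
```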
1.21 Band-Pass Filter (BPF)

The band-pass filter selects, in a specified region of the spectrum, a range of contiguous frequencies (band) to be passed and attenuates or completely eliminates the frequencies outside this range. Essentially, a band-pass filter attenuates very low and very high frequencies, and simultaneously maintains a band of intermediate frequencies. Band-pass filtering can be used to improve edges, suppressing low frequencies, and attenuating high frequencies where noise is usually localized (noise attenuation).
Fig. 1.37 Ideal band-pass filter: a image of the filter mask built with cut-off frequency $l_0$ = 115 px and bandwidth $\Delta l$ = 70 px; b the transversal profile of the transfer function; and c result of the filtered image
1.21.1 Ideal Band-Pass Filter (IBPF)

The transfer function of the IBPF filter is given by
$$H(u,v) = \begin{cases} 1 & l_0 - \frac{\Delta l}{2} \le l(u,v) \le l_0 + \frac{\Delta l}{2}\\ 0 & \text{otherwise} \end{cases} \qquad (1.77)$$
where $l_0$ indicates the center radius of the band $\Delta l$. The ideal band-pass filter (see Fig. 1.37) also falls into the category of selective filters, which can be effective when the frequencies associated with the image details to be enhanced or eliminated are known. As expected, the ideal filter produces ringing and blurring on the original image. In fact, the ideal band-pass filter is the simplest, but with its step transfer function it is not realizable in the physical world.
1.21.2 Butterworth and Gaussian Band-Pass Filter

In analogy to the band-stop filters, the Butterworth and Gaussian band-pass filters can be defined. In particular, if we indicate a generic band-stop filter with $H_{BS}$, the corresponding band-pass filter $H_{BP}$ is obtained as follows:
$$H_{BP}(u,v) = 1 - H_{BS}(u,v) \qquad (1.78)$$
The band-pass filter control parameters remain the same as those already seen for the band-stop filters. These filters give better results, in particular the Gaussian filter, considering that its shape remains Gaussian in both the frequency and spatial domains. The ringing effects in the spatial domain of the reconstructed image are almost eliminated. As an alternative to (1.78), these filters can be derived by multiplying the transfer functions of the low-pass and high-pass filters in the frequency domain. Let $H_{BLP}$ be the Butterworth low-pass filter and $H_{BHP}$ the
Fig. 1.38 Difference of Gaussian band-pass filter: a 3D representation of the filter realized with l1 = 150 and l2 = 40; b its representation as a gray-level image; c the cross section profile; and d the result of the filtered image
Butterworth high-pass one (both filters having the same order n); the transfer function of the Butterworth band-pass filter is then
$$H_{BBP}(u,v) = H_{BLP}(u,v) \cdot H_{BHP}(u,v) \quad \text{with} \quad l_{BL} > l_{BH} \qquad (1.79)$$
where $l_{BL}$ and $l_{BH}$ are the cut-off frequencies of the low-pass and high-pass filters, respectively, with the constraint that the low-pass filter has the higher cut-off frequency. In this case, the filter bandwidth is $\Delta l = l_{BL} - l_{BH}$, and the center radius of the band-pass filter is $l_0 = \frac{l_{BL} + l_{BH}}{2}$. Similarly, the Gaussian band-pass filter $H_{GBP}$ can be derived.
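A minimal sketch of the product form of Eq. (1.79) with Butterworth low- and high-pass factors, again reusing `radius_grid` from the Sect. 1.19 sketch; the low-pass expression used here is the standard Butterworth form and is an assumption:

```python
import numpy as np

def butterworth_bandpass(shape, l_BL, l_BH, n=2):
    """H_BBP = H_BLP * H_BHP (Eq. 1.79), with l_BL > l_BH."""
    l = radius_grid(shape)                                   # helper from the Sect. 1.19 sketch
    l[l == 0] = 1e-6
    H_lp = 1.0 / (1.0 + (l / l_BL) ** (2 * n))               # Butterworth low-pass, cut-off l_BL
    H_hp = 1.0 / (1.0 + (l_BH / l) ** (2 * n))               # Butterworth high-pass, cut-off l_BH
    return H_lp * H_hp
```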
1.21.3 Difference of Gaussian Band-Pass Filter

The difference of two Gaussians can be used to form a band-pass filter in the frequency domain. The accentuation of the high frequencies associated with a given band can be obtained in the frequency domain with an $H_{BP}$ transfer function expressed as the difference of two Gaussians defined with different values of the standard deviation ν, given by
$$H_{BP}(u,v) = H(l,\nu_1) - H(l,\nu_2) = A e^{-\frac{l^2(u,v)}{2\nu_1^2}} - B e^{-\frac{l^2(u,v)}{2\nu_2^2}} \qquad (1.80)$$
with A ≥ B and the standard deviations $\nu_1 > \nu_2$. The impulse response of this filter has already been introduced in the spatial domain, in Sect. 1.14, and we rewrite it here:
$$h(x,y) = A e^{-\frac{r^2}{2\sigma_1^2}} - B e^{-\frac{r^2}{2\sigma_2^2}} \quad \text{with} \quad \sigma_x = \frac{1}{\nu_u} \qquad (1.81)$$
Remember that the widths $\sigma_x$ and $\nu_u$ of the Gaussians in the spatial and frequency domains are inversely proportional to each other, related as indicated above. In this context, the standard deviations are indicated with $\nu_1 = l_1$ and $\nu_2 = l_2$. Figure 1.38 shows the filter built with Eq. (1.80), the two Gaussians in the frequency domain, and the result of the filtering with the following parameter values: A = B = 1, $l_1$ = 150, and $l_2$ = 40. The transfer function profile is typical of a band-pass filter and of high-frequency enhancement filters, with the central lobe that cuts and attenuates
Fig. 1.39 Laplacian filter in the frequency domain: a 3D representation of the Laplacian filter given by the Eq. (1.83); b Original image; c Result of the Laplacian filter; and d Result of the enhanced image with the Eq. (1.85)
low frequencies. The $\nu_2/\nu_1$ ratio controls the filter bandwidth. Experimentally, it is shown that this filter gives good results with a wide bandwidth. Furthermore, it is worth mentioning that it can conveniently be applied both in the spatial domain and in the frequency domain, once the optimal filter parameters have been defined.
1.21.4 Laplacian Filter in the Frequency Domain

In Sect. 1.12, we described the Laplace operator in the spatial domain to enhance the contours of an image. It is based on the second derivative and has the characteristic of being isotropic. In the frequency domain, it can be implemented as a filter to enhance the strong variations in intensity present in the image (image enhancement). The transfer function of the Laplacian filter is obtained by considering the Fourier transform of the second derivative of a two-dimensional function. The Laplacian is given by Eq. (1.38), which we rewrite here as
$$\mathcal{F}\{\nabla^2 f(x,y)\} = -4\pi^2(u^2 + v^2)F(u,v) = H(u,v) \cdot F(u,v) \qquad (1.82)$$
from which the filter transfer function is derived:
$$H(u,v) = -4\pi^2(u^2 + v^2) \qquad (1.83)$$
It follows that in the frequency domain, the Laplacian of the image is given by (1.82), where F(u, v) is the Fourier transform of the image f(x, y). The Laplacian image in the spatial domain results from
$$\nabla^2 f(x,y) = \mathcal{F}^{-1}\{H(u,v) \cdot F(u,v)\} \qquad (1.84)$$
After appropriate normalizations of the input image f(x, y) and of the Laplacian image obtained with (1.84), the reconstructed enhanced image g(x, y) is given by
$$g(x,y) = f(x,y) - \nabla^2 f(x,y) \qquad (1.85)$$
Figure 1.39 shows the Laplacian transfer function, the filtering result with the evident dark image background typical of high-pass filters, and finally, the reconstructed enhanced image obtained by subtracting the Laplacian image from the original image.
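A minimal, self-contained sketch of Eqs. (1.82)-(1.85); the normalization to [0, 1] before the subtraction is an illustrative choice, since the text only asks for appropriate normalizations:

```python
import numpy as np

def laplacian_enhance(f):
    """Frequency-domain Laplacian (Eq. 1.83) followed by g = f - lap (Eq. 1.85)."""
    f = f.astype(float)
    rows, cols = f.shape
    u = np.arange(rows) - rows // 2
    v = np.arange(cols) - cols // 2
    V, U = np.meshgrid(v, u)
    H = -4.0 * np.pi ** 2 * (U ** 2 + V ** 2)                # Eq. (1.83)
    F = np.fft.fftshift(np.fft.fft2(f))
    lap = np.real(np.fft.ifft2(np.fft.ifftshift(H * F)))     # Eq. (1.84)
    fn = f / f.max()
    lapn = lap / np.abs(lap).max()
    g = fn - lapn                                            # Eq. (1.85)
    return np.clip(g, 0.0, 1.0)
```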
1.22 Sharpening Filters

This category of filters accentuates the high frequencies present in an image to improve its visual quality, making the discontinuous structures (points, lines, contours) sharper in order to facilitate the visual inspection of the image by a human observer. These filters are also particularly useful for image restoration and for obtaining sharp images in the printing process.
1.22.1 Sharpening Linear Filters

Unlike the edge extraction operators described above, these filters accentuate the high frequencies in order to enhance the discontinuous structures and improve their visibility. With Eq. (1.39), a first improvement (sharpening) of the visual quality of the structures of an image has already been realized. In essence, a high-pass filter (Eq. 1.85) has been applied in the frequency domain, where it is easier and more rigorous to perform high-pass filtering. Also in the spatial domain, the edge extraction operators have been described, with the typical impulse response of a high-pass filter, which computes the difference between the pixel being processed and its neighbors. Appropriate convolution masks (see Sect. 1.12) have been defined to produce a zero output value in areas with uniform intensities (sum of the mask coefficients equal to zero). Figure 1.40a illustrates how, in the spatial domain, the convolution operator achieves the intensification of the edges (sharpening). The function f(x) represents the profile of a rising front relative to an edge, i.e., a gentle transition of intensity. The impulse response $h_{hp}(x)$ of a typical high-pass filter (with a positive peak and two negative peaks at the ends) can be used to convolve the image f(x) from left to right. As the convolution proceeds, the two lateral peaks and the main lobe of $h_{hp}(x)$ meet the transition present in f(x). The result of the convolution, g(x), produces an exaltation of the transition, accentuating the slope and generating a higher signal (overshooting) than the original one, with two humps at the extremes of the transition. This is the typical effect produced by any filter that accentuates the high-frequency details in the spatial domain. In the frequency domain, this has already been highlighted with the ringing effects obtained in particular with the ideal high-pass filters, caused by the frequencies adjacent to the cut-off frequency. An enhanced image $g_e(x,y)$ in the spatial domain is obtained by first applying a high-pass filtering to the input image f(x, y) with the discrete convolution operator, and then subtracting or adding the result of the convolution to the input image:
$$g_e(x,y) = \begin{cases} f(x,y) - f(x,y) \ast h_{hp}(x,y) & \text{if } h_{hp}(0,0) < 0\\ f(x,y) + f(x,y) \ast h_{hp}(x,y) & \text{if } h_{hp}(0,0) > 0 \end{cases} \qquad (1.86)$$
where the impulse response $h_{hp}(x,y)$ is normally a high-pass filter (e.g., Laplacian, see masks 1.39) and $h_{hp}(0,0)$ indicates the central element of the convolution mask, which for a high-pass filter can be positive or negative.
Fig. 1.40 Sharpening linear filters in the spatial domain: a 1D graphical representation of the sharpening operator with a typical high-pass filter; b original image; c result of the enhanced image with the filter mask (1.88); d with the filter mask $h_{s2}$, and e with the filter mask $h_{s3}$
A simplified version of Eq. (1.86) is obtained by recalling the approximate formula of the Laplacian (Eq. 1.33) so that, substituting, we obtain
$$g_e(x,y) = f(x,y) - \nabla^2 f(x,y) \approx f(x,y) - [f(x+1,y) + f(x-1,y) - 4f(x,y) + f(x,y+1) + f(x,y-1)] = 5 f(x,y) - [f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1)] \qquad (1.87)$$
from which it is possible to obtain the following sharpening mask:
$$h_s(x,y) = \begin{bmatrix} 0 & -1 & 0\\ -1 & 5 & -1\\ 0 & -1 & 0 \end{bmatrix} \qquad (1.88)$$
and consequently, the sharpened image $g_s(x,y)$ is obtained with a single convolution operation using the isotropic filter (1.88) with directional increments of 90°. As for the Laplacian filter, we can use other masks with directional increments of 45°. The most used are
$$h_{s2} = \begin{bmatrix} -1 & -1 & -1\\ -1 & 9 & -1\\ -1 & -1 & -1 \end{bmatrix} \qquad h_{s3} = \begin{bmatrix} 1 & -2 & 1\\ -2 & 5 & -2\\ 1 & -2 & 1 \end{bmatrix} \qquad (1.89)$$
A characteristic of these masks is that the sum of the coefficients is always 1, to eliminate normalization problems. Figure 1.40 shows the results of the application
of the spatial sharpening filter. We observe the better sharpness of images (d) and (e), those filtered with the masks (1.89).
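A minimal sketch of the single-pass sharpening with the mask of Eq. (1.88), assuming SciPy's `convolve` with reflective borders:

```python
import numpy as np
from scipy.ndimage import convolve

H_S = np.array([[ 0, -1,  0],
                [-1,  5, -1],
                [ 0, -1,  0]], dtype=float)       # mask (1.88)

def sharpen(f, mask=H_S):
    """Single-pass sharpening: one convolution with a mask whose coefficients sum to 1."""
    return convolve(f.astype(float), mask, mode='reflect')
```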
1.22.2 Unsharp Masking

Another method of improving the visual quality of an image by accentuating the high spatial frequencies is known as unsharp masking. The name of this sharpening operator derives from the fact that the improvement of the sharpness of the image is obtained by subtracting from the original image its smoothed (in fact, unsharp) version. Given the original image f(x, y) and its smoothed version $f_G$, a sharpened image based on the unsharp masking operator is obtained from
$$g_{um}(x,y) = f(x,y) - f_G(x,y) \qquad (1.90)$$
The theoretical bases of this method are as follows. We consider, in the continuous case, the following one-dimensional impulse response:
$$h(x) = \delta(x) - e^{-\frac{x^2}{2\sigma^2}} \qquad (1.91)$$
where the first term indicates the Dirac delta function and the second term represents a Gaussian function with variance $\sigma^2$. If f(x) is the input signal to be convolved, the following is observed:
$$g(x) = f(x) \ast h(x) = f(x) \ast \left[\delta(x) - e^{-\frac{x^2}{2\sigma^2}}\right] = f(x) \ast \delta(x) - f(x) \ast e^{-\frac{x^2}{2\sigma^2}} = f(x) - f(x) \ast e^{-\frac{x^2}{2\sigma^2}} = f(x) - f_G(x;\sigma) \qquad (1.92)$$
Basically, the improvement of the sharpness of the image, accentuating the high spatial frequencies, is obtained by subtracting, pixel by pixel, from the original image f(x) a blurred image obtained with a Gaussian filter with standard deviation σ (a typical low-pass smoothing filter, $f_G(x;\sigma)$). The sharpening process based on the unsharp masking approach is graphically highlighted in Fig. 1.41a, where it is observed how the convolution operator (1.92), based on a spatial low-pass filter, achieves the intensification of the edges in a similar way to the sharpening operator (1.86) based on the high-pass filter described above (Eqs. 1.85 and 1.86). From Eq. (1.91), we can directly derive the impulse response of an unsharp masking filter $h_{um}(x,y)$, considering the peculiarities of a sharpening filter, that is, it must have positive and negative coefficients with their sum greater than one and with the central pixel of the mask greater than zero. If we indicate with $h_{lp}(x,y)$ a generic low-pass filter, a general form of the $h_{um}(x,y)$ filter is given by
$$h_{um}(x,y) = \delta(x,y) - k \cdot h_{lp}(x,y)$$
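A minimal sketch of unsharp masking based on Eqs. (1.90)-(1.92): the residual f − f_G is computed with SciPy's Gaussian smoothing and, as commonly done in practice, added back to the original; the weighting parameter k used here is an illustrative assumption, not the book's derivation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(f, sigma=2.0, k=1.0):
    """g = f + k * (f - f_G): the unsharp-masking residual of Eq. (1.90) is added back
    to the original image, with k controlling the amount of sharpening."""
    f = f.astype(float)
    detail = f - gaussian_filter(f, sigma)    # g_um of Eq. (1.90)
    return f + k * detail
```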
1.22.3 High-Boost Filtering

In analogy with Eq. (1.86), the high-boosted image $g_{hb}(x,y)$ combines the original image, weighted by a parameter a, with the high-pass filtered image, weighted by a parameter b:
$$g_{hb}(x,y) = \begin{cases} a \cdot f(x,y) - b \cdot f(x,y) \ast h_{hp}(x,y) & \text{if } h_{hp}(0,0) < 0\\ a \cdot f(x,y) + b \cdot f(x,y) \ast h_{hp}(x,y) & \text{if } h_{hp}(0,0) > 0 \end{cases} \qquad (1.96)$$
where the impulse response $h_{hp}(x,y)$ is normally a high-pass filter (e.g., Laplacian, see the masks 1.39) and $h_{hp}(0,0)$ indicates the central element of the convolution mask, which for a high-pass filter can be positive or negative. It is noted that parameter a controls how the low frequencies of the original image are combined with the high-frequency components, emphasized by parameter b ≥ 1 and recovered from the high-pass filtering. If we consider the Laplacian high-pass filters $h_{hp}$ (see in Sect. 1.12 the filter masks 1.35 and 1.39), but with a positive central element, and the pass-all filter $h_{pa}$ (kernel represented by the delta function δ(x, y)), a high-boosted image $g_{hb}(x,y)$ is obtained from the following:
$$g_{hb}(x,y) = a \cdot f(x,y) + b \cdot f_{hp}(x,y) = a \cdot h_{pa}(x,y) \ast f(x,y) + b \cdot h_{hp}(x,y) \ast f(x,y) = [a \cdot h_{pa}(x,y) + b \cdot h_{hp}(x,y)] \ast f(x,y) = h_{hb}(x,y) \ast f(x,y) \qquad (1.97)$$
The first expression is motivated by the definition of the high-boost filter (second equation of 1.96), the second expression is obtained by considering the input image filtered with the pass-all kernel, that is, $f(x,y) \leftarrow h_{pa} \ast f(x,y)$, and the third expression of (1.97) follows by recalling that convolution is a linear operator. The high-boost filter is thus obtained:
$$h_{hb}(x,y) = a \cdot h_{pa}(x,y) + b \cdot h_{hp}(x,y) \qquad (1.98)$$
and some examples of $h_{hb}(x,y)$ high-boost filters are derived as follows:
$$h_{hb4}(x,y) = a \cdot \begin{bmatrix} 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{bmatrix} + b \cdot \begin{bmatrix} 0 & -1 & 0\\ -1 & 4 & -1\\ 0 & -1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -b & 0\\ -b & 4b + a & -b\\ 0 & -b & 0 \end{bmatrix} \qquad (1.99)$$
$$h_{hb8}(x,y) = a \cdot \begin{bmatrix} 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{bmatrix} + b \cdot \begin{bmatrix} -1 & -1 & -1\\ -1 & 8 & -1\\ -1 & -1 & -1 \end{bmatrix} = \begin{bmatrix} -b & -b & -b\\ -b & 8b + a & -b\\ -b & -b & -b \end{bmatrix} \qquad (1.100)$$
Fig. 1.42 Application of the high-boost filters (1.99) and (1.100) with the control parameters shown in the figure: $h_{hb4}$ with a = 0.2, b = 2; $h_{hb8}$ with a = 0.4, b = 3 and with a = 0.4, b = 4
Fig. 1.43 Application of the hfe filter to emphasize the high frequencies using Eq. (1.103) with the control parameters a = 0.26 and b = 1.6, based on a Butterworth high-pass filter with cut-off frequency $l_0$ = 50 px
In Fig. 1.42, the results of the high-boost filtering are shown, obtained by applying the filter masks (1.99) and (1.100) with the different values of the control parameters a and b shown in the figure. The results of the filtering and the values of the control parameters are strictly dependent on the type of image. A good compromise between the brightness and the exaltation of the image details depends on the level of suppression of the low frequencies (through parameter a: with values close to zero the brightness decreases) and on the emphasis given to the high frequencies (through parameter b, with values of 2, 3, ...), to better highlight the details of the image (texture, edges, and contours).
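A minimal sketch of Eq. (1.100) in use; the default parameter values match one of the settings reported in Fig. 1.42, and the border mode is an assumption:

```python
import numpy as np
from scipy.ndimage import convolve

def highboost_mask8(a, b):
    """Mask (1.100): a times the pass-all kernel plus b times the 8-neighbor Laplacian."""
    h_pa = np.zeros((3, 3))
    h_pa[1, 1] = 1.0
    h_hp = np.array([[-1, -1, -1],
                     [-1,  8, -1],
                     [-1, -1, -1]], dtype=float)
    return a * h_pa + b * h_hp

def highboost(f, a=0.4, b=3.0):
    """Apply h_hb8 with parameter values as in Fig. 1.42."""
    return convolve(f.astype(float), highboost_mask8(a, b), mode='reflect')
```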
64
1 Local Operations: Edging
1.22.4 Sharpening Filtering in the Frequency Domain The proposed methods of improving visual image quality can be implemented directly in the frequency domain. In this case, high-pass filters can be used in the frequency domain already described above (Butterworth high-pass, Gaussian highpass, etc.). Recalling the Eq. 1.67 (that is, the derivation of the transfer function of a high-pass filter once known the low-pass filter), the transfer function of the unsharp masking filter associated with Eq. (1.95) would result Hhp (u, v) = 1 − Hlp (u, v)
(1.101)
Basically, the low-pass filter with the high-pass filter is replaced. Similarly, the transfer function of the high-boost filter is defined, expressed by Hhb (u, v) = 1 + k · Hhp (u, v)
(1.102)
where the parameter k > 1 controls the level of emphasis of high-frequency components. A more general expression is defined to indicate the filter known as HighFrequency Emphasis—hfe: Hh f e (u, v) = a + b · Hhp (u, v)
with
a ≥ 0;
and
b > a (1.103)
where b has the same meaning as k, it controls the level of emphasizing the high frequencies, while the parameter a controls the intensity of the final image preventing the total removal of the continuous component by the high-pass filter. If a = b = 1, we get the high-boost filter back. This type of filter can be effective for images that have a limited dynamic of gray levels toward low values (for example, dark images obtained by microscope, X-rays, ...). Figure 1.43 shows the results of the h f e filter applied to a radiographic image. The high-pass filter used is that of Butterworth with a cut-off frequency of l0 = 50r px and then the high frequencies were emphasized by Eq. (1.103) with parameter values a = 0.26 and b = 1.6. Finally, the end result is shown after equalizing the histogram of the filtered image.
1.22.5 Homomorphic Filter An additional filter for the improvement of the visual quality of an image (but often used also for the removal of the multiplicative noise) is given by the homomorphic filter based on the logarithmic function. In particular, the homomorphic filter is effective for images acquired in very different light conditions. The action of the filter for these types of images must reduce the intensity in the areas of the image with highlights (make the intensity of light uniform) and at the same time improve the contrast (lighten the dark areas). From the image formation model, we know that the intensity of the image is closely related to the lighting s of the scene and to the r reflectance property of the objects. It is also known that these two components, lighting and reflectance, are combined in a multiplicative way and the formation model of the image f (x, y) can be expressed as follows f (x, y) = s(x, y) · r (x, y)
(1.104)
1.22 Sharpening Filters
65
From the phenomenological point of view, we can consider the variation of brightness as the noise to attenuate or remove. The objective of the homomorphic filter is to remove this noise in the frequency domain. This is possible if in this domain, the two lighting and reflectance components are separate and localizable. From the mathematical point of view, we know that the Fourier transform cannot be applied to Eq. (1.104) because the transformation of the product of two functions is not separable, that is, F { f (x, y)} = F {s(x, y)} · F {r (x, y)} (1.105) To make the two additive components, we introduce the logarithmic function thus obtaining z(x, y) = ln f (x, y) = ln[s(x, y) · r (x, y)] = ln s(x, y) + ln r (x, y)
(1.106)
It follows that F {z(x, y)} = F {ln f (x, y)} = F {ln s(x, y)} + F {ln r (x, y)}
(1.107)
If we indicate the Fourier transform of a function with capital letters, Eq. (1.107) can be rewritten as follows: Z (u, v) = S(u, v) + R(u, v)
(1.108)
Having separated now the two components, luminance and reflectance, it is possible to define a filter H (u, v) with the capacity to act simultaneously on S(u, v) and R(u, v) to attenuate the low frequencies and to emphasize the high frequencies. This characteristic of the filter is motivated by the fact that the lighting component generates in the image small spatial variations while the reflectance component introduces strong spatial variations (light reflected by the objects generates strong discontinuity in the contours). Therefore, the lighting component is associated with the low frequencies to be attenuated while the reflectance component is associated with the high frequencies to be accentuated. These assumptions do not constitute an exact model even if homomorphic filtering is a good approximation. If we indicate with Ho (u, v) the homomorphic filter, we can apply it to the equation(1.108) to obtain the Fourier transform of the filtered image Q(u, v) given by Q(u, v) = Ho (u, v) · Z (u, v) = Ho (u, v) · S(u, v) + Ho (u, v) · R(u, v) (1.109) To obtain the intermediate filtered image q(x, y) in the spatial domain, the inverse Fourier transform is applied as follows: q(x, y) = F −1 {Q(u, v)} = F −1 {Ho (u, v) · Z (u, v)} = F −1 {Ho (u, v) · S(u, v)} + F −1 {Ho (u, v) · R(u, v)} = s (x, y) + r (x, y)
(1.110)
Recalling that at the beginning we had applied the logarithmic function (Eq. 1.106) to separate the lighting and reflectance components, in order to obtain the final image, we have to apply the inverse function of the logarithm, that is,
g(x, y) = eq(x,y) = es (x,y)+r (x,y) = es (x,y) er (x,y) = so (x, y) · ro (x, y) (1.111)
66
1 Local Operations: Edging
f(x,y)
S(u,v)
s(x,y) r(x,y)
ln
FFT R(u,v)
s’(x,y) r’(x,y)
H(u,v)
g(x,y) exp
Fig. 1.44 Functional diagram of the whole homomorphic filtering process Original image
High frequencies emphasized a=0.5, b=2
High frequencies emphasized γL =0.25 γH =2
Cr 2
sis
1.8
ic
ph
cy
1.4
qu en fre
1.2
om
m
Ho
gh
1
γ L =0.25
or
Gaussia
Hi
H(u,v)
γ H =2
em ph a
1.6
0.8
n High
Pass
0.6 0.4 0.2 0
0
5
10
15
l(u,v)
20
25
30
Fig. 1.45 Application of the homomorphic filter with different transfer functions. The first column shows the test image and the 1D transversal profiles of the filters; in the second column, we have the result of the Gaussian high-pass filter with l0 = 12 px; in the third column, the results of the High-Frequency Emphasis filter Hh f e (Eq. 1.103) with the control parameters a = 0.5 and b = 2 based on a Gaussian high-pass filter with cut-off frequency l0 = 12 px; and in the fourth column, the results of the Ho homomorphic filter related to Eq. 1.103 with the parameters γ L = 0.25, γ H = 2 and l0 = 12 px. The images in the second row are the equalized of the first ones
where with so (x, y) = es (x,y) and ro (x, y) = er (x,y) are, respectively, indicated the illuminance and the reflectance of the output image. Figure 1.44 shows the functional scheme of the whole homomorphic filtering process. At this point, it is necessary to design a transfer function H (x, y) to model the above-mentioned assumptions of the lighting and reflectance components. In other words, the function H (u, v) must have a simultaneous dual action: attenuate the lighting component by attenuating the coefficients of the low frequencies (thus mitigating artifacts and areas of shadows in the image) and accentuating the high-frequency coefficients to make the image more contrasting. Several homomorphic filters have been proposed in the literature to improve the visual quality of an image. Basically, based on high-pass filters (Gaussian and But-
1.22 Sharpening Filters
terworth), a homomorphic filter can be structured as follows:
2 2 Ho (u, v) = (γ H − γ L ) 1 − e−c(l (u,v)/l0 ) + γ L
67
(1.112)
where γ H is the parameter that controls the high frequencies, γ L controls the attenuation of the low frequencies, l0 represents the cut-off frequency, l(u, v) the distance between the coordinates (u, v) and the center (u, v = 0, 0) of the frequencies. The constant c controls the steepness of the curve toward the high frequencies. With γ H > 1 and γ L < 1, the filter function tends to decrease the contribution of the illuminance (which concerns to low frequencies) and amplifies the contribution of the reflectance (mainly associated with high frequencies). Finally, the result will be a simultaneous decrease in the dynamic range and the increase in contrast. From empirical evaluations, it is effective to use γ L = 0.5 to halve the spectral energy of illuminance and the high-frequency gain γ H = 1.5 ÷ 2.0, tending to double the spectral energy of the reflectance component. Figure 1.45 highlights the results of the homomorphic filter applied to an image with considerable variations of light and with very dark areas. The value of the constant is c = 0.5. Three types of filters have been applied: Gaussian normal highpass, high-frequency emphasis based on Gaussian high-pass filter, and homomorphic filter parameterized with Eq. (1.112). The results of the latter as shown in Fig. 1.45 are better by offering a balanced level in the action of making the image with uniform brightness (attenuating the low frequencies) and enhancing the high frequencies to improve the contrast and the display of the details.
References 1. J. Canny, A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986) 2. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 211 (1981), pp. 151–180 3. R.M. Haralick, Digital step edges from zero-crossings of second directional derivatives. IEEE Trans. Pattern Anal. Mach. Intell. 6(1), 58–68 (1984) 4. R.M. Haralick, L. Watson, A facet model for image data. Comput. Graph. Image Process. 15, 113–129 (1981) 5. Ron Kimmel, Alfred M. Bruckstein, On regularized laplacian zero crossings and other optimal edge integrators. Int. J. Comput. Vis. 53(3), 225–243 (2003) 6. T. Lindeberg, Edge detection and ridge detection with automatic scale selection. Int. J. Comput. Vis. 30(2), 117–154 (1998) 7. D. Marr, Vision. A Computational Investigation into the Human Representation and Processing of Visual Information, 1st edn. (The MIT Press, 2010). ISBN 978-0262514620 8. E. Marr, D.; Hildreth, Theory of edge detection, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 207 (1167) (1980), pp. 187–217 9. R. Deriche, Using canny’s criteria to derive a recursively implemented optimal edge detector. Int. J. Comput. Vis. 1, 167–187 (1987)
68
1 Local Operations: Edging
10. T. Young, On the theory of light and colors, in Lectures in Natural Philosophy, vol. 2 (613) (Joseph Johnson, London, 1807) 11. S.E. Umbaugh, Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools, 2nd edn. (CRC Press, Boca Raton, 2010). ISBN 9-7814-3980-2052
2
Fundamental Linear Transforms
2.1 Introduction This chapter introduces the fundamental linear transforms that have immediate application in the field of image processing. Linear transforms are used in particular to extract the essential features contained in the images. These characteristics, which effectively synthesize the global information content of the image, are then used for other image processing processes: classification, compression, description, etc. Linear transforms are also used to improve the visual quality of the image (enhancement), to attenuate the noise (restoration), or to reduce the dimensionality of the data (data reduction). In Sect. 9.11 Vol. I presented the Fourier Transform (FT), which is an example of a linear transform and we have already seen the usefulness in the field of image processing. In particular it has been highlighted how the DFT (Discrete Fourier Transform) extracts the characteristic frequencies of the image, expressed numerically through the amplitude (or magnitude) and phase values. We have seen, for example, how the high frequencies represent the linear geometric structures in the images, where the amplitude describes the variations of intensity (whose average value is proportional to the term DC, average value of the image) while the phase Indicates the orientation of these geometric structures. Typically a linear transform, geometrically, can be seen as a mathematical operator that projects (transforms) the input data into a new output space, which in many cases better highlights the informational content of the input data. In the case of the Fourier transform, the input image normally represents color information or light intensity, while the output space represents the frequencies, called the spectral domain. Linear transforms will be applied for discrete digital images that will be described in matrix format or as a set of vectors (rows or columns).
© Springer Nature Switzerland AG 2020 A. Distante and C. Distante, Handbook of Image Processing and Computer Vision, https://doi.org/10.1007/978-3-030-42374-2_2
69
70
2 Fundamental Linear Transforms
2.2 One-Dimensional Discrete Linear Transformation Let x = (x0 , x1 , . . . , x N −1 ) be a sequence of N samples and let T be the transformation matrix of dimensions N × N , the following equation: y= T ·x
(2.1)
defines the linear transform of the input sequence x which generates in output a new sequence of N elements y = (y0 , y1 , . . . , y N −1 ) also known as the transform coefficients. The T matrix is often called also the transformation’s kernel matrix that has a different meaning from the mask matrix of the convolution operator discussed in Sect. 9.10 Vol. I. Equation (2.1) represents a linear transform, because the output samples yi are obtained as linear combinations between the row elements of T and the input samples xi , i.e., each yi value is obtained from the scalar product of the input vector x and the column ith of T . The sequence of the input samples x can be retrieved from the samples y, through the inverse process of transformation using the following equation: (2.2) x = T −1 y in the hypothesis that the inverse matrix of T exists, i.e., T is non-singular. It can be observed how the sequence of the input samples xi obtained with the (2.2) represents the inner product between y and the columns of T = [T0 , T1 , . . . , TN −1 ]. By virtue of the Eqs. (2.1) and (2.2), y and x form a pair of transformation by T .
2.2.1 Unitary Transforms The transformation matrix T in Eq. (2.1) represents infinite transformations that can be applied to the x sequence of N samples. Among these it is useful to apply some transformations that have particular properties, that is, the unitary transformations, which occur if the inverse matrix is given by T −1 = T ∗t
(2.3)
where “*” indicates the conjugate complex of each element of T and the symbol t indicates the transposed matrix operation. If the Eq. (2.3) is satisfied, the matrix T is called the matrix ! unitary, i.e., that the matrix T is invertible and its inverse T −1 is equal to its conjugate transpose T ∗t . Often the conjugated transposed matrix of a T matrix is called the Hermitian matrix and is indicated by T H .
2.2.2 Orthogonal Transforms If the matrix T in addition to being unitar y is constituted by real numbers is called orthogonal matrix and in this case we speak of orthogonal transformations. It follows that the inverse matrix of T corresponds with the transposed and the (2.3) becomes T −1 = T t
(2.4)
2.2 One-Dimensional Discrete Linear Transformation
71
From the Eqs. (2.3) and (2.4), considering the unitary matrix T , the corresponding equations are obtained (2.5) T T ∗t = T ∗t T = I TTt = Tt T = I
(2.6)
where I is the identity matrix. Considering this, the unitary transformation expressed by (2.2) is rewritten as follows: x = T −1 y = T ∗t y = T H y
(2.7)
which represents a class of representations of x where y is a class of significant coefficients of the same vector x, most useful, in various application of signal and image processing.
2.2.3 Orthonormal Transforms For the orthogonality of the unitary matrix T , it is observed that the element (j, k)th of T T t is the inner product (scalar) of the vector column ith of T and of the vector row jth of T t . Equation (2.6) also informs us that this inner product is zero everywhere, except in the diagonal elements where j = k, which all have a unit value. A real orthogonal matrix (2.4) is also unitary, but a unitary matrix (2.3) does not need to be orthogonal. The column vectors (or rows) of a unitary matrix T are orthonormal (orthogonal and normalized) if in addition to being unitary they are two to two orthogonal to each other, namely 1 if j = k (2.8) (T j · Tk ) =< T j , Tk >= δ j,k = 0 if j = k they form a complete set of base vectors in a N -dimensional space. Recall that a normalized vector has the nor m equal to the unit x 2 =
N −1
|x j |2 = 1
j=0
In the orthonormality conditions of the N basic vectors Tk we can rewrite the unitary transformation as follows: yk = < Tk , x > =
N −1
Tk, j · x j
(2.9)
j=0
which can be considered as the projection of the input sample vector x on the associated base vector Tk . The inverse unitary transformation is given by xj =
N −1 k=0
Tk,Hj · yk
(2.10)
72
2 Fundamental Linear Transforms
The Hermitian matrix was considered in the general case of samples with complex values. The new transform coefficients are considered as the components of the input vector x in the direction of the basic vectors and, in fact, a projection (rotation, translation, …) is realized in the new space, characterized precisely by the basic vectors.
2.2.4 Example of One-Dimensional Unitary Transformation An example of a unitary transformation is given by the one-dimensional DFT. We recall, in fact, that the DFT, applied to a sequence f of N samples is given by N −1 ju 1 (2.11) f ( j) exp −i2π F(u) = √ N N j=0
with u = 0, 1, . . . N − 1. The (2.11) expressed in matrix form becomes F=W·f
(2.12)
with W unitary matrix
j 1 w j,u = √ exp −i2π u N N The inverse transform IDFT is given by N −1 1 ju f ( j) = √ F(u) exp i2π N N u=0
(2.13)
(2.14)
with j = 0, 1, . . . N − 1. In vector form it becomes f = W∗t · F The kernel column vectors Hu of the inverse transform are ⎡ ⎤ 1 u
⎢ exp i2π ⎥ N ⎢ ⎥ ⎥ .. Wu = √1 ⎢ ⎢ ⎥ N ⎣ . ⎦
(2.15)
(2.16)
exp i2π (N −1)u N
with u = 0, 1, . . . N − 1. The input samples are reconstructed with the following: f ( j) =
N −1 u=0
Wu, j · F(u)
(2.17)
2.3 Two-Dimensional Discrete Linear Transformation
73
2.3 Two-Dimensional Discrete Linear Transformation In the two-dimensional case, let’s consider the discrete image f of N × N elements. The linear transform is given by F(u, v) =
N −1 N −1
f ( j, k)T ( j, k, u, v)
(2.18)
j=0 k=0
where the indices u, v, j, k vary from 0 to N − 1 and T is the transformation matrix of dimensions N 2 × N 2 . This matrix can be thought of as blocks of data Tb (u, v) whose elements are indicated with ( j, k). The transformation is called separable if it is possible to separate the kernel matrix T into the product of component functions of TR rows and TC columns in the following form: T ( j, k, u, v) = TR ( j, u) · TC (k, v)
(2.19)
It follows that the transformation can be carried out in two phases, first by executing the one-dimensional transform per row, followed by those for columns or vice versa, as follows: −1 N −1 N F(u, v) = f ( j, k)TC (k, v) TR ( j, u) (2.20) j=0
k=0
If the two component functions TR and TC are identical, the transform is also called symmetric, not to be confused with the definition of a symmetric matrix, i.e., T = T t with T ( j, k) = T (k, j). In this case the (2.19) becomes T ( j, k, u, v) = T ( j, u)T (k, v) and consequently the (2.18) becomes F(u, v) =
N −1 j=0
T ( j, u)
N −1
(2.21)
f ( j, k)T (k, v)
(2.22)
k=0
or F = T fT
(2.23)
where T is a unitary matrix analogous to that seen for the one-dimensional transform and F unit symmetric transformation. The inverse transform is obtained as follows: f = T −1 FT −1 = T ∗t FT ∗t
(2.24)
that completely recovers the original image f . With the separable transform, the computational complexity is reduced from O(N 4 ) to O(N 3 ).
74
2 Fundamental Linear Transforms
2.3.1 Example of a Two-Dimensional Unitary Transformation As an example of a two-dimensional unitary transformation we can consider the two-dimensional DFT. In fact, the DFT, as already seen in Sect. 9.11 Vol. I is a symmetric and separable transform. In this case the transformation matrix T of (2.22) is represented by the W matrix defined with the (2.13), while the inverse matrix W −1 is the conjugated transposition of W . The direct and inverse transformations are, respectively, given by F = WfW and f = W∗t FW∗t
(2.25)
Unlike the DFT, other transformations have the transformation matrix with all real elements. A unit matrix with only real elements is also an orthogonal matrix and, consequently, the inverse transform becomes f = T t FT t
(2.26)
If then the matrix is also symmetrical, as often happens, the direct and inverse transform are identical F = T f T and f = T FT (2.27)
2.4 Observations on Unitary Transformations In general, the unitary transformation of an image can be interpreted in various ways. A first interpretation sees the unitary transformation as a transformation from one multidimensional space to another, characterized by the transformation matrix T which introduces a rotation of input vectors. A unitary transformation is characterized by the rows of the transformation matrix T , seen as basic functions. The set of rows of the transformation matrix T form a set of basic vectors representing a vector space of N dimensions. Such vectors for a unitary transformation are orthonormal, namely T T ∗t = I ⇔
N −1
T(j, i) · T∗ (k, i) = δ(j, k)
(2.28)
i=0
where δ( j, k) is the Kronecker function. In other words, with the unitary transformation y = T · x any vector x can be expressed as the weighted sum of the base vectors of unit length. Considering that the unitary transformation is a linear isometry, i.e., it preserves the inner product of the basic vectors, the nor m remains invariant (i.e., the length of the vectors has not changed), after the transformation, and we have y2 =
N −1 i=0
|yi |2 = y∗t y = x∗t T ∗t T x = x∗t x =
N −1
|xi |2 = x2
(2.29)
i=0
It follows that the unitary transformation from the geometric point of view, introduces only a rotation in the N -dimensional space for each input vector x. The transformed vectors y can be seen as the projection of the input vectors x into the new space
2.4 Observations on Unitary Transformations
75
represented by the new bases. The basic functions in the discrete Fourier transform DFT are represented by the complex exponential components w j,u given by the (2.13) that make up the DFT kernel. Seen as a unitary linear transform the DFT is represented by (2.11) and each output vector component is obtained from the inner product between the one-dimensional complex input vector f and one of the kernel orthonormal base vectors described by Eq. (2.13). The inverse DFT, with which the input vector f is completely reconstructed, is obtained with the (2.14) by calculating the inner product between the coefficients vectors F(u), for u = 0, 1, . . . , N − 1, and the columns of the inverse matrix Wu , given by (2.16). The vector interpretation of the unitary transform can be extended also in the two-dimensional case, considering an image f as a matrix of dimensions N × N . In this case, the rows of the images can be arranged sequentially, constituting a one-dimensional vector of N 2 × 1 elements. Under conditions of 2D orthonormal transformation the transformation action, in this case, corresponds to a rotation and vice versa with the inverse transform it is possible to return to the original space. The direct linear transformation (2.18) of an image is given by introducing the concept of basic images in analogy to the basic functions seen in the one-dimensional case. This can be interpreted as an analysis process, i.e., the input image is decomposed into elementary components through the basic images. The transform coefficients are the weight information of each base image to represent the original image. Recall that the two-dimensional inverse unitary transform of (2.18) is given by f ( j, k) =
N −1 N −1
F(u, v)T −1 ( j, k, u, v)
(2.30)
u=0 v=0
where T −1 denotes the matrix of the inverse unitary transform. Equation (2.30) can be interpreted as the 2D inverse transform that reconstructs the original image f ( j, k), known as the process of synthesis of the image, which reassembles the original data starting from the coefficients (also called components) of the transformed F(u, v). This is done through a summation process with the set of basic images T −1 ( j, k; u, v) for a given transformation domain represented by (u, v). Basically, the kernel T −1 ( j, k; u, v) represents the basis images in analogy to the basis functions. In this interpretation, in the summation, each element of F represents the multiplicative coefficient of the corresponding basis image, thus generating the synthesis of the original image. A basic image can be generated by the inverse transformation of the coefficient matrix containing only a nonzero element, placed as a unit value. The number of possible matrices are N 2 and consequently correspond N 2 basis images. If some coefficients are not considered during the inverse transform (2.30), an approximate reconstruction of the original image is obtained f ( j, k) =
P−1 Q−1 u=0 v=0
F(u, v)T −1 ( j, k, u, v)
(2.31)
76
2 Fundamental Linear Transforms
where we consider only P×Q coefficients instead of N 2 . An estimate of the quadratic error σ 2 is given by σ2 =
N −1 N −1
[ f ( j, k) − f ( j, k)]2
(2.32)
j=0 k=0
The image, however, is completely reconstructed with the (2.30), in the conditions of unitary transformation according to (2.24), thus obtaining direct and inverse transformations to represent the original image in the form of a linear combination of a set of basis images. In analogy to the one-dimensional case, the basis images can be thought of as a set of base components, where any image can be rebuilt. In other words, the direct transform performs the decomposition of the original image, generating the coefficients that characterize it in this new domain. The inverse transform performs the reconstruction by the summation process of the basis images weighed by the relative coefficients. With the existence of infinite sets of basis images, it follows that the possible transformations are also infinite. In relation to the type of application, it is necessary to choose the appropriate basis images for a particular transformation.
2.4.1 Properties of Unitary Transformations We have previously pointed out that in unitary transforms the inner product on vectors preserves their nor m (2.29) and consequently the energy associated with the input signal is preserved. In fact, from a geometric point of view, the unitary transformation introduces a rotation of the vectors representing the input samples. In the 2D context, for an image f ( j, k) of dimensions N × N it is shown that N −1 N −1 j=0 k=0
| f ( j, k)|2 =
−1 N −1 N
|F(u, v)|2
(2.33)
u=0 v=0
Unitary transforms tend to concentrate the energy of the input signal into a small subset of the coefficients of the transform itself. A quantitative measure of this property is evaluable by calculating the mean and covariance of the input and output signals. In addition, the coefficients tend to be less correlated. Finally, entropy, seen as a measure of information intrinsic to the signal (or uncertainty level), is preserved in unitary transforms. For different applications (compression and transmission, dimensionality reduction, restoration,…) it is useful to have transformed that better compact the energy and decorrelate the data in a meaningful way (to reduce, for example, the dimensionality and the computational load). The following paragraphs will describe other transformations based also on the statistical characteristics of the input signal. Those described so far have been non-dependent on the source signal.
2.5 Sinusoidal Transforms
77
2.5 Sinusoidal Transforms Let us now return to the Fourier transform and remember that it is a unitary transformation with separable and symmetrical transformation kernel matrices. The onedimensional Fourier transform has the basis functions consisting of the complex exponential components given by (2.13), which can be decomposed into the sine and cosine sinusoidal components described in Sect. 9.11.1 Vol. I. In matrix form the one-dimensional DFT is given by F0 w0,0 w0,1 . . . w0,N −1 f 0 F1 w1,0 w1,1 . . . w1,N −1 f 1 (2.34) .. .. = .. .. .. . . . . ... . FN −1 w N −1,0 w N −1,1 . . . w N −1,N −1 f N −1 where f = ( f 0,1,......,N−1 ) is the complex vector of input and F = (F0,1,......,N −1 ) is the spectrum vector of the one-dimensional DFT and the kernel matrix elements w j,k are given by (2.13). Kernel matrix W is a unitary matrix considered to be the specific periodicity of the imaginary exponential component. If the input vector represents a sequence of samples expressed by real numbers, the output vector F will generally consist of complex components. If the input signal is in symmetrical, F would be with real components. The direct transform DFT and the inverse transform of the DFT (often denoted by IDFT), recalling what was discussed in the previous paragraph, would result (2.35) F = Wf and f = W∗t F For an image f of size N × N , the direct and inverse DFT, in matrix form, are rewritten as follows: (2.36) F = WfW and f = W∗t FW∗t where W is the kernel matrix of the transform consisting of N × N basis images W C,R of the following type: 1 wuN 2u (N −1)v (2.37) WC,R = w N 1 wvN w2v = WC ⊗ W R . . . w N N .. (N.−1)u w N where “⊗” indicates the symbol of the outer product and the exponential terms have been represented in the most compact form with respect to (2.13) as follows: −i2π (2.38) w N = exp N With reference to the exponential terms (2.38), the direct unitary transforms DFT and the inverse transform IDFT can be rewritten as follows: N −1 N −1 1 ju F(u, v) = √ f ( j, k)w N wkv N N j=0 k=0
0 ≤ u, v ≤ N − 1
(2.39)
78
2 Fundamental Linear Transforms
N −1 N −1 1 − ju f ( j, k) = √ F(u, v)w N w−kv N N u=0 v=0
0 ≤ j, k ≤ N − 1
(2.40)
with the advantage of having them in separable form. It can be observed from the Eq. (2.37) how the basis matrices are expressed in terms of the outer product of the column and row vectors representing the basis functions of the two-dimensional DFT. Under these conditions the kernel matrix is said to be separable. In the following paragraphs we will describe other unitary transforms (transformations such as the sine, cosine, Hartley, etc.), which use sinusoidal functions as a base, in analogy to what is seen with the Fourier transform. Although the Fourier transform have the benefit of the important properties discussed and is useful for image processing applications, it presents a computational problem because it operates with complex numbers. It is, therefore, necessary to consider also other transforms that can operate with real numbers.
2.6 Discrete Cosine Transform (DCT) The DCT represents a signal or an image as a sum of sinusoids of various magnitudes and frequencies [1–4]. The DCT of a 1D sequence of input f ( j), for j = 0, 1, . . . , N − 1, is defined by the following: N −1 π(2 j + 1)u (2.41) f ( j) cos Fc (u) = c(u) 2N j=0
with u = 0, 1, . . . , N − 1. The inverse DCT is given by N −1 N −1 π(2 j + 1)u f ( j) = c(u)Fc (u) cos 2N
(2.42)
u=0 v=0
with j = 0, 1, . . . , N − 1. It can be observed as for the DFT, also for the DCT the first coefficient, i.e., for u = 0 we have N −1 1 Fc (0) = f ( j) (2.43) N j=0
obtaining the average value of the input sequence, normally referred to as the DC coefficient. In the 2D context the DCT is defined as N −1 N −1 π(2k + 1)v π(2 j + 1)u cos f ( j, k) cos Fc (u, v) = c(u)c(v) 2N 2N j=0 k=0
(2.44) where the factors c(u) and c(v) are defined by the relationships 1 2 c(0) = and c(u) = c(v) = N N
(2.45)
2.6 Discrete Cosine Transform (DCT)
79
with u, v = 1, 2, . . . , N − 1. The inverse DCT is given by N −1 N −1 π(2k + 1)v π(2 j + 1)u cos c(u)c(v)Fc (u, v) cos f ( j, k) = 2N 2N u=0 v=0 (2.46) with j, k = 0, 1, . . . , N − 1. The direct and inverse transformation of the 2D DCT can also be expressed in a matrix form as a unitary transformation given by the following expressions: Fc = CfCt
f = Ct Fc C
(2.47)
where f is the input image (handled in image blocks of dimensions N × N pixels, normally with N = 8), Fc the matrix N × N containing the coefficients of the transform, and C denotes the kernel matrix of size N × N of the DCT whose elements are given by π(2 j + 1)u (2.48) C( j, u) = c(u) cos 2N Equation (2.47) is motivated by the fact that the DCT presents only real values of the coefficients and being orthogonal we have C = C∗
C−1 = Ct
(2.49)
Since the 2D DCT can be separable, the realization of the 2D DCT is made through 1D DCT transformations. In particular, with the first transform C · f we obtain a matrix N × N (intermediate result) whose columns contain the 1D DCT of f, while with the second 1D transform, applied to the intermediate result, we obtain the 2D DCT corresponding to Fc = CfCt according to (2.47). The inverse transform is obtained with f = Ct Fc C. Applying the DCT direct to an image block of N × N gives a matrix of coefficients of the same dimensions, the value of which represents the weight corresponding to the measure with which each basis image is present in the original image. The inverse DCT (2.46) shows that a block of the original image f ( j, k) can be completely reconstructed from the sum weighed (by the coefficients) of N × N basis images represented in the form π(2k + 1)v π(2 j + 1)u cos (2.50) c(u)c(v) cos 2N 2N With 8 × 8 image blocks, the indices in the spatial and frequencies domain range from 0 to 7. The generic pixel ( j, k) would be reconstructed as follows: 7 7 π(2k + 1)v π(2 j + 1)u cos (2.51) f ( j, k) = c(u)c(v)Fc (u, v) cos 16 16 u=0 v=0
The 16 and 64 basis images for blocks of 4 × 4 and 8 × 8 dimensions, respectively, are shown in Fig. 2.1. It can be seen that the origin in each block is at the top left with the increasing frequencies from left to right and vertically from top to bottom. The first basis image has a constant value, as expected, as it represents the DC component
80
2 Fundamental Linear Transforms
Fig. 2.1 Basis images 4 × 4 and 8 × 8 of the Cosine Transform; for each block the origin is in the upper left corner. It can be noted that the basis images exhibit a progressive increase in frequency both in the vertical and horizontal directions. The top left basis image assumes a constant value and is referred to as the DC coefficient. The gray represents zero, white represents positive amplitudes, and black represents negative amplitudes
corresponding to the coefficient Fc (0, 0). In Fig. 2.2 are shown the basic 3D images for blocks for N = 32 relative to the frequencies (u, v) = (1, 4) and (u, v) = (2, 6). Figure 2.3 illustrates the results of the DFT and DCT transform with the relative spectra from which their diversity emerges. In particular, it can be noted for the DCT how all the energy is concentrated in the low frequencies. In the high frequencies, the coefficients have low values, while the highest values are found in the low frequencies, to which the human eye tends to be more sensitive. A DCT property is to remove the redundancy of the information constituted by the similarity of adjacent pixels. In practice, DCT produces very uncorrelated coefficients. This property along with the reduction of significant coefficients enables the efficient quantizing of the original image by obtaining high levels of data compression. This property of DCT makes it effective in data compression applications useful for their storage and transmission. For example, DCT was adopted as the compression transform by the Working Group (Joint Photographic Experts Group) which defined the international standard called JPEG for compression and encoding of photographic images. Figure 2.4 reports the results of the DCT in this case applied by reconstructing the image with loss of data with different levels of compression. Basically, the original images can be reconstructed, considered still acceptable, from the visual point of view, although up to 80% of the coefficients have been excluded. As with DFT, the DCT can also be calculated with fast ad hoc algorithms based on FFT.
2.6 Discrete Cosine Transform (DCT)
81
DCT - Basis image 32x32 (u,v)=(2,6)
DCT - Basis image 32x32 (u,v)=(1,4)
0.1
0.05
0.05
0
0
−0.05
−0.1 40
−0.05 40 30
40
y
20
20
10
10 0
0
x
30
40 30
20
y
20
10
10 0
x
0
Fig. 2.2 Discrete Cosine Transform: 3D representation of 2 basis images 32 × 32 calculated for spectral coordinates (u, v) = (2, 6) and (u, v) = (1, 4) DCT
DFT
Fig. 2.3 Difference of the spectral distribution between Cosine Transform and Fourier Transform for a same image; the circle indicates a possible truncation of the cut-off frequencies
Original
rms 3.4%
rms 4.3%
rms 5.8%
Fig. 2.4 Application of Cosine Transform for image compression; the results are related to the decomposition in blocks of 8×8 of the original image reconstructed with only 16, 8 and 4 coefficients of the 64 with a loss of data, respectively, with rms (root mean squared error) of the 3.4%, 4.3% and 5.8%
82
2 Fundamental Linear Transforms
2.7 Discrete Sine Transform (DST) The DST is a Fourier-related transform similar to the discrete DFT, but uses a only real matrix. The DST has been defined by Jain [3] as Fs (u, v) =
N −1 N −1 π(2k + 1)(v + 1) 2 π( j + 1)(u + 1) sin f ( j, k) sin N +1 N +1 N +1 j=0 k=0
(2.52) with u, v = 0, 1, . . . , N − 1. The inverse of the DST is given by N −1 N −1 π(k + 1)(v + 1) π( j + 1)(u + 1) 2 cos Fs (u, v) sin f ( j, k) = N +1 N +1 N +1 u=0 v=0 (2.53) Also the DST is a unitary transformation whose kernel matrix is given by 2 π( j + 1)(k + 1) (2.54) S( j, k) = sin N +1 N +1 The DST is real, symmetrical, and orthogonal. Most information (energy) is contained in the low frequency area. Compared to other transforms, DST has a reduced computational load for for implementations with N = 2 p−1 where p is an integer. In this case the DST is obtained by considering the imaginary part of the FFT calculated for the particular case of (2N + 2) points. Although the direct application of DST would require O(N 2 ) operations, it is possible to compute the same thing with only O(N log N ) complexity by factorizing the computation similar to the fast Fourier transform (FFT). There are several variants of implementing the DST with slightly modified definitions. A DST can also be computed from a DCT, respectively, by reversing the order of the inputs and flipping the sign of every other output, and vice versa for DST from DCT. In this way it follows that the DST require exactly the same number of arithmetic operations (additions and multiplications) as the DCT. DST is widely used for signal processing (for example, speech reconstruction and de-noising) and image compression.
2.8 Discrete Hartley Transform (DHT) As an alternative to the Fourier transform, in 1942 Hartley [5] introduced a version of continuous integral transform applied in the context of transmission applications. Bracewell in 1986 [5] proposed a discrete unitary transform based on the Hartley transform (Discrete Hartley Transform—DHT). The DHT and its inverse, proposed by Bracewell, are N −1 N −1 1 2π f ( j, k)cas (u j + vk) (2.55) FH (u, v) = N N j=0 k=0
2.8 Discrete Hartley Transform (DHT)
f ( j, k) =
83
N −1 N −1 1 2π FH (u, v)cas (u j + vk) N N
(2.56)
u=0 v=0
where cas(θ ) = cos(θ ) + sin(θ ) =
√ √ 2 cos(θ − π/4) = 2 sin(θ + π/4)
known as the Hartley kernel. DHT and DFT have a structural analogy that can be observed by comparing the Eqs. (2.55) and (2.56) with those of Fourier when it is expressed with the complex exponential decomposed in terms of sine and cosine. Bracewell has shown that the function cas(θ ) is an orthogonal function and that DHT, while having a mathematical similarity to DFT, does not maintain the same properties as DFT. While DFT transforms real numbers into complex numbers with the property of conjugated symmetry (Hermitian redundancy), DHT still produces real numbers. In particular, in the DFT with half of the spectral domain, it is sufficient to reconstruct the input image f ( j, k), applying the inverse transform DFT, while in DHT it is necessary to consider all the domain of the DHT transform, since the information is distributed over the whole domain, not presenting redundancy and symmetry. The antisymmetry characterizing the Fourier domain is not found in the Hartley domain and each point of it is important. DHT also has a fast algorithm [6], known as FHT (Fast Hartley Transform) analogous to FFT. Stanford University has patented a version of the FHT algorithm that has become the public domain since 1994. DHT can also be expressed in matrix terms with the kernel having elements expressed by jk 1 (2.57) cas 2π H ( j, k) = √ N N The choice to use DHT or DFT, in relation to the computational convenience, depends on the type of application. In linear filtering applications, DHT is particularly convenient from a computational point of view, especially when the kernel is symmetric, since operations with complex numbers are avoided. However, the FHT and the FFT version operating with real-world input numbers are computationally comparable, and this explains the increased availability of software libraries that include FFT algorithms, and ad hoc firmware,1 implementations for some hardware devices.
1 Firmware
is a set of sequential instructions (software program) stored in an electronic device (an integrated, an electronic board,…) that implement a certain algorithm. The word consists of two terms: firm with the meaning of stable which in this case indicates the immediate modifiability of the instructions, while ware indicates the hardware component that includes a memory (read-only or rewritable for software update), computing unit and communication unit with other devices. In other words, the firmware includes the two software and hardware components.
84
2 Fundamental Linear Transforms
2.9 Transform with Rectangular Functions Let us now consider the transforms of Hadamard, Walsh, Haar, and Slant of interest in the field of digital image processing. Compared to previous transforms, based on sinusoidal functions, they differ in the use of basis functions described by the rectangular or square waves with peak values of more or less the unit. The term rectangular or square basis function derives from the concept of rectangular or square wave. The width of the rectangular impulse can be variable (see Fig. 2.5). The rectangular waves depicted in the figure are a useful mathematical idealization that can hardly model a physical reality (for example, an electrical signal or any quantity exhibiting variation in time or variation in space such as an image) due to the presence of noise. Compared to sinusoidal transforms, the transformations with rectangular basis functions have significant computational advantages.
2.9.1 Discrete Transform of Hadamard—DHaT The Hadamard transform [2,4,7] is a separable and symmetric unitary transform with the elements of the kernel matrix H N with values of +1 and −1. The general form 1D of the DHaT transform considering N = 2n samples f ( j), j = 0, 1, . . . , N − 1 is given by N −1 n−1 1 FH (u) = f ( j)(−1) i=0 bi ( j)bi (u) (2.58) N j=0
Rectangular Basis functions of various transforms N=8 Walsh-Hadamard
Haar
Slant
0 1
Sequence
2
3 4 5
6
7 0 1 2 3 4
5 6 7
0 1 2 3 4
5 6 7
0 1 2 3 4
5 6 7
j
Fig. 2.5 Basis rectangular functions of the Hadamard, Haar and Slant transforms for N = 8
2.9 Transform with Rectangular Functions
85
with u = 0, 1, . . . , N − 1; bm (r ) indicates the mth bit in the binary coding of the number r , and the sum in the exponential is made with the arithmetic module of 2. For example, for n = 3 we have that N = 8 and choosing r = 5 which has the binary coding 101 will have the following: b0 (r ) = 1, b1 (r ) = 0, b2 (r ) = 1. The resulting kernel matrix is, therefore n−1 1 (2.59) ( j)(−1) i=0 bi ( j)bi (u) N whose rows and columns are orthogonal, of dimension N × N elements, where N = 2n for n = 1, 2, 3 and satisfies the following equation:
H N ( j, u) =
H N · HtN = I
(2.60)
HtN
where I is the identity matrix of order N and is the transposed matrix of the same kernel matrix H N . The 1D inverse transform of DHaT results f ( j) =
N −1
n−1
FH (u)(−1)
i=0
bi ( j)bi (u)
(2.61)
u=0
The 2D DHaT transform is given by the following: FH (u, v) =
N −1 N −1 n−1 1 f ( j, k)(−1) i=0 bi ( j)bi (u)+bi (k)bi (v) N
(2.62)
j=0 k=0
with u = v = 0, 1, . . . , N − 1 and the other symbols have the same meaning described above. The 2D DHaT inverse transform results f ( j, k) =
N −1 N −1 n−1 1 FH (u, v)(−1) i=0 bi ( j)bi (u)+bi (k)bi (v) N
(2.63)
u=0 v=0
The minimum-sized Hadamard kernel matrix is 2 × 2 elements with N = 2 and is given by 1 1 1 (2.64) H2 = √ 2 1 −1 The H 2 kernel matrix is the generating matrix of all the N > 2 kernel matrices of Hadamard on condition that N is power with base 2. In fact, for N > 2 values the kernel matrix can be created with blocks of H i matrices of the following form: 1 H H N /2 (2.65) H N = √ N /2 N H N /2 −H N /2 For N = 4 we have the Hadamard kernel matrix of order 4 which would result 1 1 1 1 0 1 H H 2 1 1 −1 1 −1 3 (2.66) H 4 = √ 2 = N H 2 −H 2 2 1 1 −1 −1 1 1 −1 −1 1 2
86
2 Fundamental Linear Transforms
for N = 8 by virtue of (2.65) the Hadamard kernel matrix is as follows: 1 1 1 1 1 1 1 1 0 1 −1 1 −1 1 −1 1 −1 7 1 1 −1 −1 1 1 −1 −1 3 1 1 −1 −1 1 1 −1 −1 1 4 H 8 = √ 8 1 1 1 1 −1 −1 −1 −1 1 1 −1 1 −1 −1 1 −1 1 6 1 1 −1 −1 −1 −1 1 1 2 1 −1 −1 1 −1 1 1 −1 5
(2.67)
In the kernel matrix (2.66) and (2.67) the column to the right represents the number of sign changes in the corresponding row. It is observed that the number of sign changes is variable in each row and is called the row sequence. Hermite suggested an interpretation in terms of frequency for the Hadamard kernel matrix generated with the root matrix of the Eq. (2.65). The sequence of the row varies from 0 to N − 1 which characterizes the Hadamard unitary matrix. The rows of the Hadamard matrix can be imagined derived from samples of rectangular waves having a sub-period of 1/N units. If the rows are rearranged in such a way as to arrange them with the increasing sequence value, a new arrangement of the rows that are interpreted in terms of frequency will be increased. Considering this, the reordered Hadamard matrix H 8 can be derived from (2.67) by imposing the increasing sequence numbers, thus obtaining the following: 1 1 1 1 1 1 1 1 0 1 1 1 1 −1 −1 −1 −1 1 1 1 −1 −1 −1 −1 1 1 2 1 1 1 −1 −1 1 1 −1 −1 3 (2.68) H8 = √ 8 1 −1 −1 1 1 −1 −1 1 4 1 −1 −1 1 −1 1 1 −1 5 1 −1 1 −1 −1 1 −1 1 6 1 −1 1 −1 1 −1 1 −1 7 Recall the numbers shown in the column adjacent to the two Hadamard matrices H 4 and H 8 are called the sequence of a column of the matrix and each indicates the number of sign changes occurring in the column. For example, for the matrix H 4 it can be directly verified that the sequences of each column correspond, respectively, to 0, 3, 1, 2. For the Hadamard transform is improper to speak about the frequency domain, because basis functions can hardly be associated with a frequency concept as was the case with sinusoidal functions. One could associate the concept of frequency for the variable u with that of the sequence that increases as u increases. Figure 2.5 shows the 1D basis images of Hadamard (for N = 8) with the corresponding sequence that grows in analogy with the sinusoidal functions arranged with increasing frequency. The spatial coordinate j is shown horizontally. Now let’s see how to characterize the kernel to realize what is represented graphically, that is, to say that the frequency
2.9 Transform with Rectangular Functions
87
u grows according to the growing sequence. For this purpose, the 1D DHaT kernel matrix is expressed as follows: n−1 1 (2.69) ( j)(−1) i=0 bi ( j) pi (u) N where bi ( j) has already been defined previously while pi (u) is defined as follows:
H N ( j, u) =
p0 (u) = bn−1 (u) p1 (u) = bn−1 (u) + bn−2 (u) p2 (u) = bn−2 (u) + bn−3 (u) .. .
(2.70)
pn−1 (u) = b1 (u) + b0 (u)
Also for these expressions the sums are intended with arithmetic module of 2. The inverse kernel matrix of the remodulated Hadamard H N−1 is identical to the direct one (2.69) and the 1D direct and inverse transform DHaT are rewritten as follows: FH (u) =
N −1 n−1 1 f ( j)(−1) i=0 bi ( j) pi (u) N
(2.71)
j=0
f ( j) =
N −1
n−1
FH (u)(−1)
i=0
bi ( j) pi (u)
(2.72)
u=0
We can rewrite the DHaT and kernel matrix equations (remembering that the direct and the inverse are identical) for the 2D context that are N −1 N −1 n−1 1 f ( j, k)(−1) i=0 bi ( j) pi (u)+bi (k) pi (v) N
(2.73)
N −1 N −1 n−1 1 FH (u, v)(−1) i=0 bi ( j) pi (u)+bi (k) pi (v) N
(2.74)
FH (u, v) =
j=0 k=0
f ( j, k) =
u=0 v=0
Figure 2.6 shows the ordered basis images of DHaT for N = 4 and N = 8. They are generated by the 1D DHaT basis functions and each basis image is obtained with the outer product of the corresponding 1D bases (the constant term 1/N was not considered). In analogy to the basis images of the DCT, the origin of the frequencies (u, v) are always at the top left with the spatial variables (i, j) that vary between 0 and N − 1. In this case, the values of the basis images +1 and −1 correspond, respectively, to the white and the black. For example, the 2D basis image, with sequences (u, v) = (1, 3), indicated with red rectangle in Fig. 2.6, is obtained from the outer product of the 1D basis functions (the second row H8 (1) and the fourth
88
2 Fundamental Linear Transforms
Fig. 2.6 Ordered basis images 4 × 4 and 8 × 8 of Hadamard Transform; for each block the origin is in the upper left corner
row H8 (3) of the ordered Hadamard matrix (2.68)), given by 1 1 1 1 −1 −1 1 1 1 1 −1 −1 −1 −1 −1 −1 1 1 −1 −1 −1 −1 1 1 H 8 (3, 1) = H 8 (1) H 8 (3)t = 1 1 1 1 −1 −1 1 1 1 1 −1 −1 −1 −1 −1 −1 1 1 −1 −1 −1 −1 1 1
−1 −1 −1 −1 1 1 1 1 −1 −1 −1 −1 1 1 1 1
(2.75)
In the first column in Fig. 2.7 is showed the DHaT spectrum, in logarithmic scaled format for better visualization and the completely reconstructed image by applying the DHaT inverse transform (2.74). It is observed, unlike transformation with sinusoidal basis, the non-concentrated distribution of the spectrum. The following columns display the spectrum and the relative image reconstructed after the coefficients have been filtered with absolute value of the module lower than a certain threshold. The results of the reconstructed images are shown by eliminating the coefficients below the various thresholds of the absolute value of the magnitude with the corresponding mean squared error of the reconstructed image and the percentage of zeroed coefficients. The computational load required by DHaT is lower than the sinusoidal transforms and in an optimized version the computational complexity from O(N 2 ) is reduced to only O(N log N ) additions and subtractions, where N is the order of the kernel matrix.
2.9 Transform with Rectangular Functions
89
Hadamard transform Full spectrum
ABS(magnitude) < 20
ABS(magnitude) < 30
Reconstructed image
ABS(magnitude) < 10
Original
rms 2.85%
rms 4.7%
rms 6.4%
Fig. 2.7 Application of Hadamard transform; the results are obtained by reconstructing the image by filtering the coefficients with thresholds of the absolute value of the always larger amplitude equal to 10, 20 and 30 thus obtaining a loss of data, respectively, with rms of 2.85%, 4.7%, and 6.4% and a corresponding percentage of zero coefficients of 55.5%, 81%, and 89.8%
2.9.2 Discrete Transform of Walsh (DWHT) The Hadamard matrix can be generated by sampling a class of rectangular functions called Walsh functions [3,8]. These functions assume binary values of +1 or −1 and form complete orthonormal basis functions. The Hadamard transform, for such reasons, in literature is also called Walsh–Hadamard’s transformation. In Walsh’s functions, also expressed as rectangular functions, the value of the sequence defined in the Hadamard matrix corresponds to the number of zero crossing observed in the rectangular wave. The 1D direct and inverse transform of Walsh– Hadamard of image f (i, j) is given by the following equations: N −1 n−1 1 f ( j) (−1)bi ( j)bn−1−i (u) N
(2.76)
N −1 n−1 1 FH W (u) (−1)bi ( j)bn−1−i (u) N
(2.77)
FH W (u) =
j=0
f ( j) =
i=0
u=0
i=0
with u, j = 0, 1, . . . , N −1; bm (r ) indicates the mth bit in the binary encoding of the number r , and N = 2n The direct and inverse DWHT kernel matrices are identical and are given by n−1 1 (−1)bi ( j)bn−1−i (u) (2.78) W N ( j, u) = N i=0
90
2 Fundamental Linear Transforms
Fig.2.8 Application of the Walsh–Hadamard transform; the results are related to the decomposition into blocks of 8 × 8 of the original image with the image blocks reconstructed with different cut-off sequences. The first row shows the complete spectrum of the image; they follow the block spectra with cutting sequences s1 < 8 s2 < 6, s3 < 4 and s4 < 3. The second row shows the reconstructed images with the relative cut-off sequences mentioned
with the characteristic of being a symmetrical matrix of dimensions N × N , whose columns and rows are orthogonal vectors. The DWHT transform extended to the 2D context is immediate, and we can write the relative equations of the direct and inverse transform as follows: N −1 N −1 n−1 1 f ( j, k) (−1)[bi ( j)bn−1−i (u)+bi (k)bn−1−i (v)] (2.79) FH W (u, v) = N j=0 k=0
f ( j, k) =
i=0
N −1 N −1 n−1 1 FH W (u, v) (−1)[bi ( j)bn−1−i (u)+bi (k)bn−1−i (v)] N u=0 v=0
(2.80)
i=0
u, v, j, k = 0, 1, . . . , N − 1; bm (r ) indicates the mth bit in the binary encoding of the number r , and N = 2n . Another representation of the Walsh–Hadamard transform would be to consider the Hadamard matrix in the ordered form as indicated in (2.68). Figure 2.5 shows the one-dimensional basis functions of the Walsh–Hadamard transform for N = 8. In Fig. 2.6 are shown the basic images of the Walsh–Hadamard transform for N = 4 and N = 8 (+1 represents white and −1 black). Figure 2.8 shows the results of the Walsh–Hadamard transform applied to the same test image. The concentration of the significant coefficients toward the low values of the sequences can be observed, unlike the Hadamard spectrum obtained with the unordered sequences (see Fig. 2.7). The transform was applied on 8 × 8 pixel image blocks and the image reconstruction was performed including only the coefficients corresponding to the values of the sequences (u, v) included in the circle of radius lower than the threshold value s0 (cut-off sequence) given by (2.81) s0 < u 2 + v2
2.9 Transform with Rectangular Functions
91
The figure shows a magnification of a portion of 8 × 8 spectra with highlighted circles that include the coefficients used for the reconstruction of the image blocks to the various cut-off sequences (expressed in pixels). The reconstruction errors and the percentage of excluded coefficients at the different cut-off sequences are also highlighted.
2.9.3 Slant Transform We know that the process of the transforms is to project the input image f ( j, k) in the basis images T ( j, k, u, v) each of them is characterized by the frequency values (u, v) in the transform domain and the values ( j, k) in the input spatial domain (Eq. 2.18). The output values of the transform, i.e., the coefficients F(u, v) are the projections of f ( j, k) in the basis images T ( j, k, u, v) and the informative content of these coefficients indicates how similar is the input image with respect to each basis image. High values of the coefficients indicate a high similarity. In other words, the transformation process calculates the level of decomposition of the image as the weighted sum of the basis image, where the coefficients F(u, v) are precisely the weights (see Eq. 2.30). All the transforms considered have calculated the coefficients FX (u, v) which informs us to what degree the information of the input image is present on each basis image. The information contained in the coefficients FX (u, v) is of global type and consequently the local information of the source space is lost. If, for example, in an image there are different types of cars, any previous linear transform encodes the image in various coefficients that contain a global information of the scene observed losing the spatial characteristics. Hence the need to have linear transforms that tend to preserve, in some way, also the local information of the source domain. A first approach to partially solving this problem is to design images or basis functions that do not have constant values, such as rectangular functions, but to define basis images with values that vary linearly. The slant transform [2,4] is an orthogonal linear transform with the following characteristics: in addition to the constant basis functions, it also contains basis slant functions, i.e., which decrease monotonously at constant traits from a maximum value to a minimum. With this type of basis images the coefficients of the slant transform capture the informative content present in many images rich in structures that vary linearly in intensity (zones of linear discontinuity present in the images, for example, in the edges). The unitary Kernel matrix S1 , of the slant transform, is obtained by considering the H 2 Hadamard matrix (see Eq. 2.64), or as we will see that of Haar. This matrix is given by 1 1 1 S1 = H 2 = √ (2.82) 2 1 −1
To obtain higher order kernel matrices of the Slant–Hadamard transform we proceed by iterating (2.82) recursively according to the following relation:

S_n = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 & 0 & 0_{1\times p} & 1 & 0 & 0_{1\times p}\\
a_n & b_n & 0_{1\times p} & -a_n & b_n & 0_{1\times p}\\
0_{p\times 1} & 0_{p\times 1} & I_p & 0_{p\times 1} & 0_{p\times 1} & I_p\\
0 & 1 & 0_{1\times p} & 0 & -1 & 0_{1\times p}\\
-b_n & a_n & 0_{1\times p} & b_n & a_n & 0_{1\times p}\\
0_{p\times 1} & 0_{p\times 1} & I_p & 0_{p\times 1} & 0_{p\times 1} & -I_p
\end{bmatrix}
\begin{bmatrix}
S_{n-1} & 0\\
0 & S_{n-1}
\end{bmatrix}
    (2.83)
where N = 2^n, n = 1, 2, . . . denotes the order of the matrix S_n, p = N/2 − 2, and I_p denotes an identity matrix of order p. The parameters a_n and b_n are defined by the following recursive relation:

a_{n+1} = \left(\frac{3N^2}{4N^2-1}\right)^{1/2}, \qquad b_{n+1} = \left(\frac{N^2-1}{4N^2-1}\right)^{1/2}    (2.84)
The Slant–Hadamard kernel matrix of order N = 2^1 = 2 is given by (2.82); the matrix S_2 of order N = 2^2 = 4, given S_1, is obtained from (2.83) and (2.84) as follows:

S_2 = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 & 0 & 1 & 0\\
a_2 & b_2 & -a_2 & b_2\\
0 & 1 & 0 & -1\\
-b_2 & a_2 & b_2 & a_2
\end{bmatrix}
\begin{bmatrix}
S_1 & 0\\
0 & S_1
\end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
1 & 1 & 1 & 1\\
3/\sqrt{5} & 1/\sqrt{5} & -1/\sqrt{5} & -3/\sqrt{5}\\
1 & -1 & -1 & 1\\
1/\sqrt{5} & -3/\sqrt{5} & 3/\sqrt{5} & -1/\sqrt{5}
\end{bmatrix}
    (2.85)
The rows of the matrix are ordered by increasing sequence number (0, 1, 2, 3 from top to bottom). Figure 2.9a shows the basis images of the slant transform for N = 4, and Fig. 2.10 shows the results of the slant transform applied to the same test image used for
Fig. 2.9 a Basis images 4 × 4 and 8 × 8 of the slant transform and b Basis images 8 × 8 of the Haar transform; for each block the origin is in the upper left corner
[Fig. 2.10 panels: slant transform full spectrum, block spectra, and reconstructed images; rms errors 6.75%, 7.37%, 9.42%, 9.78%]

Fig. 2.10 Application of the slant transform; the results are related to the decomposition into blocks of 8 × 8 dimensions of the original image with the reconstructed image blocks with different cut-off sequences. The first line shows the full spectrum of the image; they are followed by the block spectra with cut-off sequences s1 < 8, s2 < 6, s3 < 4 and s4 < 3. The second line shows the images reconstructed with the corresponding cut-off sequences
the previous transforms. As with the Walsh–Hadamard transform, the image was processed by decomposing it into blocks of 8 × 8 pixels, and the same method of image reconstruction was applied. For the slant transform as well there is an implementation that optimizes the computational load. Other features of the slant transform: it is real and orthogonal, i.e., S_i = S_i^* and S_i^{-1} = S_i^t, and it shows, like the previous transforms, a high compaction of the energy toward the low frequencies. The slant transform is also widely used for data compression applications.
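The recursion of Eqs. (2.82)–(2.84) can be sketched as follows (NumPy assumed; the function name is illustrative). For n = 2 it reproduces the matrix of Eq. (2.85) and confirms orthogonality.

import numpy as np

def slant_matrix(n):
    """Build the N x N Slant kernel matrix, N = 2**n, by the recursion of Eq. (2.83)."""
    S = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)   # S_1, Eq. (2.82)
    for _ in range(1, n):
        M = S.shape[0]                                # order of the current matrix S_{n-1}
        a = np.sqrt(3 * M**2 / (4 * M**2 - 1))        # a_n, Eq. (2.84)
        b = np.sqrt((M**2 - 1) / (4 * M**2 - 1))      # b_n, Eq. (2.84)
        p = M - 2                                     # size of the identity blocks (N/2 - 2)
        B = np.zeros((2 * M, 2 * M))                  # the mixing matrix of Eq. (2.83)
        B[0, [0, M]] = [1, 1]
        B[1, [0, 1, M, M + 1]] = [a, b, -a, b]
        B[M, [1, M + 1]] = [1, -1]
        B[M + 1, [0, 1, M, M + 1]] = [-b, a, b, a]
        if p > 0:
            B[2:M, 2:M] = np.eye(p)
            B[2:M, M + 2:] = np.eye(p)
            B[M + 2:, 2:M] = np.eye(p)
            B[M + 2:, M + 2:] = -np.eye(p)
        S = (1 / np.sqrt(2)) * B @ np.block([[S, np.zeros_like(S)],
                                             [np.zeros_like(S), S]])
    return S

S4 = slant_matrix(2)
print(np.round(S4, 4))                      # reproduces the matrix in Eq. (2.85)
print(np.allclose(S4 @ S4.T, np.eye(4)))    # orthogonality check: True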
2.9.4 Haar Transform

This transform, like the slant transform discussed above, also preserves local information of the image. In fact, while the basis images of the Fourier transform differ in frequency, the basis images of Haar vary in size (scale) and in position. It can be seen in Fig. 2.9b how the Haar basis functions vary in scale and position, and it is evident how different they are from the basis functions discussed so far. This dual scale–position nature will be further deepened with the wavelet functions, which will be discussed in Sect. 2.12. The Haar transform [2,9] is symmetric, separable, and unitary. The Haar functions h_k(x) are defined in the continuous interval x ∈ [0, 1] with k = 0, 1, . . . , N − 1 and N = 2^n. To control the two aspects of variability (position and scale) of the Haar functions it is necessary to introduce two indices p and q to define each of the k functions:

k = 2^p + q − 1    (2.86)
where 0 ≤ p ≤ n − 1; q = 0 or 1 for p = 0, and 1 ≤ q ≤ 2^p for p ≠ 0. For example, if N = 4 the indices vary as follows:

k = 0:  p = 0, q = 0
k = 1:  p = 0, q = 1
k = 2:  p = 1, q = 1
k = 3:  p = 1, q = 2
If we consider k decomposed into the indices (p, q), the Haar functions are defined as follows:

h_0(x) ≡ h_{0,0}(x) = \frac{1}{\sqrt{N}}, \qquad x ∈ [0, 1]    (2.87)

h_k(x) ≡ h_{p,q}(x) = \frac{1}{\sqrt{N}}
\begin{cases}
2^{p/2} & \frac{q-1}{2^p} \le x < \frac{q-1/2}{2^p}\\
-2^{p/2} & \frac{q-1/2}{2^p} \le x < \frac{q}{2^p}\\
0 & \text{otherwise}
\end{cases}    (2.88)
To obtain discrete values of x we consider x = i/N, with i = 0, 1, . . . , N − 1; we thus obtain a set of Haar functions, each of which is an odd rectangular pulse, with the exception of k = 0, which is constant as for the transforms discussed so far. We have already seen in Fig. 2.9b that the basis functions vary in scale (index p) and in position (index q). For N = 8 the Haar matrix, denoted by H_r, is given by

H_r = \frac{1}{\sqrt{8}}
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1\\
\sqrt{2} & \sqrt{2} & -\sqrt{2} & -\sqrt{2} & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & \sqrt{2} & \sqrt{2} & -\sqrt{2} & -\sqrt{2}\\
2 & -2 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 2 & -2 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 2 & -2 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 2 & -2
\end{bmatrix}
    (2.89)

with the rows corresponding, from top to bottom, to the sequence numbers 0–7.
The Haar transform reduces the computational complexity, considering that there are many zero values in the kernel matrix (2.89). Figure 2.9b shows the Haar basis images for N = 8, and it can be seen how the basis images at the bottom right are useful for finding small structures in different positions in the source image. Figure 2.11 presents the result of the Haar transform applied to the same test image of the previous transforms, applying the same image reconstruction modes (block image decomposition and various cut-off sequences) used for the Walsh and slant transforms. From (2.89) it is noted that the basis functions are sequentially ordered. From the analysis of the results we observe, for the Haar transform, a lower spectral energy compaction than for the previous transforms. Summarizing, the Haar transform exhibits poor energy compactness and a fast implementation, and it has the property of operating on real numbers and being orthogonal (H_r = H_r^*; H_r^{-1} = H_r^t).
[Fig. 2.11 panels: Haar transform full spectrum, block spectra, and reconstructed images; rms errors 6.22%, 6.29%, 7.42%, 8.44%]
Fig. 2.11 Application of the Haar transform; the results are related to the decomposition into blocks of 8 × 8 of the original image, with the image blocks reconstructed with different cut-off sequences. The first line shows the full spectrum of the image, followed by the block spectra with cut-off sequences s1 < 8, s2 < 6, s3 < 4 and s4 < 3. The second line shows the images reconstructed with the corresponding cut-off sequences
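A direct construction of the Haar matrix from Eqs. (2.86)–(2.88), sampled at x = i/N, can be sketched as follows (NumPy assumed; the function name is illustrative). For N = 8 it reproduces the matrix of Eq. (2.89) and confirms orthogonality.

import numpy as np

def haar_matrix(N):
    """Haar kernel matrix H_r of order N = 2**n, built from Eqs. (2.86)-(2.88)."""
    n = int(np.log2(N))
    x = np.arange(N) / N                      # discrete samples x = i/N
    H = np.zeros((N, N))
    H[0, :] = 1.0 / np.sqrt(N)                # h_0(x), Eq. (2.87)
    for p in range(n):
        for q in range(1, 2**p + 1):
            k = 2**p + q - 1                  # Eq. (2.86)
            lo, mid, hi = (q - 1) / 2**p, (q - 0.5) / 2**p, q / 2**p
            row = np.zeros(N)
            row[(x >= lo) & (x < mid)] = 2**(p / 2)
            row[(x >= mid) & (x < hi)] = -2**(p / 2)
            H[k, :] = row / np.sqrt(N)        # Eq. (2.88)
    return H

Hr = haar_matrix(8)
print(np.round(Hr * np.sqrt(8), 3))           # compare with the matrix in Eq. (2.89)
print(np.allclose(Hr @ Hr.T, np.eye(8)))      # orthogonality: True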
2.10 Transform Based on the Eigenvectors and Eigenvalues

In many applications it is convenient to use linear transforms whose basis functions are derived from the analysis of the eigenvalues associated with matrices. The basic concepts of the eigenvalue problem are as follows. Consider a matrix A = {a_jk} of dimensions N × N and the following vector equation:

A x = λ x    (2.90)
where λ is a real number and x is an N-dimensional vector. It is evident that the vector x = 0 is a solution of Eq. (2.90) for any value of λ. A value of λ for which (2.90) admits a solution with x ≠ 0 is called an eigenvalue (from the German word Eigenwert, composed of eigen, meaning own or specific, and Wert, meaning value), or characteristic value, or root of the matrix A. The solution vector x ≠ 0 of (2.90) is instead called the eigenvector, or characteristic vector, of the matrix A corresponding to that eigenvalue λ. The set of eigenvalues is called the spectrum of the matrix A. The set of eigenvectors (including the zero vector) associated with an eigenvalue λ of the matrix A forms a vector space called the eigenspace of A corresponding to that eigenvalue λ. The problem of determining the eigenvalues and eigenvectors of a matrix is generally called in the literature the eigenvalue problem (or analysis), or rather the algebraic eigenvalue problem, to distinguish it from the corresponding problems for differential and integral equations. The eigenvalue problem expressed by Eq. (2.90) is immediately applicable to the definition of important linear transformations, using the eigenvectors as orthonormal basis vectors. Eigenvectors corresponding to distinct eigenvalues form in general a
linearly independent set. In general, a matrix A of dimensions N × N has at least one eigenvalue and at most N distinct eigenvalues (real or complex). This is easily seen by rewriting the vector Eq. (2.90) in the following form of a homogeneous system of linear equations:

(a_{11} − λ)x_1 + a_{12}x_2 + · · · + a_{1N}x_N = 0
a_{21}x_1 + (a_{22} − λ)x_2 + · · · + a_{2N}x_N = 0
· · ·
a_{N1}x_1 + a_{N2}x_2 + · · · + (a_{NN} − λ)x_N = 0    (2.91)
Equation (2.91) in matrix form becomes (A − λI) · x = 0
(2.92)
where I is the identity matrix of the same dimensions as A. It is known, by Cramer's theorem, that the homogeneous system of linear equations (2.92) admits a nontrivial solution if and only if the determinant of the coefficients is zero, namely

D(λ) = det(A − λI) =
\begin{vmatrix}
a_{11}-\lambda & a_{12} & \dots & a_{1N}\\
a_{21} & a_{22}-\lambda & \dots & a_{2N}\\
\vdots & \vdots & \ddots & \vdots\\
a_{N1} & a_{N2} & \dots & a_{NN}-\lambda
\end{vmatrix} = 0    (2.93)

where D(λ) is called the characteristic determinant and Eq. (2.93) is called the characteristic equation corresponding to the matrix A. The development of this equation generates a polynomial of order N whose unknowns are the eigenvalues λ. This leads to the following result: the eigenvalues of a square matrix A of order N are the roots of the corresponding characteristic Eq. (2.93). Having calculated the eigenvalues of A, it is possible to determine for each of them the corresponding eigenvectors from the equation system (2.91), where λ is the eigenvalue for which an eigenvector x is sought. We can then generalize: indicating with λ_k, k = 1, 2, . . . , N, the eigenvalues of a square matrix A of order N, Eq. (2.93) yields |A − λ_k I| = 0
(2.94)
The N eigenvalues λk will correspond to the set of the N eigenvectors that satisfy the Eq. (2.90) of the eigenvalue problem that can be rewritten as follows: Axk = λk xk
(2.95)
where x_k indicates the N eigenvectors of the matrix A which, as previously noted, form the orthonormal basis vectors. An eigenvector corresponding to an eigenvalue is determined up to a multiplicative constant c ≠ 0, i.e., if x is an eigenvector of A, then c · x is also an eigenvector of A corresponding to the same eigenvalue. The eigenvectors x_k can be calculated by solving Eq. (2.92) with various methods (for example, Gaussian elimination). It can be shown that if λ_k are the N distinct eigenvalues of a matrix A of order N, the corresponding eigenvectors x_k form a set of linearly independent vectors, and it follows that A has the eigenvectors x_k as basis vectors of the N-dimensional space. In the following paragraphs, we will use some linear transformations based on the eigenvectors. Through the concepts of eigenvalues and eigenvectors, we want to
find a possible basis (if it exists) with which a linear transformation (or a matrix) has a form as simple as possible, for example, a diagonal form. In essence, an important question to ask is the following: is there a basis for which the matrix associated with a linear transformation has a diagonal form? The answer, even if not always affirmative, depends on the vectors of which the transformation matrix is constituted, which, as we shall see, must be eigenvectors. A linear transformation F, defined in an N-dimensional space and associated with a matrix A_F, is said to be diagonalizable if there is a basis B (of N dimensions) in which the matrix A_F is a diagonal matrix. Similarly, a matrix A is said to be diagonalizable if there is an invertible matrix P such that P^{-1}AP is diagonal. Given this, it can be shown that a linear transformation F associated with the matrix A is diagonalizable if and only if A is diagonalizable. It follows that, if the transformation F and the associated matrix A are diagonalizable, the basis that diagonalizes F corresponds to the columns of the matrix P such that P^{-1}AP is diagonal. The question posed above, whether there are vectors forming a basis B in which a given linear transformation is represented by a diagonal matrix, is answered affirmatively when the basis B consists of eigenvectors according to (2.95). The change of basis of a linear transformation associated with the matrix A to the new basis B, through the basis change matrix P, imposes the following relation:

A = P B P^{-1}, that is B = P^{-1} A P
(2.96)
and the two matrices A and B are said to be similar if the non-singular matrix P exists. Two similar matrices can be shown to have the same eigenvalues. If, moreover, P^t P = I, with the columns of P orthonormal basis vectors, the transformation is called unitary, as described in the presentation of the linear transformations in the previous paragraphs. A matrix A of order N is diagonalizable if and only if it has N linearly independent eigenvectors. Moreover, the columns of the matrix P for which P^{-1}AP is diagonal are precisely the eigenvectors of A. Symmetric matrices (that is, A^t = A) are diagonalizable even with nondistinct eigenvalues, and have the property of having real eigenvalues. For a symmetric matrix A of dimensions N × N it is useful to consider the quadratic form x^t A x, with x = (x_1, . . . , x_N), and to express A in terms of its eigenvalues as follows:

A = P Λ P^t = P \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{bmatrix} P^t    (2.97)

that is,

x^t A x = x^t P Λ P^t x = y^t Λ y    (2.98)

having set y = P^t x, and therefore

x^t A x = [y_1, . . . , y_N] \, Λ \, [y_1, . . . , y_N]^t = λ_1 y_1^2 + · · · + λ_N y_N^2    (2.99)
and it can be shown that the quadratic form x^t A x is positive for every x ≠ 0 if and only if the eigenvalues of A are all positive (i.e., A is positive definite).
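These facts are easy to verify numerically. The following sketch (NumPy assumed; the matrix values are illustrative) diagonalizes a small symmetric matrix and checks Eqs. (2.97)–(2.99).

import numpy as np

# A small symmetric matrix: eigendecomposition A = P Lambda P^t (Eqs. 2.95-2.97)
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, P = np.linalg.eigh(A)        # eigenvalues (ascending) and orthonormal eigenvectors (columns)
Lambda = np.diag(lam)

print(np.allclose(P.T @ A @ P, Lambda))      # diagonalization: P^t A P is diagonal
print(np.allclose(P @ Lambda @ P.T, A))      # reconstruction A = P Lambda P^t, Eq. (2.97)

# Quadratic form x^t A x = sum_i lambda_i y_i^2 with y = P^t x, Eqs. (2.98)-(2.99)
x = np.array([1.0, -2.0, 0.5])
y = P.T @ x
print(np.isclose(x @ A @ x, (lam * y**2).sum()))   # True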
2.10.1 Principal Component Analysis (PCA)

PCA is a statistical method for simplifying data, based on the eigenvalues of the covariance matrix. The first works (1901) are due to Pearson [10]; in 1933 the method was developed by the psychologist Hotelling [11], who introduced it under the name of Principal Components Analysis (in some cases it is also known as the Hotelling transform). Subsequently, Karhunen [12] and then Loève [13] developed analogous transformations to obtain from continuous signals a set of significant and uncorrelated coefficients. All these transformations, based on the analysis of the eigenvectors that generate uncorrelated significant coefficients of the original signal, are called in the literature Principal Component Analysis (PCA) [14], although in the international literature the name KLT (Karhunen–Loève Transform) has recently become established. The unitary transforms (with orthonormal basis) treated so far (DFT, DCT, DST, DHaT, DWT, Haar, and slant) are independent of the input data and characterized by a transformation matrix that projects the data into a new space with the goal of concentrating the information as much as possible in a few significant coefficients. Basically, these orthogonal transformation matrices introduce a rotation of the data in the new space, and the different results of each transformation were characterized in relation to this rotation. With the KLT transform, based on the concept of eigenspace, it is shown that it is possible to realize a unitary transformation with two fundamental characteristics: complete decorrelation of the input data and maximum possible compaction of their information content. A PCA transform has the general form

F_{PCA}(u, v) = \sum_{j=0}^{N-1}\sum_{k=0}^{N-1} S(j, k)\, A(j, k; u, v), \qquad u, v = 1, 2, . . . , N    (2.100)
where S(j, k) is the input image and the kernel matrix of the transform, A(j, k; u, v), must satisfy, according to (2.95), the following eigenvalue equation:

\lambda(u, v)\, A(j, k; u, v) = \sum_{l=0}^{N-1}\sum_{m=0}^{N-1} K_{PCA}(j, k; l, m)\, A(l, m; u, v)    (2.101)
where K PC A ( j, k; l, m) indicates the covariance matrix of the image S( j, k) and λ(u, v) is a constant value for a fixed (u, v). The set of functions defined by the Kernel matrix in (2.101) are the eigenvectors of the covariance matrix K PC A and λ(u, v) represent the eigenvalues of the covariance matrix. The Kernel matrix of the transform cannot always be expressed explicitly. The PCA direct transformation FPC A , given by Eq. (2.100), can be expressed in vector form F PC A = A · S
(2.102)
together with its inverse S = At F PC A
(2.103)
where A represents the kernel matrix that transforms the input image S(j, k), both of the same size N × N. This matrix must satisfy the relationship of the eigenvalue problem

K_S A = A Λ    (2.104)

where K_S is the covariance matrix, of size N² × N², of the image S, and the matrix A is constructed by inserting in each column the eigenvectors of the covariance matrix K_S. Λ is the diagonal matrix (a matrix with all zeros except the diagonal elements) of the eigenvalues λ_1, λ_2, . . . , λ_{N²}. In other words, the calculation of the transform F_PCA reduces to the diagonalization of the covariance matrix K_S of the image S. According to (2.96) the matrix Λ is obtained as follows:

Λ = A^{-1} K_S A    (2.105)
where the eigenvalues of K_S are the elements of the main diagonal of Λ, and the basis eigenvectors of K_S are inserted as rows of the matrix A. Recall that the covariance matrix represents a statistical property of the image, thought of as derived from a stochastic process. In this context, the covariance matrix is calculated assuming M images S = {S_1, S_2, . . . , S_M}, each of size N × N. For the calculation of K_S it is useful to consider the ith image as a one-dimensional signal represented by the vector S_i = (S_{i1}, S_{i2}, . . . , S_{iN²}), where S_{ij} denotes the jth pixel of the image-vector S_i. With these notations, the covariance matrix K_S of dimensions N² × N² is defined as follows:

K_S = E\{(S − μ_S)(S − μ_S)^t\} = \frac{1}{M}\sum_{i=1}^{M} (S_i − μ_S)(S_i − μ_S)^t = \frac{1}{M}\sum_{i=1}^{M} S_i S_i^t − μ_S μ_S^t    (2.106)
with

μ_S = E\{S\} ≅ \frac{1}{M}\sum_{i=1}^{M} S_i    (2.107)
where μ S indicates the average vector with dimensionality of N 2 , while E denotes the symbol of the expected value. Recall that the elements of the main diagonal of K S are the variance of single images Si while all the other elements represent their covariance. For example, the element K S (i, j) indicates the covariance between the vectors Si and S j . The K S elements are real and the matrix is symmetric. The diagonalization of the K S matrix realized with the Eq. (2.105) calculates its eigenvalues λi , i = 1, 2, . . . , N 2 and the N 2 eigenvectors ai = (ai1 , ai2 , . . . , ai N 2 )t that are loaded as rows in the transformation matrix A. The PCA transform can be calculated centered with respect to the average values of the input images Si and the Eq. (2.102) can be rewritten in the following version: F PC A = A(S − μS )
(2.108)
Normally the transformation matrix A is organized so that, after the diagonalization process, the eigenvalues λ_i appear in descending order on the diagonal matrix Λ. This implies that the new components of the image S transformed into F are ordered in descending order of importance with respect to the information content. Basically, the variance of the transformed components is decreasing. If the covariance matrix K_S is separable, instead of solving (2.105) for N² × N² matrices, the eigenvalue problem is reduced to two matrices of size N × N, thus obtaining a lower computational complexity. Returning to the transform F_PCA given by (2.108), the new vectors F_PCA describe random vectors whose covariance is given by

K_F = E\{(F_{PCA} − μ_F)(F_{PCA} − μ_F)^t\}    (2.109)

where the mean μ_F is zero, as can be shown by considering Eqs. (2.107) and (2.108). Indeed

μ_F = E\{F_{PCA}\} = E\{A(S − μ_S)\} = A E\{S\} − A μ_S = 0    (2.110)

It follows that the covariance matrix of the transform F_PCA is expressed in terms of K_S by the following relation:

K_F = E\{(F_{PCA} − μ_F)(F_{PCA} − μ_F)^t\} = E\{A(S − μ_S)(S − μ_S)^t A^t\} = A E\{(S − μ_S)(S − μ_S)^t\} A^t = A K_S A^t    (2.111)

Equation (2.111) tells us that the covariance matrix K_F is a diagonal matrix, in analogy with Eq. (2.105), remembering that A is the eigenvector matrix of K_S and is orthogonal, for which A^{-1} = A^t. Substantially, K_F is the analogue of the diagonal matrix Λ calculated for K_S, and its main diagonal elements are the eigenvalues λ_k for k = 1, 2, . . . , N². Since all the other elements of K_F are zero, it follows that the components of F_PCA are completely decorrelated. In this case, each eigenvalue λ_i represents the variance of the ith component of the image S in the direction of the eigenvector a_i. In other words, the PCA transform eliminates all the correlations between the images seen as stochastic data. With the inverse PCA transform (2.103) it is possible to reconstruct the image S; considering the orthogonality of the eigenvectors, from (2.108) we obtain S = A^t F_{PCA} + μ_S = A^{-1} F_{PCA} + μ_S
(2.112)
The PCA transform, being a unitary transformation whose basis vectors are the orthogonal and normalized eigenvectors (a_i^t a_i = 1) of the covariance matrix K_S, corresponding to the eigenvalues λ_i and satisfying the fundamental eigenvector–eigenvalue equation (2.101), also realizes the condition of optimal transformation in terms of energy compaction, i.e., of the information content of the input data. Furthermore, the energy contained in the input data is completely preserved. In fact, the norm of the input and output vectors remains unchanged, ‖S‖² = ‖F‖², as does the energy E before and after the transformation, resulting in E_F = tr(K_F) = tr(A K_S A^t) = tr(A^t A K_S) = tr(K_S) = E_S
(2.113)
having recalled the cyclic (commutative under the trace) property of the trace of square matrices, and that the traces of the covariance matrices K_S and K_F represent the total energy of the image S before and after the PCA transform:

E_F = tr(K_F) = \sum_{i=1}^{M} V(F_{PCA_i}) = \sum_{i=1}^{M} \sigma^2_{F_{PCA_i}} = tr(K_S) = \sum_{i=1}^{M} V(S_i) = \sum_{i=1}^{M} \sigma^2_{S_i} = E_S    (2.114)

where V denotes the variance of the random variables S and F_PCA. The traces of the two covariance matrices correspond to the sums of the variances of the components of the input and output image, i.e., the sum of the dispersions along the coordinate axes, with the dispersion optimally compacted by the PCA into the first principal components of the output.
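The decorrelation (2.111) and the energy conservation (2.113)–(2.114) can be checked numerically. The following sketch, on hypothetical correlated data and with illustrative names, verifies that the covariance after the transform is diagonal and that its trace equals that of the input covariance.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: M = 1000 samples of dimension 6
S = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))
X = (S - S.mean(axis=0)).T                     # 6 x M matrix of centered samples
K_S = X @ X.T / X.shape[1]                     # input covariance matrix K_S
lam, P = np.linalg.eigh(K_S)
A = P[:, np.argsort(lam)[::-1]].T              # eigenvectors as rows (transformation matrix A)
F = A @ X                                      # PCA transform, Eq. (2.108)
K_F = F @ F.T / F.shape[1]                     # covariance after the transform, Eq. (2.111)
print(np.allclose(K_F, np.diag(np.diag(K_F))))   # diagonal: complete decorrelation
print(np.isclose(np.trace(K_F), np.trace(K_S)))  # energy preserved, Eqs. (2.113)-(2.114)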
2.10.1.1 Summary Features of the PCA

The peculiar characteristics of principal component analysis are summarized as follows:

1. The goal of the PCA is to reduce the number of variables of interest to a set of independent or uncorrelated components (orthogonal components).
2. It analyzes all the variance of the variables of the input data (signals, images, …) and reorganizes it into a set of output components equal in number to the original variables.
3. The first component captures most of the overall variance, the second contains most of the residual variance, and so on for the remaining components until all the variance in the input data is accounted for.
4. The components are calculated as a linear combination of the input variables, and these combinations are based on the weights (eigenvectors) derived from the covariance matrix of the input data.
5. The reconstruction of the input data can be done with only the most significant components.
6. The PCA admits only one solution, which depends on the input data. The PCA method can be considered a method with unsupervised learning.
7. The PCA assumes input data with a multivariate normal distribution; from the (symmetric) covariance matrix the associated eigenvalues and eigenvectors are derived which, from a conceptual and geometric point of view, are the tools to evaluate the importance of the new coordinate axes onto which the input data are projected. In particular, each eigenvalue determines the importance, in terms of variance, of an axis (component) represented by the associated eigenvector, which determines the direction of the new axis in the new space (see Fig. 2.12).
8. The new axes defined by the (normalized) eigenvectors are called principal axes, and the new output variables are called the principal components of the input variables.
9. From a geometric point of view, the rotation of the new coordinate axes in the eigenspace does not alter the form and dimension of the distribution of the output data with respect to the original one. Basically, the norm of input and output vectors is maintained (see Fig. 2.12).
Fig. 2.12 Geometric representation of the PCA transform which corresponds to a rotation of the new coordinated axes (u, v) (the principal components) oriented in a direction parallel to the related eigenvectors a1 and a2
2.10.1.2 Essential Steps for Calculating PCA Components

Some PCA applications will be described below. We first summarize the essential steps for the calculation of the principal components, according to the context and symbols used in the previous paragraphs.

1. Determine the mean vector μ_S and the covariance matrix K_S of the input data which, for the sake of simplicity, are represented by a matrix S of dimension N² × M. Without loss of generality, the input data can also represent a set of M images, each consisting of N² pixels.
2. Determine the N² eigenvalues λ_i associated with the covariance matrix K_S of dimensions N² × N², ordered in descending order.
3. Determine the N² eigenvectors a_i corresponding to the N² eigenvalues determined in the previous point. The transform matrix A is thus obtained, with which the input data are projected into the new component space.
4. Perform the PCA transform with (2.108), thus obtaining the new components of the input data.
5. Analyze the significance of the new components and, in relation to the type of application, begin the selection of the principal components. Normally the informational content is in the first components, depending on whether the data are at the origin strongly correlated or not. Choosing the first Q components, with Q < M, results in a reduction of the data dimensionality, with a retained fraction of the total variance equal to

\frac{\sum_{i=1}^{Q} \lambda_i}{\sum_{i=1}^{M} \lambda_i} = \frac{\sum_{i=1}^{Q} \sigma^2_{S_i}}{\sum_{i=1}^{M} \sigma^2_{S_i}}    (2.115)

6. Rerun the direct transformation of the PCA with (2.108), once the transformation matrix of dimensions N² × Q has been built with the Q eigenvectors associated with the first Q principal components chosen in the previous point. In this way the desired data transformation is obtained, with a loss of information content by the factor determined in the previous point by choosing the first Q principal components.
7. Perform the inverse transformation of the PCA with (2.112) for the approximate reconstruction of the data Ŝ_Q, once the inverse transformation matrix A^{-1} of size N² × Q has been constructed with the Q eigenvectors associated with the first Q principal components chosen (see the sketch after this list). The information content lost in the reconstructed data is, in percentage,

100 \cdot \frac{\sum_{i=Q+1}^{M} \lambda_i}{\sum_{i=1}^{M} \lambda_i}    (2.116)

which corresponds to the mean squared error MSE given by

E_{MSE} = \frac{1}{M}\sum_{i=1}^{M} (S_i − \hat{S}_i)^2 = \sum_{i=Q+1}^{M} E\{F^2_{PCA_i}\} = \sum_{i=Q+1}^{M} \lambda_i    (2.117)
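As a minimal illustration of the seven steps above (NumPy assumed; the function name and the data layout, one sample per column, are illustrative assumptions), the following sketch computes the covariance, the ordered eigenvalues and eigenvectors, the transform, the retained variance, and the approximate reconstruction.

import numpy as np

def pca_steps(S, Q):
    """Sketch of the essential PCA steps for a data matrix S of shape (N2, M):
    M samples (e.g. images unrolled as columns) of dimension N2."""
    mu = S.mean(axis=1, keepdims=True)                 # step 1: mean vector mu_S
    X = S - mu
    K_S = X @ X.T / S.shape[1]                         # step 1: covariance matrix K_S
    lam, V = np.linalg.eigh(K_S)                       # steps 2-3: eigenvalues/eigenvectors
    order = np.argsort(lam)[::-1]                      # descending eigenvalues
    lam, V = lam[order], V[:, order]
    A = V.T                                            # the text's A: eigenvectors as rows
    F = A @ X                                          # step 4: PCA transform, Eq. (2.108)
    retained = lam[:Q].sum() / lam.sum()               # step 5: retained variance, Eq. (2.115)
    F_Q = A[:Q] @ X                                    # step 6: first Q components only
    S_hat = A[:Q].T @ F_Q + mu                         # step 7: reconstruction, Eq. (2.112)
    mse = lam[Q:].sum()                                # expected error, Eq. (2.117)
    return F_Q, S_hat, retained, mse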
The following paragraphs will describe some applications of linear transformations based on eigenvectors and eigenvalues. The first three applications will concern the use of principal component analysis for image compression, for the calculation of the principal axes of an object, and for the reduction of dimensionality in the context of multispectral images, respectively. The fourth application will use the concept of the eigenspace generated by the eigenvectors for automatic face recognition.
2.10.2 PCA/KLT for Data Compression

The PCA/KLT transform is shown to be optimal for removing the strong correlation of data, particularly in images, where the information (intensity) of adjacent pixels is very redundant. The transforms described in the preceding paragraphs (DCT, DHaT, …) define the transformation matrix independently of the input data. The PCA/KLT transform [15] calculates the basis in relation to the data, observable as a stochastic process, by diagonalizing the covariance matrix. Therefore, in the case of signals or images conceived as deriving from a stochastic process, it is possible to estimate the covariance matrix for the entire image or signal, or for parts of them. In the latter case, an acceptable estimate of the covariance matrix requires adequate representativeness of the data. A covariance matrix can also be estimated independently of the input data [16] when the data are considered to derive from a first-order stationary Markov process (also called Markov-1). In this case, by definition of the Markov-1 process, for a sequence of N data x the covariance matrix has the following form:

K_{ij} = σ^2 ρ^{|i−j|}, \qquad 1 ≤ i, j ≤ N
(2.118)
where 0 < ρ < 1 is the autocorrelation coefficient, which normally varies in the interval [0.95 ÷ 0.99]. There is an analytical solution for determining the eigenvalues of the covariance matrix K defined by (2.118). An eigenvector w_{ij} is given by

w_{ij} = \sqrt{\frac{2}{N + \mu_j}}\, \sin\!\left[ r_j\left( (i+1) - \frac{N+1}{2} \right) + (j+1)\frac{\pi}{2} \right]    (2.119)
where μ_j is the jth eigenvalue, calculated with the following expression:

\mu_j = \frac{1 - \rho^2}{1 - 2\rho\cos(r_j) + \rho^2}
(2.120)
with r_j the real and positive roots of the following transcendental equation:

\tan(N r) = - \frac{(1 - \rho^2)\sin(r)}{\cos(r) - 2\rho + \rho^2\cos(r)}    (2.121)

The demonstration of the analytical solution is given in [16]. An image can be considered as acquired with a Markov-1 process, and the covariance matrix can be modeled with (2.118), which in extended form is

K = \sigma_x^2
\begin{bmatrix}
1 & \rho & \rho^2 & \dots & \rho^{N-1}\\
\rho & 1 & \rho & \dots & \rho^{N-2}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1
\end{bmatrix}    (2.122)
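The Markov-1 model is easy to build and diagonalize numerically. The following sketch (NumPy assumed; names are illustrative) constructs the covariance matrix of Eq. (2.118) for N = 8 and ρ = 0.9543 and derives the KLT basis from its eigenvectors, as used for the basis images of Fig. 2.13.

import numpy as np

def markov1_covariance(N, rho, sigma2=1.0):
    """Covariance of a first-order Markov process, K_ij = sigma^2 rho^|i-j| (Eq. 2.118)."""
    i = np.arange(N)
    return sigma2 * rho ** np.abs(i[:, None] - i[None, :])

K = markov1_covariance(8, rho=0.9543)
lam, A = np.linalg.eigh(K)                 # eigenvalues/eigenvectors of K
A = A[:, np.argsort(lam)[::-1]]            # order the basis by decreasing variance
# The 2D basis images of the separable KLT are outer products of pairs of eigenvectors
basis_00 = np.outer(A[:, 0], A[:, 0])      # first basis image (cf. Fig. 2.13)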
The basis images of the PCA/KLT transform associated with the eigenvectors deriving from the covariance matrix (2.122) are shown in Fig. 2.13. They were calculated with an autocorrelation coefficient ρ = 0.9543 and with N = 8. The basis images of the PCA/KLT tend to be identical to those of the cosine transform (with the same sinusoidal shape) for values of ρ → 1. The basis are ordered with the increasing variance starting from the top left representing the lower values in black. The Markov-1 approach, used to calculate the covariance matrix,
Fig. 2.13 The basis images of the PCA/KLT for ρ = 0.9543 and with N = 8
Fig. 2.14 Application of PCA/KLT transform with a the covariance matrix computed with Markov-1 statistics; b the co-occurrence matrix to verify the correlation of the adjacent pixels of the test image at a distance of 4 pixels horizontally; c the transform spectrum in which we see the concentration of information in a few coefficients (red zone); d the covariance matrix after the transformation; e the spectrum of the transform with the zeroed coefficients (blue zone) equal to 22%; f the reconstructed image with a loss of information equal to 2.3%; g the spectrum with the zeroed coefficients equal to 80.5%; h the rebuilt image with a loss of 5.7%
assumes that the data have the autocorrelation characteristic described by the correlation model defined in (2.118). Moreover, for the test image the same value of the autocorrelation coefficient has been assumed for adjacent horizontal and vertical pixels (ρ_H = ρ_V), i.e., a separable and isotropic distribution. In fact, Fig. 2.14a and b show, respectively, the covariance matrix relative to the Markov-1 model and the co-occurrence matrix obtained from the test image to verify the correlation of adjacent pixels. We note the agreement between the predicted autocorrelation model and that verified on the test image up to a distance of 8 pixels. The figure also shows the results of the PCA/KLT based on the Markovian covariance matrix. Figure 2.14c and d show, respectively, the spectrum and the covariance matrix after the PCA/KLT transform. They highlight the high concentration of the information content (in red) and the decorrelation of the transformed data: the elements outside the diagonal have very low values. The following columns show the images reconstructed by eliminating the least significant coefficients in percentages identical to those applied with the DCT (Fig. 2.4). In fact, reconstructed images are obtained with identical rms error (calculated with Eq. 2.117) (see Fig. 2.14e to h). Figure 2.15 instead shows the results of the image reconstruction using 1 to 4 principal components. The covariance matrix used is always the same, the Markovian one. The second row shows the results of the PCA/KLT applied by decomposing the image into 8 × 8 blocks, as done for the DCT. From the analysis of the results,
[Fig. 2.15 panels — full image 256×256: N.EigV=128 (loss 1.6%, rms 6.3%), 100 (2.3%, 6%), 75 (3.3%, 5.9%), 50 (5.2%, 8.6%); 8×8 blocks: N.EigV=4 (1.8%, 5.7%), 3 (2.8%, 8.3%), 2 (4.9%, 10.1%), 1 (12%, 11.2%)]
Fig. 2.15 Application of the PCA/KLT transform with the covariance matrix always calculated with Markov-1 statistics. The first row shows the results of the reconstruction by working on the entire image and using only the most significant components that decrease from left to right, thus increasing the loss of information. The second row shows the results obtained by performing a decomposition of the image with blocks of 8 × 8
in terms of information lost, the two transformations exhibit the same performance despite using the Markovian covariance matrix. Finally, Fig. 2.16 shows the results of the PCA/KLT when the basis of the transform is calculated with the covariance matrix obtained from the statistics of the test image itself. The first column of the figure shows the image covariance matrices, computed by rows and by columns, and the covariance matrix computed after the transform. It can be noted that the first two matrices have many scattered elements with very high correlation values, so the image has highly correlated pixels. On the contrary, in the covariance matrix after the transform all elements outside the main diagonal are zero-valued (blue), demonstrating that all components of the PCA/KLT are fully decorrelated and the entire information content is represented by the first principal components. In particular, from the analysis of the results and as expected, with the image decomposed into blocks, the reconstruction with only 1 eigenvector loses 2.17% of the information (estimated with Eq. 2.116; rms 6.5%), with 2 eigenvectors the loss is 0.37% (rms 4.66%), and with 3 eigenvectors the loss is 0.2% (rms 4.16%). The transformation over the entire 256 × 256 image gives significantly better results: to obtain the same distortion, measured by the rms, the image reconstruction requires 25, 50, and 75 principal components out of 256, respectively. The performance also depends on the size of the blocks chosen and on the level of intrinsic correlation of the image. Big blocks guarantee better decorrelation, and therefore a better coding (an aspect that will not be considered in this context), which is strategic
[Fig. 2.16 panels — full image 256×256: N.EigV=75 (loss 0.16%, rms 4.2%), 50 (0.34%, 4.4%), 25 (0.8%, 5.9%); 8×8 blocks: N.EigV=3 (0.19%, 4.2%), 2 (0.38%, 4.6%), 1 (2.2%, 6.5%)]
Fig. 2.16 Application of PCA/KLT transform with the covariance matrix calculated directly from the image statistic. The first column shows the covariance matrices of the image calculated first by rows and then by column. The covariance matrix after the transformation is also shown. The following columns show the results of the reconstruction using only the most significant components that decrease from left to right, thus increasing the loss of information. The transformation was performed on the whole image (first row) and also by decomposing it into blocks of N × N (second row)
in the applications of compression and data transmission. On the other hand, the computational load increases linearly with the number of pixels per block, while the size of the covariance matrix increases with the square of the number of pixels. This therefore entails an additional computational load for the calculation of eigenvectors and eigenvalues, also considering the need for an adequate number of samples for a significant covariance matrix. In fact, while the various transforms guarantee decorrelation within a block, there is no guarantee of obtaining it between blocks. One way to mitigate the problem of adjacent-block correlation is to encode only the difference between adjacent coefficients after encoding the first (a method known as DPCM, Differential Pulse Code Modulation). The size of the blocks, therefore, requires a compromise between the level of decorrelation to be achieved and the computational complexity. Normally, block-level statistics tend to be equivalent. It follows that the statistics of some blocks of an image can be used as prototypes of the entire image, and used to compute the PCA/KLT for other blocks or even for other images. Figure 2.17 shows an example in which an image (b) is transformed with the basis calculated from another image (a), and then rebuilt without generating significant distortions when all the principal components are used. Using 200 components, image (b) is fully rebuilt but the texture of the background appears, with a 10% rms error and a loss of 0.13%. Summarizing, it can be said that the orthogonal transforms tend to decorrelate the input data (signals, images, …), while the PCA/KLT represents the optimal transform that decorrelates them completely. Another characteristic of these transformations
Fig. 2.17 Application of PCA/KLT transform that reconstructs an image with the principal components computed with the covariance matrix of another image. a Image with which the covariance matrix is calculated; b the image completely different from (a) rebuilt with the components of the (a); c Image (b) rebuilt with the first 200 principal components of (a)
is that of concentrating in a few coefficients or components the majority of the information content (i.e., the energy, in the case of signals) of the input data. From the results shown, the PCA/KLT exhibits good performance, and the DCT tends to achieve almost optimal results, close to those of the PCA/KLT. Obviously, the performance of all transforms depends on the level of correlation of the input data. Normally, signals and images have the natural characteristic of high redundancy, depending on the physics of the sensors. For example, a multispectral image, composed of several images, each acquired with different sensors sensitive to certain bands with wavelengths from the visible to the infrared, de facto exhibits a strong correlation between its components. In this case, the PCA/KLT is essential to extract the most significant, completely decorrelated components. Despite the optimal characteristics of decorrelation and energy compaction, the PCA/KLT is not used for data compression and transmission because of the required computational complexity and because the basis is calculated from the covariance matrix, which in turn depends on the statistics of the input data, thus requiring a sufficiently adequate number of samples. The other orthogonal transforms are preferred; for compression, normally the DCT, for which there is also a fast algorithm reducing the complexity of the transform from O(N²) to O(N log₂ N), often realized also in firmware. In essence, the PCA/KLT is used in this context as a benchmark transformation to evaluate the performance of other orthogonal transforms. Instead, it finds notable applications in other fields such as automatic recognition of objects or classes of objects (Pattern Analysis and Machine Intelligence), feature extraction, texture analysis, etc.
2.10.3 Computation of the Principal Axes of a Two-Dimensional Object

In several image processing applications it can be useful to determine the principal axes of the shape of an object whose orientation is unknown. Figure 2.18 shows an object that is randomly oriented relative to the coordinate axes (s_1, s_2). We want to calculate the center of mass and the orientation of the principal geometric axes. If we want to use the PCA transform, the coordinates of each pixel of the object S are treated as random variables (s_1, s_2), of which it is possible to calculate the mean and the covariance matrix:

\mu_{s_1} = \frac{1}{N}\sum_{i=1}^{N} s_1^i, \qquad \mu_{s_2} = \frac{1}{N}\sum_{i=1}^{N} s_2^i, \qquad K_S ≅ \frac{1}{N}\sum_{i=1}^{N} S_i S_i^t − \mu_S \mu_S^t
where N indicates the number of pixels of the object S, whose shape has already been detected in the binary image, and S_i(s_1^i, s_2^i) is the ith pixel of the object S = {S_1(s_1^1, s_2^1), S_2(s_1^2, s_2^2), . . . , S_N(s_1^N, s_2^N)}, seen as a vector represented by its coordinates (s_1^i, s_2^i). The covariance matrix K_S is of size 2 × 2. The principal geometric axes of the object coincide with the principal axes of the PCA transform, and it is known that they correspond to the eigenvectors of K_S. In other words, the eigenvectors of K_S are aligned, respectively, with the direction of maximum geometric dispersion of the pixels, i.e., the direction of maximum variance (considering that the pixel coordinates (s_1^i, s_2^i) are assumed to be random variables), and with the direction perpendicular to it. Moreover, the eigenvectors have the constraint of orthogonality. Having calculated the eigenvector matrix A of the covariance matrix K_S, the new coordinates (u, v) of the object S with respect to its center of mass (centroid) μ_S are obtained by applying to each point S_i(s_1^i, s_2^i) Eq. (2.108), which we rewrite as F_PCA = A(S − μ_S)
(2.123)
Recalling the classical rotation equation of an image (Eq. 3.27), introduced in Chap. 3 Geometric transformations, it can be observed that it coincides with (2.123) by identifying the rotation matrix with the matrix of the eigenvectors A:

A = \begin{bmatrix} a_{11} & a_{12}\\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} \cos θ & \sin θ\\ −\sin θ & \cos θ \end{bmatrix}    (2.124)

where θ is the angle of inclination of the new axis u (first principal axis of the PCA transform) with respect to the original axis s_1 (see Fig. 2.18).
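A minimal sketch of this computation, assuming a binary image stored in a NumPy array and illustrative function names:

import numpy as np

def principal_axes(binary_img):
    """Centroid and principal-axis orientation of a binary object via the 2 x 2 PCA."""
    ys, xs = np.nonzero(binary_img)                  # pixel coordinates of the object
    pts = np.stack([xs, ys], axis=0).astype(float)   # 2 x N matrix of (s1, s2) coordinates
    mu = pts.mean(axis=1)                            # center of mass (centroid)
    K = np.cov(pts)                                  # 2 x 2 covariance matrix K_S
    lam, A = np.linalg.eigh(K)
    a1 = A[:, np.argmax(lam)]                        # eigenvector of the largest eigenvalue
    theta = np.arctan2(a1[1], a1[0])                 # orientation of the first principal axis, cf. Eq. (2.124)
    return mu, theta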
2.10.4 Dimensionality Reduction

The PCA transform is also used to reduce the dimensionality of the output vectors [2,17]. Recalling that the rows of the transformation matrix A represent the eigenvectors,
Fig. 2.18 Application of PCA/KLT transform to detect the orientation of the principal axis of the shape of an object in a binary image
eliminating some of these, for example by considering the transformation matrix B of dimensions N² × Q, with Q < M, obtained by eliminating M − Q eigenvectors of A, Eqs. (2.102) and (2.103) become, respectively, the following:

\hat{F}_{PCA} = B S    (2.125)

and

\hat{S} = B^t \hat{F}_{PCA}    (2.126)
In other words, the PCA transform expressed by Eqs. (2.125) and (2.126) reduces the dimensionality of the output space to Q components, introducing an error that can be estimated from the information content of the eliminated components, whose weight is approximated by their variance, i.e., by the eigenvalues λ_i, i = Q + 1, . . . , M, corresponding to the eliminated eigenvectors. The approximate mean squared error (MSE) (described in Sect. 6.11 Vol. I) results in

err_{MSE} = \sum_{i=Q+1}^{M} \lambda_i    (2.127)
For the reasons expressed in the preceding paragraph, discarding the last components, this error can be very low, even negligible. For example, in the case of multispectral images, the first two components often contain 95% of the information, with an error of 5%. Here the principal components are used not for compressing information but, in many image processing applications, particularly in the process of object recognition, to select the significant features of objects, thus eliminating the redundant information.
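The error estimate of Eq. (2.127) can be verified numerically. The following sketch, on hypothetical data and with illustrative names, compares the measured reconstruction error with the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(64, 500)) * np.linspace(3.0, 0.1, 64)[:, None]   # hypothetical data: 500 samples of dimension 64
X = S - S.mean(axis=1, keepdims=True)
lam, V = np.linalg.eigh(X @ X.T / X.shape[1])     # eigen-decomposition of K_S
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

Q = 8
B = V[:, :Q].T                                    # reduced transform matrix (Q x 64), Eq. (2.125)
F_hat = B @ X                                     # reduced components
X_hat = B.T @ F_hat                               # approximate reconstruction, Eq. (2.126)
mse_expected = lam[Q:].sum()                      # Eq. (2.127)
mse_measured = ((X - X_hat) ** 2).sum(axis=0).mean()
print(np.isclose(mse_expected, mse_measured))     # True (up to floating point)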
An effective application of the PCA is in the processing of multispectral and multitemporal images for the visualization and classification of the territory. For example, multispectral images, normally composed of several components, are reduced to the three most significant ones to properly simulate the primary colors red, green, and blue (see Chap. 3 The Color in Vol. I). The reduction of dimensionality in the context of multispectral images is described in the following paragraph.
2.10.5 Calculation of Significant Components in Multispectral Images

Another application of the PCA transform [17] concerns the selection of the representative data that best characterize an object, in particular when a high abundance of acquired samples is available. For example, in remote sensing applications, the same geographic area is observed from a satellite or an airplane by acquiring M images corresponding to different spectral bands (in the visible it may be the band of red or green, or the near-infrared, etc.). In this case we speak of multispectral images S = (S_1, S_2, . . . , S_M), and the pixels of the ith band are represented by the vector S_i(x, y) = {s_1(x, y), s_2(x, y), . . . , s_M(x, y)}, where (x, y) indicates the position of the pixel in the multispectral image and the index i = 1, . . . , M indicates the spectral band (or component). Therefore, a pixel at position (x, y) in a multispectral image has M values, as many as the sensors corresponding to the M spectral bands (see Fig. 2.19). From a physical point of view, the M sensors are designed to give each its own measurement (spectral signature) of the same area of the observed territory.

Fig. 2.19 Organization of the data of a multispectral image with dimensions N × N formed by M spectral bands, with the generic pixel S_i(x, y), i = 1, 2, . . . , M, representing a random variable of size equal to the number M of the bands

We are interested in knowing how the pixel values that are in the same location (x, y) in the multispectral image are correlated with each other, and which are,
among them, the most significant bands that best characterize the territory. In this context, the PCA transform is strategic to define the performance and the discriminating characteristics of the various multispectral sensors. To apply the PCA transform we assume one pixel of the multispectral image S as a random variable of M dimensions, and the multispectral image as a population consisting of N × N pixels corresponding to the image size (Fig. 2.19). For clarity, a component S_i(x, y) of the multispectral image is called a band, so as not to confuse it with the components of the PCA transform. Applying the PCA transform we will have M new orthogonal components that are not correlated, and we can easily choose the most significant ones by analyzing the variance of each band. Basically, with the PCA transform the M multispectral bands are projected into a new space, always of M dimensions, in which the first component has the greatest variance and is the most significant, while the others, with decreasing variance, carry the remaining, ever smaller, informative content, with the last components dominated by noise. Equation (2.123) in this case is interpreted as follows:

1. A represents the transformation matrix, which corresponds to the matrix of the eigenvectors derived from the covariance matrix K_S of dimensions M × M calculated with (2.106);
2. S_i, i = 1, . . . , M, are the bands of the multispectral image, of identical pixel dimensions N × N and co-registered among them;
3. μ_S is the vector representing the expectation value of S, calculated with (2.107), i.e., the average of each component image (spectral band), an approximation of the expected value assuming that each pixel value is equally probable;
4. F_PCA is the vector of the PCA transform.

In this context, the transformation matrix A reports in each row the normalized eigenvectors of the covariance matrix K_S of the original M bands. The covariance, in this example, measures the level of correlation between the bands. Each image is acquired with sensors that have a dominant spectral sensitivity in a certain band of the spectrum (for example, in the case of multispectral images for the observation of the Earth's surface, acquired from satellite or airplane, the first band corresponds to green, the second to red, and so on). For example, the element K_S(1, 2) of the covariance matrix expresses the correlation between the first and second bands, which are normally strongly correlated (i.e., K_S(1, 2) has a high value). This can be explained by the technical difficulty of producing sensors with very narrow spectral responses corresponding only to a certain band. It follows that points of the scene observed with very high brightness values will have very high pixel values in the correlated bands. Vice versa, if a pair of bands (i, j) is not very correlated, the covariance K_S(i, j) will have very low values, down to zero when the two bands are completely orthogonal. We already know that this happens with the PCA transform, where the values of the new components are orthogonal and therefore linearly independent. In fact, if we calculate the covariance matrix of the components of the transform, this results in a diagonal matrix, with only the
diagonal elements nonzero, representing the variance of each component (band), and the other elements, with zero value (2.111), indicating the complete decorrelation of the new components. If, instead, the eigenvectors of the transformation matrix are normalized, the covariance matrix of the transform corresponds to the identity matrix (with the elements of the main diagonal equal to 1 and the others zero). Figure 2.20 shows two multispectral images, indicated with S1 and S2, with 6 bands (coverage from the visible to the infrared) acquired by the Landsat satellite for the monitoring of the territory. The images chosen concern two types of territory, one urban and the other of cultivated land with different dominant vegetation. The 6 × 6 correlation matrices of the two types of multispectral images are shown in Fig. 2.21. It can be observed how the various bands show a different correlation between them as the type of territory varies. In particular, for the image S1 (inhabited center with the sea) some pairs of bands show a high correlation, while for the image S2, with dominant vegetation, different combinations of bands have a lower correlation, due to the different sensitivity of the sensors to different vegetation. Applying the PCA/KLT transform results in totally uncorrelated components, and it is confirmed that the components corresponding to the two images exhibit very different variances. In fact, from Fig. 2.20 it can be observed that, for the image S1,
[Fig. 2.20 panels — component variances: image S1: C.P. 1: 93.33, C.P. 2: 4.29, C.P. 3: 1.73, C.P. 4: 0.26, C.P. 5: 0.23, C.P. 6: 0.15; image S2: C.P. 1: 73.23, C.P. 2: 19.82, C.P. 3: 4.82, C.P. 4: 1.21, C.P. 5: 0.55, C.P. 6: 0.35]
Fig. 2.20 Application of PCA/KLT transform for the determination and analysis of the significant bands of a multispectral image. For the two satellite images S1 and S2, the 6 bands and the six principal components from left to right are shown in order of decreasing variance. It is already observed on a qualitative level how the information content decreases from the first to the last component. In particular, for the S1, with a more homogeneous territory, the first principal component contains 93% of all the information related to the 6 bands. The S2 instead, with a less homogeneous territory (different types of crops), the first component has 73% of the information. The other components gradually decrease in content and can be neglected
[Fig. 2.21 data — correlation matrix of image S2 (6 × 6 bands):
1.0000 0.9370 0.9421 -0.0356 0.8001 0.8125
0.9370 1.0000 0.9287 0.0078 0.8274 0.8359
0.9421 0.9287 1.0000 -0.1465 0.8460 0.8777
-0.0356 0.0078 -0.1465 1.0000 -0.0149 -0.2429
0.8001 0.8274 0.8460 -0.0149 1.0000 0.9146
0.8125 0.8359 0.8777 -0.2429 0.9146 1.0000]
Fig. 2.21 Correlation matrices related to multispectral images S1 and S2 of Fig. 2.20
the first principal component explains 93% of the variance and the second 4.3%; that is, these first two principal components would be sufficient to represent all the information content. The first principal component of the other image instead explains only 73% of the variance, and the second almost 20% of the remaining variance. This different weight of the principal components of the two images depends precisely on the sensitivity of the sensors, influenced by the different energy emitted by the territory as the vegetation changes. It should also be noted that, for both images, the 4 remaining components are scarcely significant, representing just 5% of the information content, and the noise of the sensors becomes evident. Basically, with the PCA transform the correlation between the bands is completely eliminated, and in this application context the PCA proves very effective for the reduction of dimensionality, especially when the number of bands is high, as happens with the most recent satellites. This feature of the principal components is used, in particular, to simplify and improve the image classification process (see paragraph Object recognition, Chap. 1 Vol. III). Another effective use is the visualization of a multispectral image, considering the high number of bands. Figure 2.22 shows a false-color representation of the two multispectral images S1 and S2, combining as RGB components the first three principal components of each image, representing, respectively, 99.2% and 97.2% of the information content.
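A false-color composition of this kind can be sketched as follows (NumPy assumed; the cube layout (H, W, M), the min–max stretching, and the function name are illustrative assumptions):

import numpy as np

def pca_false_color(cube):
    """Map an (H, W, M) multispectral cube to an RGB image using its first 3 principal components."""
    H, W, M = cube.shape
    X = cube.reshape(-1, M).astype(float)          # one M-dimensional sample per pixel
    X -= X.mean(axis=0)                            # center each band
    K = np.cov(X, rowvar=False)                    # M x M band covariance matrix
    lam, A = np.linalg.eigh(K)
    A = A[:, np.argsort(lam)[::-1]]                # eigenvectors by decreasing variance
    pcs = X @ A[:, :3]                             # first three principal components
    pcs -= pcs.min(axis=0)                         # stretch each component to [0, 1]
    pcs /= pcs.max(axis=0)
    return pcs.reshape(H, W, 3)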
2.10.6 Eigenface—Face Recognition

The eigenface approach is an appearance-based method for automatic face recognition that seeks to capture the significant variations in a set of facial images (acquired under different lighting conditions and facial poses) and to use this information to encode and compare the images of individual faces with a non-traditional approach, not based on the extraction of features of parts of the face. In essence, eigenfaces are the principal components of a set of faces, i.e., they are the eigenvectors of the covariance matrix of the set of facial images, in which an image with N × N pixels is represented by a vector (i.e., a point) in a space of size N². The eigenface method that uses the principal components to recognize human faces was conceived by Turk and Pentland in 1991 [18]. Subsequently, several other faster approaches have been developed, and eigenfaces are often used as a comparison baseline.
Fig. 2.22 Realization of RGB images using the first three principal components related to the two multispectral images of Fig. 2.20 representing, respectively, 99.2% and 97.2% of the information content of each of the multispectral images
Fig. 2.23 Database of the 40 faces of the archive AT&T
We now describe the method with a concrete example. Suppose we consider the 40 subjects of the AT&T² database shown in Fig. 2.23, for each of whom 10 facial images have been acquired. Figure 2.24 shows, out of the 40 faces of the AT&T archive, the first 5 of the 10 images taken of the first 4 individuals. Our training set, or database, is therefore characterized by M = 400 images, each of size N × N.
2 The ORL Database of Faces is an archive of AT&T Laboratories Cambridge that contains a set of faces acquired between 1992 and 1994 to test algorithms in the context of face recognition.
Fig. 2.24 For each subject, 10 different images are acquired as the pose changes; the first 5 images of the first 4 subjects are displayed
We are interested in finding a basis that can best represent the individual subjects. Here we must distinguish two aspects of the representation. One can be bound to find a compact notation in order to faithfully reconstruct the image, as in the case of a stream of video images; or a very compact notation in which it is only necessary to distinguish classes of objects, as in this example, where we are interested in solving the problem of recognizing an individual's identity (for example, while he is crossing a gate, to check whether he is admitted or to detect his presence). We want to find a set of M orthonormal vectors $\mathbf{u}_k$ that best describes the distribution of faces. In other words, we want to formulate the eigenvalue problem in a space of dimension M, equal to the number of sample images, and not based on the dimensionality of the sample images, if N >> M. In fact, if we consider the M sample images (seen as a set of prototype faces) of 256 × 256 pixels, the associated covariance matrix would be of the considerable size of 65536 × 65536, with a corresponding number of 65536 eigenvectors. We also know that less than about 5% of these eigenvectors are representative. The problem of dimensionality is solved with the following algebraic artifice (Turk and Pentland). Each of the M sample images is organized into an $N^2 \times 1$ vector which becomes a column of the matrix X of dimensions $N^2 \times M$. According to the eigenvalue equation (2.95) we rewrite, indicating with $\mathbf{v}_k$ the eigenvectors,

$$(X X^t)\,\mathbf{v}_k = \lambda_k\,\mathbf{v}_k \qquad (2.128)$$

where $X X^t$ represents the (large, $N^2 \times N^2$) covariance matrix of the matrix containing all the prototype images X organized in the columns as indicated above. Multiplying both members by $X^t$ and rearranging, we get

$$X^t (X X^t)\,\mathbf{v}_k = X^t \lambda_k\,\mathbf{v}_k \quad\Longrightarrow\quad (X^t X)\,\underbrace{(X^t \mathbf{v}_k)}_{\mathbf{u}_k} = \lambda_k\,\underbrace{(X^t \mathbf{v}_k)}_{\mathbf{u}_k} \qquad (2.129)$$

from which it emerges that the vectors $X^t \mathbf{v}_k$ are the eigenvectors of the matrix $X^t X$, with the advantage of being of dimensions M × M, according to the objective of reducing the dimensionality. If we indicate with $\mathbf{u}_k = X^t \mathbf{v}_k$ the eigenvectors of $(X^t X)$, these are then related to the eigenvectors $\mathbf{v}_k$ of the covariance matrix $(X X^t)$ of the set X of the prototype face images by the relation

$$\mathbf{v}_k = X\,\mathbf{u}_k \qquad (2.130)$$
According to (2.129) it can be argued that the covariance matrices $(X^t X)$ and $(X X^t)$ have the same eigenvalues, and their eigenvectors are related by (2.130). In particular, the M eigenvalues and eigenvectors associated with $(X^t X)$ correspond to the M eigenvalues and eigenvectors of the matrix $(X X^t)$ with the highest values. Such eigenvectors are called eigenfaces. Each face, both a prototype and one to be recognized, is expressed as a linear combination of these eigenfaces. The calculation of the eigenfaces includes the following steps (a numerical sketch of these steps is given after the recognition example below):

1. Acquire the M images of the prototype faces of dimension N × N, organized as columns of the matrix $X(i,j)$, $i = 1, \dots, N^2$; $j = 1, \dots, M$.

2. Calculate the average face, which is given by the following image:

$$\Psi = \frac{1}{M}\sum_{i=1}^{M} X_i \qquad (2.131)$$

and subtract the average from each original image, storing the result in the new variable

$$\Phi_i = X_i - \Psi \qquad i = 1, \dots, M \qquad (2.132)$$

Thus we have the new matrix $\Phi = \{\Phi_1, \Phi_2, \dots, \Phi_M\}$ of dimensions $N^2 \times M$, whose columns are the mean-subtracted vectors of the set of prototype faces.

3. Calculate the covariance matrix $(\Phi^t \Phi)$ of dimensions M × M:

$$C_M = \frac{1}{M}\,\Phi^t \Phi$$

4. Select the M eigenvalues $\lambda_i$ and eigenvectors $\mathbf{u}_i$ associated with the covariance matrix $(\Phi^t \Phi)$.

5. Calculate the M eigenvectors related to the covariance matrix $(\Phi \Phi^t)$ with (2.130) and normalize them to unit norm. The M eigenvectors of $C_M = (\Phi^t \Phi)$ are used to find the M eigenvectors of the covariance matrix $(\Phi \Phi^t)$ that form our eigenface prototypes

$$\mathbf{v}_k = \Phi\,\mathbf{u}_k = \sum_{l=1}^{M} u_{kl}\,\Phi_l$$

where $u_{kl}$ denotes the lth component of $\mathbf{u}_k$ and the $\mathbf{v}_k$ are the eigenvectors, i.e., the eigenfaces. From the analysis of the eigenvectors obtained, it is best to keep only the K eigenvectors that correspond to the K largest eigenvalues (the highest eigenvalue reflects the highest variance). Eigenfaces with low eigenvalues can be omitted, as they explain only a small part of the characteristic features of the prototype faces.

6. Project the K prototype faces into the space of the eigenfaces, represented as a linear combination of the K eigenvectors. They form the new eigenvector matrix V, so that each vector $\mathbf{v}_k$ is a column vector. The size of the matrix V is reduced to $N^2 \times K$, where K < M are the most significant eigenvectors.
Fig. 2.25 Image of the average face calculated in step 2, and the first and second eigenfaces of the 400 generated with step 5 from the archive images AT&T
Therefore, if we indicate with $F_{PCA}$ the matrix representing the output of the projection of the prototype faces, we obtain the weights of each prototype face with the following PCA transform:

$$F_{PCA} = V^t \cdot \Phi \qquad (2.133)$$

where the columns of the matrix $F_{PCA} = \{F_1, F_2, \dots, F_K\}$ represent the K prototype faces in the K-dimensional eigenface space.

7. Face recognition. The last step is the recognition of faces. The image of the person we want to find in the set of K eigenfaces is transformed into a column vector Y, reduced by the mean value and then projected into the eigenface space, according to Eq. (2.133), as follows:

$$\omega_Y = \mathbf{v}_k^t\,(Y - \Psi) \qquad k = 1, 2, \dots, K \qquad (2.134)$$

where $\omega_Y$ represents the features of the face Y to be recognized in the eigenface space. Face recognition is achieved by finding the minimum Euclidean distance between $\omega_Y$ and the projection of each prototype face in the eigenface space. The face Y is recognized only if this minimum distance is below a certain threshold (determined experimentally); otherwise it is declared unknown. Each original face image, and the test one, can be reconstructed by adding the mean image to the weighted summation of all the eigenface vectors V, using the inverse PCA transform, as follows:

$$\hat{X} = V \cdot F_{PCA} + \Psi \qquad (2.135)$$

where $\hat{X}$ is the reconstructed face vector. In Fig. 2.25 are shown the image of the average face of the AT&T database together with the first and second eigenfaces of the M = 400 generated with step 5. Figure 2.26 shows instead the prototype faces reconstructed using the first 50 most significant eigenfaces, which correspond to 81.84% of the overall information. Let us now look at the recognition phase of a face.
Fig. 2.26 Reconstruction of the first face of the 40 subjects using only the K = 50 most significant eigenfaces of the 400 available. These eigenfaces represent 81.84% of the overall information content
Given an unknown face, acquired under the same conditions as the prototypes, with an image of identical size and scale, its identification, that is, the archival search of a face similar to one of the prototype faces, requires some of the previous steps to be performed. In particular, the normalization and projection of the unknown face into the eigenface space is performed with (2.133) (step 6), calculating the coefficients with the PCA transform. Next, we calculate the Euclidean distance in the eigenface space by comparing the unknown face with all the prototype faces and considering as the most likely one the prototype with the lowest Euclidean distance within a predefined confidence threshold. Finally, with (2.135) the identified face, if it exists, is rebuilt. Figure 2.27 illustrates in (b) the results of the face recognition, considering some faces already present in the archive and three faces not present among the prototypes, the latter reported in (a). By giving as input a face already existing among the prototypes, once projected into the eigenface space, it is compared with all the prototypes and the similar ones are recognized by evaluating the Euclidean distance. The first 2 faces with minimum distance are shown, which normally belong to the same subject seen from different positions or with different facial expressions. The images in the first rows of Fig. 2.27b show the result of the recognition, and the Euclidean distance is reported as a measure of similarity. The images of the last three rows show the result of the search for faces of subjects not included among the prototypes; as we can see, the Euclidean distance is very high and no correctly similar faces are found. These faces, not included in the archive, were projected into the eigenface space and then reconstructed with the eigenfaces of the prototypes; in Fig. 2.27a are shown the reconstructed images, their difference with the originals, and the related reconstruction errors. Obviously, if they are inserted among the prototypes the reconstruction error is zero, as shown in the first column of figure (b); in fact, the Euclidean distance is zero, that is, they are the same faces.
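The following is a minimal numerical sketch, not the authors' implementation, of steps 1-7 and of the recognition test just described, written with NumPy; the array `faces` (the M training images), the probe `face`, and the `threshold` are hypothetical inputs.

```python
# Minimal eigenface sketch: training (steps 1-6) and recognition (step 7).
import numpy as np

def train_eigenfaces(faces, K):
    """faces: array of shape (M, N, N). Returns mean face, eigenfaces V and prototype weights W."""
    M = faces.shape[0]
    X = faces.reshape(M, -1).astype(float).T       # columns are the N^2 x 1 face vectors
    psi = X.mean(axis=1, keepdims=True)            # average face (Eq. 2.131)
    Phi = X - psi                                  # zero-mean faces (Eq. 2.132)
    C = Phi.T @ Phi / M                            # small M x M matrix (Turk-Pentland trick)
    eigval, U = np.linalg.eigh(C)                  # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:K]           # keep the K largest eigenvalues
    V = Phi @ U[:, order]                          # back-project to get the eigenfaces (Eq. 2.130)
    V /= np.linalg.norm(V, axis=0)                 # unit-norm eigenfaces
    W = V.T @ Phi                                  # K x M weights of the prototypes (Eq. 2.133)
    return psi, V, W

def recognize(face, psi, V, W, threshold):
    """Project the probe face (Eq. 2.134) and return the closest prototype if within threshold."""
    y = face.reshape(-1, 1).astype(float)
    w = V.T @ (y - psi)                            # features of the probe in eigenface space
    dists = np.linalg.norm(W - w, axis=0)          # Euclidean distances to all prototypes
    j = int(np.argmin(dists))
    return (j, dists[j]) if dists[j] < threshold else (None, dists[j])
```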
Fig. 2.27 Results of the recognition phase: a faces of subjects not included among the prototypes for which the eigenfaces were generated, their reconstruction with the eigenfaces of the prototypes, and the difference between the reconstructed image and the original; b in the first column the faces to search for, while the following columns show the faces with greatest similarity, evaluated with the Euclidean distance indicated by the number above each face
2.11 Singular Value Decomposition SVD Transform

The SVD transform is based on an approach used in numerical computation for the decomposition into singular values of large matrices, known as Singular Value Decomposition (SVD) [2,19]. It is mainly used for the solution of overdetermined linear systems and regression problems (for example, least squares). The SVD approach has important applications in the field of image processing, for example, for compression, for filtering (noise reduction), and for estimating the power spectrum. The direct SVD transform, unlike the other transforms, provides two kernel matrices; therefore, it cannot be expressed in the simple form F = A · S (analogous to Eq. 2.102), which through the kernel matrix A produces the representation F of the input image represented by the matrix S of N × N dimensions. The direct SVD transform can be expressed by the following equation:

$$F_{SVD} = \Lambda^{1/2} = U^t\,S\,V \qquad (2.136)$$

where U and V are unitary (i.e., orthogonal) matrices of N × N dimensions, such that $\Lambda^{1/2}$ corresponds to a diagonal matrix, whose elements $\lambda^{1/2}(k,k)$ for $k = 1, \dots, N$
typically have non-negative values and are called singular values of the image matrix S. These singular values should not be confused with the eigenvalues of the matrix S, even if they have some connection, for the reasons we now explain. The matrices U and V, which must satisfy (2.136), being orthogonal, satisfy the relations $U^t U = I_N$ and $V^t V = I_N$, where $I_N$ represents the identity matrix of dimensions N × N. Multiplying the last two members of (2.136) first by U and then by $V^t$, we get the following inverse relationship:

$$S = U\,\Lambda^{1/2}\,V^t \qquad (2.137)$$
For the SVD transform, unlike the others that have a single kernel matrix, there are two kernel matrices U and V that satisfy the pair of equations (2.136) and (2.137). In the case of real images, the $S S^t$ and $S^t S$ matrices are symmetric and have the same eigenvalues $\{\lambda_i\}$ for $i = 1, \dots, N$. It is possible to find k = N orthogonal eigenvectors $\{\mathbf{u}_k\}$ of the symmetric matrix $S S^t$ and, similarly, k = N eigenvectors $\{\mathbf{v}_k\}$ of the symmetric matrix $S^t S$ that satisfy the eigenvalue characteristic equations

$$(S\,S^t)\,\mathbf{u}_k = \lambda_k\,\mathbf{u}_k \qquad (2.138)$$

$$(S^t S)\,\mathbf{v}_k = \lambda_k\,\mathbf{v}_k \qquad (2.139)$$

for $k = 1, \dots, N$. In matrix form we have the corresponding equations, similar to (2.105),

$$\Lambda = U^t\,[S\,S^t]\,U \qquad (2.140)$$

$$\Lambda = V^t\,[S^t S]\,V \qquad (2.141)$$

where the diagonal matrix $\Lambda$ of (2.140) has as diagonal elements the eigenvalues $\lambda_k$ of the symmetric matrix $S S^t$, while those of Eq. (2.141) correspond to the nonzero eigenvalues of the symmetric matrix $S^t S$. Note that the columns of U are the eigenvectors of $S S^t$ and the columns of V are the eigenvectors of $S^t S$. Considering that U and V are unitary matrices, Eq. (2.137) is demonstrated. In other words, the SVD transform of an image S reduces to the computation of the eigenvectors of the symmetric matrices $S S^t$ and $S^t S$. The decomposition of a matrix A with the SVD method is a generalized form compared to that based directly on the eigenvalues (Eq. 2.102). In fact, with SVD, the decomposition always exists and the matrix A does not necessarily have to be square. Considering that $\Lambda$ is a diagonal matrix, it is proved by algebra that the matrix-image S of rank r can be decomposed, with the SVD transform, into the sum of a weighted set of matrices of unitary rank. In particular, each matrix is the outer product $\mathbf{u}_k \mathbf{v}_k^t$ of two eigenvectors of size N × 1, respectively components of the kth column of U and V, and that outer product is weighted in the summation by one of the singular values of $\Lambda$. It follows that the decomposition of the matrix-image S, by virtue of Eq. (2.137), can be expressed as follows:

$$S = \sum_{k=1}^{r} \lambda_k\,\mathbf{u}_k\,\mathbf{v}_k^t \qquad (2.142)$$
where r is the rank of S and $\lambda_k$ is the kth singular value of S. Equation (2.142) can be substituted in the SVD transform Eq. (2.136), and it is immediately verified that the diagonal matrix $\Lambda$ containing the singular values of the image matrix S is obtained. In conclusion, Eqs. (2.136) and (2.137) constitute, respectively, the direct and inverse SVD transform. Unlike the other transforms, the SVD requires two kernel matrices U and V that depend on the image matrix S to be decomposed. The problem of the SVD transform is reduced to the computation of the eigenvectors of the matrices $S S^t$ and $S^t S$. From (2.136) it is noted that the SVD transform, reducing itself to the diagonal matrix $\Lambda$, represents the image S through the N singular values $\lambda_k$, thus obtaining a remarkable level of compression without loss of data (lossless compression). In reality, the singular values $\lambda_k$ tend to assume very small values as the index k increases, and these can be ignored by introducing, in this case, an error (loss of information in the reconstruction) in the inverse transform (2.137). This error can be quantified by the sum of the neglected singular values, $\sum_{i=M+1}^{N} \lambda_i$, where M indicates the number of singular values retained, thus giving an exact estimate of the mean square error. Finally, we remark the following. Even if the image S, of size N × N, is represented by at most N nonzero singular values, to reconstruct the image with (2.137) we will need the kernel matrices U and V. It follows that the use of the SVD transform, in the context of image compression and transmission, makes sense only if the kernel matrices U and V are adequately encoded for the reconstruction phase and, possibly, shared among similar image sets. Figure 2.28 shows the results of the SVD transform used to compress the Lena image of size 256 × 256.
Fig. 2.28 Results of the SVD transform applied for image compression. In the first row are shown the images reconstructed with rank r = 1, 5, 10, 20, 30, 40; in the second row the rank increases from 50 up to the maximum value N in the last image, which is completely rebuilt without loss of information. For each reconstructed image the resulting RMS error and the percentage of information lost are also reported
Fig. 2.29 Graph of the singular values calculated for the Lena image of size 256 × 256. Note that from component 30 the singular values become almost constant, then rapidly approach zero
In the first row are shown the images reconstructed with rank r = 1, 5, 10, 20, 30, 40, reporting for each rebuilt image (obtained with Eq. 2.142) the resulting RMS error and the loss of information (in percentage). As expected, as the rank r, i.e., the number of singular values, increases, the original image is recovered with values of rank r > 40, as shown in the second row, where the last image on the right represents the completely rebuilt image. From Fig. 2.29 it is observed how the singular values after the 30th component remain almost constant, then tend quickly to zero. For lossless compression of the data, 2N coefficients are to be saved for each of the r singular values, i.e., the vectors $\mathbf{u}_k$ and $\mathbf{v}_k$, against the $N^2$ pixels of the image. Therefore, the total number of coefficients to be saved is 2Nr if zero loss is desired. Compression with loss of data becomes significant if $2Nr < N^2$, that is, $r < N/2$. Experimentally, an adequate value is found to be $r = N/\sigma$ with $\sigma < 7$, which depends on the type of image and the level of compression to be achieved. From Fig. 2.28 it is observed that already with r = 30 the reconstructed image (penultimate in the first row) presents acceptable distortions.
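A minimal sketch of rank-r reconstruction according to Eq. (2.142) is given below, using NumPy's SVD routine; the random array stands in for a real 256 × 256 grayscale image such as Lena.

```python
# Minimal sketch: SVD-based image compression by rank-r truncation (Eq. 2.142).
import numpy as np

def svd_compress(img, r):
    U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
    approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # sum of the first r rank-1 terms
    rms = np.sqrt(np.mean((img - approx) ** 2))      # reconstruction error
    stored = r * (2 * img.shape[0] + 1)              # ~2Nr coefficients to store
    return approx, rms, stored

img = np.random.rand(256, 256)                       # placeholder for a real 256x256 image
approx, rms, stored = svd_compress(img, r=30)
print(f"rank 30: RMS={rms:.4f}, coefficients stored={stored} vs {img.size} pixels")
```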
2.12 Wavelet Transform

In the previously studied transforms we have seen that the informative content of a signal or an image is represented in the domain of the transforms in terms, respectively, of temporal or spatial frequency. This domain is called the frequency spectrum. In this domain, a signal or image is decomposed (represented) by a finite set of basis functions. For example, in the Fourier transform the orthonormal basis functions are a set of sinusoidal waves, characterized by their propagation frequency. These basis functions have nonzero value over the entire, infinitely extended domain. In reality, a signal cannot always be taken as stationary, and an image can present abrupt spatial variations.
Fig. 2.30 The Gabor window function (a) and the time-frequency diagram associated with the STFT (b). c Time-frequency diagram tessellation: partitioned into rectangles with constant area but different width-height controlled by wavelet functions; toward low frequencies, low rectangles but wide bases (which means high resolution in frequency and low resolution over time), instead toward high frequencies, high rectangles but narrow bases (i.e., low frequency resolution but high resolution over time)
In fact, over time a signal can be modified due to physical phenomena, presenting an impulsive variation, while an image can vary spatially due to the presence of edges, which are not always well represented by the basis functions described so far, i.e., by the coefficients of the transform. This strongly restricts the use of the Fourier transform and of the other transforms considered when one wants to represent in their domain, exhaustively, the informative content of a signal or an image and, at the same time, to characterize well their local variations (structures like peaks, edges, contours, points, etc.). In other words, passing from the time domain (for a signal) or from the spatial domain (for an image) to the frequency analysis, the temporal or spatial information, respectively, is lost. It is therefore necessary to introduce basis functions that can represent the informative content of the images not only in terms of frequencies (i.e., as global information), but also preserving the spatial (or temporal) location of the points where these frequencies occur (i.e., maintaining local information on where such variations occur). The basis functions that present this characteristic, that is, that are localized in space as well as in frequency, can be sinusoidal waves of limited duration. In 1946 Gabor [20] introduced the STFT (Short-Time Fourier Transform), considering only a portion of the 1D signal through a time window, thus generating a 2D function $F_G(\omega, \tau)$ in terms of time τ and frequency ω (see Fig. 2.30a):

$$F_G(\omega, \tau) = \int_{-\infty}^{\infty} f(t)\,w(t - \tau)\,e^{-j\omega t}\,dt$$
where w(t) is the time window, represented by a rectangular function whose width is constant. The STFT is seen as a compromise in combining time and frequency information, but its effectiveness depends on the width of the rectangular function which, unfortunately, once defined remains constant and applies to all frequencies. Basically, one would need time windows with adaptive characteristics, that is, long time intervals where it is necessary to analyze the low frequencies (presence of little variability) and short time intervals where it is necessary to analyze the high frequencies (presence of strong variability).
Fig. 2.31 Waves and examples of wavelets: sine and cosine waves, Morlet wavelet, and Daubechies wavelet
The solution to this need for analysis is given by the wavelet functions which, as the name itself suggests (small wave), are nonzero only in limited time windows. Figure 2.31 graphically shows the sinusoidal waves described so far, defined over the interval [−∞, +∞], and typical wavelet functions, which have the characteristic of being adaptive: they assume nonzero values over short temporal (or spatial) intervals where it is necessary to analyze the details of the signal (at high frequencies), and over long intervals where the signal has little variability (at low frequencies). In Fig. 2.30c it can be observed that the different frequencies of a signal are analyzed in relation to a time (or spatial) interval with different resolutions. If we consider the space-frequency domain for the wavelets, it is observed how, toward the high frequencies, where high spatial resolution is needed, the rectangles (representing the different values of the wavelet transform) are narrower (while they grow in height, as this does not affect the frequency resolution), precisely to have a more accurate localization in the signal analysis. Conversely, at low frequencies one has a good frequency resolution and a low spatial resolution. For image processing, this is equivalent to enhancing the local characteristics, as already done with filtering and convolution algorithms. The wavelet transform is, therefore, characterized by basis functions that translate (shift) and expand (dilation), characterizing the information content of the image (or signal) in terms of frequency and local information, respectively. In other words, wavelets capture the information content of an image in an optimal way by analyzing the structures present at different scales (spatial resolutions), overcoming the limits of the sine and cosine basis functions. An example is given when we want to characterize a point structure present in the image (due, for example, to noise), normally modeled as a sum of infinite basis functions: considering its local narrowness, it is difficult to represent it with the previous transforms, while with wavelets it is easily modeled thanks to their local nature. The wavelet transform thus leads to a more compact representation of the local discontinuous structures (edges, contours, points, local complex geometric structures, etc.) present in an image.
2.12.1 Continuous Wavelet Transforms—CWT

The basis functions of the continuous wavelet transform are given by the prototypical wavelet function ψ(x), also called the mother wavelet function, which generates a
family of basis functions obtained by scaling (with the factor a) and translating (with the parameter b) the mother wavelet:

$$\psi_{a,b}(x) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{x - b}{a}\right) \qquad (2.143)$$

The continuous wavelet transform (CWT) of a function f(x) is then defined as the inner product of f(x) with each of these basis functions:

$$F_{CWT}(a,b) = \int_{-\infty}^{\infty} f(x)\,\psi_{a,b}(x)\,dx \qquad (2.144)$$

Fig. 2.32 The mother wavelet ψ(x) at scale a = 1 and a scaled version ψ(2x)

Figure 2.32 shows how a mother wavelet, as a varies, becomes flexible, expanding with a > 1 in the zones of low frequencies or contracting with a < 1 in the high-frequency zones of the input function, and how the variable b controls the translation along the x axis. In other words, a greater dilation means that a larger zone of the input function is compared with the wavelet and, consequently, a more approximate result is obtained in extracting the details of f(x). Conversely, a contraction of the wavelet involves an analysis over a narrower zone, with a consequently greater precision in extracting the details of f(x). The domain of the wavelet transform $F_{CWT}(a,b)$ is normally represented graphically by the coefficients, indicating on the abscissa axis the spatial (or temporal) dimension x and, on the ordinate axis, the scale factor of each coefficient $F_{CWT}(a,b)$. Obviously, the coefficients contain the information of the input function f(x) captured by the continuous wavelet transform, and their variability is closely linked to
the analysis process that extracts the local characteristics of f(x). Normally, the domain of the continuous wavelet transform has a highly redundant information content. As with the other transforms described above, also for the CWT a synthesis process is defined, i.e., the function f(x) can be rebuilt, starting from the coefficients, with the inverse continuous wavelet transform (ICWT) given by Grossman and Morlet [21]:

$$f(x) = \frac{1}{C_\psi}\int_0^{\infty}\!\!\int_{-\infty}^{\infty}\frac{1}{a^2}\,F_{CWT}(a,b)\,\psi_{a,b}(x)\,db\,da \qquad (2.145)$$

where the constant $C_\psi$ guarantees the inversion of the CWT transform if the following admissibility condition is satisfied:

$$C_\psi = \int_{-\infty}^{\infty}\frac{|\Psi(u)|^2}{|u|}\,du < \infty \qquad (2.146)$$

where $\Psi(u)$ is the Fourier transform of the real-valued wavelet function ψ(x). This condition imposes that the chosen wavelet function must be square integrable and that its Fourier transform must be zero at zero frequency, that is, $\Psi(0) = 0$ and, consequently, the mean of the mother wavelet ψ(x) must be equal to zero:

$$\Psi(0) = \int_{-\infty}^{\infty}\psi(x)\,dx = 0 \qquad (2.147)$$
This admissibility condition is equivalent to having the wavelet spectrum similar to a typical transfer function of a band-pass filter. Finally, the value of the Cψ constant also depends on the chosen mother wavelet.
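As an illustration, the following minimal sketch checks numerically the zero-mean condition (2.147) and approximates the admissibility constant (2.146) for the Mexican-hat (Ricker) wavelet, chosen here only as an example of an admissible ψ(x); the FFT-based approximation of Ψ(u) is an assumption of this sketch, not a prescription of the text.

```python
# Minimal numerical check of wavelet admissibility for the Mexican-hat wavelet.
import numpy as np

x = np.linspace(-20, 20, 4096)
dx = x[1] - x[0]
psi = (1 - x**2) * np.exp(-x**2 / 2)            # Mexican-hat mother wavelet (zero mean)

print("mean of psi:", np.sum(psi) * dx)          # ~0, as required by Eq. (2.147)

Psi = np.fft.fft(psi) * dx                       # discrete approximation of the Fourier transform
u = np.fft.fftfreq(x.size, d=dx)                 # frequencies associated with the FFT samples
du = abs(u[1] - u[0])
mask = u != 0                                    # exclude u = 0 (Psi(0) is ~0 anyway)
C_psi = np.sum(np.abs(Psi[mask]) ** 2 / np.abs(u[mask])) * du
print("admissibility constant C_psi ~", C_psi)   # finite, so the inverse CWT (2.145) exists
```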
2.12.2 Continuous Wavelet Transform—2D CWT

The extension of Eq. (2.144) to the two-dimensional case is immediate. Given a 2D function f(x, y), the 2D CWT transform results:

$$F_{CWT}(a, b_x, b_y) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(x,y)\,\psi_{a,b_x,b_y}(x,y)\,dx\,dy \qquad (2.148)$$

where $b_x$ and $b_y$ indicate the translation along the two coordinate axes x and y, respectively. The 3D dimensionality of the transform is observed, with a consequent further increase in the information content. The inverse transform 2D ICWT is given by

$$f(x,y) = \frac{1}{C_\psi}\int_0^{\infty}\!\!\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\frac{1}{a^3}\,F_{CWT}(a,b_x,b_y)\,\psi_{a,b_x,b_y}(x,y)\,db_x\,db_y\,da \qquad (2.149)$$

where the constant $C_\psi$ remains defined by (2.146), while the wavelet function is two-dimensional and defined by

$$\psi_{a,b_x,b_y}(x,y) = \frac{1}{|a|}\,\psi\!\left(\frac{x - b_x}{a}, \frac{y - b_y}{a}\right) \qquad (2.150)$$
Fig. 2.33 Functional diagram of the CWT transform
2.12.3 Wavelet Transform as Band-pass Filtering

In Fig. 2.33 the functional scheme of the CWT is represented graphically. It is noted that the calculation of the transform, once a wavelet defined with (2.143) has been chosen, consists in evaluating the similarity between the wavelet function (the first time it is placed at the beginning of the signal, with scale factor 1) and the involved part of the input signal, thus obtaining the value of the coefficient $F_{CWT}(a,b)$. By varying b, the wavelet is translated to calculate a new coefficient, and the process continues until the whole signal has been analyzed. This process is then repeated, modifying the wavelet with an appropriate scale factor a, for all the desired scale factor values. This functional scheme recalls that of the convolution aimed at filtering a signal. In particular, this process can be seen as a filtering carried out by a bank of band-pass filters, represented by the scaled wavelet functions, applied to the input signal. In fact, we can show that the CWT transform can be traced back to a convolution operation (a linear transformation) between the input signal f(x) and a wavelet, once a scale factor a is defined. The CWT transform, Eq. (2.144), is rewritten in the following form:

$$F_{CWT}(a,b) = \int_{-\infty}^{\infty} f(x)\,\tilde{\psi}_a(b - x)\,dx = (f * \tilde{\psi}_a)(b) \qquad (2.151)$$

indicating that the CWT transform is obtained from the convolution between the input function f(x) and the impulse response $\tilde{\psi}_a$, which is a conjugated and mirrored version of the wavelet derived from (2.143) and defined as follows:

$$\tilde{\psi}_a(x) = \psi_a^{*}(-x) = \frac{1}{\sqrt{a}}\,\psi^{*}\!\left(-\frac{x}{a}\right) \qquad (2.152)$$

If the wavelet is a real and even function, as happens in real applications, the conjugation and mirroring have no effect. In this context, the convolution, i.e., the filtering procedure, is applied to the function f(x) for each wavelet characterized by the scale factor a.
Fig. 2.34 The wavelet transform implemented as a filtering operation with a filter bank based on wavelet functions
In other words, the CWT transform is realized by convolutions with a bank of band-pass filters consisting of the various scaled wavelets. Figure 2.34 shows the entire CWT process based on the convolutions of the input function f(x) with n wavelets, each scaled with $a_i$, $i = 1, \dots, n$. At the output of the convolutions there are n CWT transforms, indicated with $F_{CWT}(a_i, x)$, $i = 1, \dots, n$, which are in fact n filterings of f(x) at different scales. The synthesis procedure, in accordance with (2.145), is obtained from

$$f(x) = \frac{1}{C_\psi}\int_0^{\infty}\!\!\int_{-\infty}^{\infty}\frac{1}{a^2}\,[f * \tilde{\psi}_a](b)\;\psi_a(b - x)\,db\,da = \frac{1}{C_\psi}\int_0^{\infty}\frac{1}{a^2}\,[f * \tilde{\psi}_a * \psi_a](x)\,da \qquad (2.153)$$
which highlights the reconstruction of f(x) starting from one of the transforms $F_{CWT}(a_i, b)$. The extension of this filter-bank procedure to the 2D case is immediate. In this case, the impulse responses of the filters are two-dimensional wavelets of the type $\psi_a(x,y)$, and the input is an image f(x,y). The filtered (band-pass) versions of the image represent the 2D CWT transform. It is reiterated that this approach generates a significant amount of redundant information, particularly for the 2D CWT transform, for which each output of the transform contains 3D data $F_{CWT}(a, b_x, b_y)$ for each scale value a. The original image can be rebuilt by considering only one of the filtered output images, under the condition that the transfer function $\Psi(u,v)$ is nonzero everywhere except at the origin. At this point, it can be highlighted that the peculiarity of the CWT transform does not consist in the ability to compress the information, but to analyze and decompose it.
2.12.4 Discrete Wavelet Transform—DWT

For the CWT, the continuity of the transform is given by the variables (a, b), scale factor and translation, expressed as real numbers. The large volume of data produced by the CWT was also highlighted, corresponding to all the scale and translation values, thus yielding a lot of redundant information. To reduce this redundancy it is convenient to express the scale and translation factors as integers, discretizing them, as often happens in signal and image processing applications. Hence arises the need to formalize the theory of wavelets from the continuous to the discrete case. This is possible by defining the wavelet functions with the two discrete variables of scale and translation, thus obtaining the decomposition in wavelet series.
With the wavelet theory of the spatial (or temporal) analysis of functions, it is shown that the series expansion and the reconstruction of the functions can be performed with multi-resolution methods using discrete filters. In this context, the basic wavelet functions ψ(x) are defined by choosing scaling and translation factors based on powers of 2. This formalism makes the analysis very efficient, without losing precision, thus allowing the complete reconstruction of the source data and greatly reducing the redundancy of the information. A set of functions $\psi_{s,l}(x)$, obtained with integer variables of scale and translation expressed in dyadic form (i.e., in powers of 2), constitutes the scaled and translated wavelet basis if defined as follows:

$$\psi_{s,l}(x) = 2^{s/2}\,\psi(2^s x - l) \qquad (2.154)$$

where s and l are positive integers indicating, respectively, the scale factor (dilation or contraction, in the form $2^s$) and the spatial position index (translation factor, in the form $2^s l$) with respect to the original wavelet ψ(x). The orthonormality conditions of such basis functions $\{\psi_{s,l}(x)\}$ are satisfied if

$$\int_{-\infty}^{+\infty}\psi_{s_1,l_1}(x)\,\psi_{s_2,l_2}(x)\,dx = 0 \qquad \text{with } s_1 \neq s_2 \text{ or } l_1 \neq l_2 \qquad (2.155)$$
By appropriately choosing the wavelet basis functions, it is possible to completely span the space $L^2(\mathbb{R})$, i.e., the set of real measurable and square-integrable functions. It follows that the reconstruction formula is given by

$$f(x) = \sum_{s=-\infty}^{+\infty}\sum_{l=-\infty}^{+\infty} F_{DWT}(s,l)\,\psi_{s,l}(x) \qquad (2.156)$$

that is, other functions can be derived as a linear combination of the basis functions $\psi_{s,l}(x)$ weighted by the transform coefficients $F_{DWT}(s,l)$, given by the inner product

$$F_{DWT}(s,l) = 2^{s/2}\int_{-\infty}^{+\infty} f(x)\,\psi(2^s x - l)\,dx \qquad (2.157)$$

The last two equations give the wavelet series expansion of the function f(x), defined with the wavelet functions ψ(x), which result in the basis functions of the transform if the expansion is unique. It is highlighted that the input function f(x) is still continuous. The Haar transform, previously introduced in Sect. 2.9.4, is the simplest orthonormal wavelet transform, defined by two odd rectangular pulses. The wavelet basis is progressively scaled down by a factor equal to a power of 2. Each smaller wavelet is then translated in increments equal to its width. As the wavelet basis is reduced by a factor equal to a power of 2, its amplitude is amplified by a factor of $\sqrt{2}$ to maintain the condition of orthonormality. With the Haar basis functions, Eqs. (2.154) and (2.156) are considered valid only in the interval [0, 1] (including the function f(x)), with a value of zero outside this range. In this case, the set of orthonormal basis functions is

$$\psi_n(x) = 2^{s/2}\,\psi(2^s x - l) \qquad (2.158)$$
defined with the index n alone, while the indices s and l are functions of n as follows:

$$n = 2^s + l \qquad \text{with } s = 0, 1, \dots \quad l = 0, 1, \dots, 2^s - 1 \qquad (2.159)$$

For any value of n, the largest value of s must be such that $2^s \le n$, and $l = n - 2^s$. If the spatial variable x is also discrete, this implies a sampling of the input function $f(j\Delta x)$ (where $\Delta x$ is the sampling interval defined according to the Nyquist-Shannon theorem, described in Sect. 9.4 Vol. I), and the discrete wavelet transform of a function f(x) is given by

$$F_{DWT}(n) = \langle f(j\Delta x),\ \psi_n(j\Delta x)\rangle = 2^{s/2}\sum_j f(j\Delta x)\,\psi(2^s j\Delta x - l) \qquad (2.160)$$

while the inverse discrete wavelet transform (IDWT) results

$$f(j\Delta x) = \sum_n F_{DWT}(n)\,\psi_n(j\Delta x) \qquad (2.161)$$
It can be shown that it is possible to implement the wavelet transform using the Fourier transform and filtering operations with low-pass and high-pass filters. The theory provides appropriate filters (mirror filters) such that, after the direct transform, it is possible to reconstruct the original signal or image completely. From the mirror filters it is possible to derive a so-called scaling function, from which the basis wavelet function is derivable, and therefore the set of orthonormal basis functions (2.154).
2.12.5 Fast Wavelet Transform—FWT

There is a fast version of the DWT, based on the Mallat algorithm [22] and called Fast Wavelet Transform (FWT), which uses the sub-band coding approach to decomposition. In Fig. 2.35a is shown the dyadic decomposition scheme, based on sub-band coding, proposed by Mallat. The discrete input signal $f(j\Delta x)$ enters the analysis filter bank and is simultaneously filtered by the $h_L(j\Delta x)$ and $h_H(j\Delta x)$ filters, respectively a low-pass and a high-pass filter. The input signal, after filtering, is separated in frequency into bands of equal width (see Fig. 2.35b). In particular, each output contains half of the frequency content and an equal number of samples of the original signal. It is noted, however, that the two filtering outputs, on the one
Fig. 2.35 Single-level decomposition and reconstruction of the FWT: a Functional diagram of the analysis and synthesis phase; b Bands of low and high frequencies associated with the first level of decomposition of the input signal
hand, together contain all the frequencies of the original signal but duplicate the number of samples. The ↓ symbol at each output of the analysis filter bank indicates that the corresponding output signals are subsampled by a factor of 2 (downsampling), thus halving the filtered data. This subsampling is motivated by the need to avoid sample redundancy. Operating in the discrete domain, subsampling occurs by discarding the samples with odd index. The reconstruction of the original signal is possible using a filter bank, in the synthesis phase, with characteristics inverse to those of the analysis filters. In the reconstruction phase, before the synthesis filtering, the signal is oversampled by the same factor by which it had previously been subsampled. For the complete reconstruction of the signal, before the oversampling, shown in the figure with the ↑ symbol, it is necessary to insert zeros at the odd indexes previously eliminated after the analysis filtering. In this way, the reconstruction of the signal is realized correctly, without loss of data, by adequately choosing the analysis and synthesis filters with symmetric and biorthogonal characteristics, and by operating a subsampling according to the Shannon theorem.³ The subsampling in this case is done by eliminating only half of the filtered data and then restoring the number of samples. The original signal is obtained by adding the output data of the synthesis filters. In this first level of decomposition of the original signal, with the two low-pass and high-pass filterings, $g_1^L(2j\Delta x)$ and $g_1^H(2j\Delta x)$ contain, respectively, the approximate global information (that is, the low-frequency components) and the information of great variability of the signal, that is, the details (associated with the high-frequency components). The process described is the decomposition with a single level. Multi-resolution analysis (MRA) [22,23], with multilevel decomposition, involves the implementation of the DWT with the iterative application of the two low-pass and high-pass analysis filters in the mode described above. In essence, the signal is analyzed at several levels, with different frequencies at different resolutions. Figure 2.36 shows the process with a filter bank for 3-level decomposition and reconstruction. As can be seen, after the first level of decomposition, the double filtering is repeated at the second level, this time by decomposing only the output of the low-pass filtering $g_1^L(2j\Delta x)$ of the first level, while the high-pass output $g_1^H(2j\Delta x)$ is left intact, assuming that it already contains the maximum details of the original signal. It is evident that $g_1^L(2j\Delta x)$ contains the lower half of the frequencies and that, with the subsampling, the frequency resolution has doubled while the spatial resolution has been reduced by half, with the presence of half of the samples. Recall that, in this approach, the characteristics of the actual scaling are realized with the subsampling and oversampling of the filtered signals. With the second level of decomposition, the output of the analysis filters (low-pass and high-pass) is again subsampled by a factor of two, obtaining the approximation
3 By virtue of the Sampling Theorem, a low-pass filtered signal can be represented by half of its samples, since its maximum bandwidth has been halved.
Fig. 2.36 Decomposition in 3 levels with low-pass and high-pass filter bank and original signal reconstruction of the FWT transform
samples $g_2^L(4j\Delta x)$ and the detail samples $g_2^H(4j\Delta x)$, respectively, reduced by a factor of 4 compared to the original samples. The decomposition process can be iterated up to a maximum level of $n = \log_2 N$, where N is the number of samples of the original signal. At each level, the output of the high-pass filter $g_i^H(2^i j\Delta x)$ represents the half of the frequencies higher than those contained in the low-pass filtering output of the previous level. It follows that the high-pass filter, in fact, behaves like a band-pass filter. The reconstruction of the signal takes place, as previously described for the first level of decomposition, through the process inverse to the analysis, called the synthesis process. Starting from the last level, the detail sequences $g_i^H(2^i j\Delta x)$ and the approximation sequences $g_i^L(2^i j\Delta x)$ are oversampled, and the synthesis filtering is realized simultaneously with low-pass and high-pass filters that guarantee the reconstruction of the signal. The synthesis process is iterated by adding all the detail outputs $\sum_{i=1}^{n} g_i^H(2^i j\Delta x)$ to the last approximation output $g_n^L(2^n j\Delta x)$, thus obtaining the complete reconstructed signal (see Fig. 2.36). With reference to Fig. 2.35b, at the second decomposition level the low-frequency band is further divided in half at the nominal frequency of π/4, thus obtaining 2 sub-bands, one of low frequencies and one of high frequencies. At the third level (and at any subsequent one) the same happens, dividing again into ever-narrower sub-bands on the low-frequency side. The decomposition level can be arbitrary and depends on the resolution to be achieved and the computational complexity one is willing to accept. This multi-resolution approach, based on the iteration of analysis and synthesis filterings (the latter implemented with convolutions), has the same effect as wavelet functions at different scales. In other words, the multi-resolution method, with sub-band decomposition, obtains the same results as the discrete wavelet transform DWT. In the reconstruction phase, the coefficients can be manipulated according to the type of application (noise removal, compression, …). For example, by introducing thresholds, some coefficients can be eliminated or their value reduced in relation to the desired result. The block-wise transforms, described in the previous paragraphs, are a special case of sub-band decompositions.
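A minimal sketch of the Mallat analysis/synthesis scheme of Fig. 2.36 is shown below, restricted to the Haar filters of Eq. (2.178); it assumes a signal whose length is a power of 2 and is meant only to illustrate the filter-and-downsample / upsample-and-filter mechanism, not the general FWT with arbitrary biorthogonal filters.

```python
# Minimal sketch: multilevel 1D FWT (analysis and synthesis) with Haar filters.
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2)          # low-pass  h_L
G = np.array([1.0, -1.0]) / np.sqrt(2)         # high-pass h_H

def analyze(f, levels):
    """Iteratively filter and downsample: returns [approx_n, detail_n, ..., detail_1]."""
    details = []
    approx = np.asarray(f, dtype=float)
    for _ in range(levels):
        low = np.convolve(approx, H)[1::2]     # filter, then keep every other sample
        high = np.convolve(approx, G)[1::2]
        details.append(high)
        approx = low
    return [approx] + details[::-1]

def synthesize(coeffs):
    """Upsample, filter with the time-reversed filters and sum, level by level."""
    approx = coeffs[0]
    for high in coeffs[1:]:
        up_l = np.zeros(2 * len(approx))
        up_l[1::2] = approx
        up_h = np.zeros(2 * len(high))
        up_h[1::2] = high
        approx = np.convolve(up_l, H[::-1])[1:] + np.convolve(up_h, G[::-1])[1:]
    return approx

f = np.random.rand(64)
rec = synthesize(analyze(f, levels=3))
print(np.allclose(f, rec))                     # True: perfect reconstruction with Haar filters
```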
2.12.6 Discrete Wavelet Transform 2D—DWT2

The two-dimensional discrete wavelet transform (DWT2) [22,23] is obtained in a similar way to the transforms studied previously, taking into consideration separable wavelet basis functions, i.e., products of the one-dimensional functions ψ(x) and ψ(y). The direct DWT2 decomposes the input image into four sub-images (subsampled images), as shown in Fig. 2.37. We indicate with $I_s(x,y)$ the image at the sth scale, of dimensions N × N with N a power of 2. With s = 0 the scale is given by $2^s = 2^0 = 1$, corresponding to the original image $I_0(x,y)$. Each larger integer value of s doubles the scale and halves the resolution of the image. To index the subsampled image, the resolution can also be used; in this case, s is less than or equal to zero and its sign is inverted in the previous equations. The result of each subsampled image is obtained by filtering the image $I_s(x,y)$ at the current scale s with the three wavelet basis functions given by

$$\psi^1(x,y) = \phi(x)\,\psi(y) \qquad \psi^2(x,y) = \phi(y)\,\psi(x) \qquad \psi^3(x,y) = \psi(x)\,\psi(y) \qquad (2.162)$$

where φ indicates the scaling function (also called the refinable function). The two-dimensional basis functions are expressed as products of one-dimensional functions, considering that the two-dimensional scaling function φ(x, y) is separable, i.e., φ(x, y) = φ(x)φ(y). It is shown [2,24] that the scaling function φ(x) satisfies the following equation:

$$\phi(x) = \sum_l h_0(l)\,\phi(2x - l) \qquad (2.163)$$

also called the refinement equation with respect to the mask $h_0$. The scaling function φ(x) is thus derivable from copies of itself at half scale, weighted by $h_0(l)$, which is the impulse response of the low-pass filter. Knowing both $h_0(l)$ and φ(x), we can define the high-pass filter $h_1(l)$ as the QMF (Quadrature Mirror Filter)⁴ of the low-pass filter $h_0$, given by

$$h_1(l) = (-1)^l\,h_0(-l + 1) \qquad (2.164)$$

from which it is possible to derive the following wavelet basis function:

$$\psi(x) = \sum_l h_1(l)\,\phi(2x - l) \qquad (2.165)$$

From (2.165) it is possible to derive the orthonormal wavelet functions given by (2.154). In the case of the 2D DWT we are interested in the set of orthonormal basis functions built from (2.162), which are

$$\psi^t_{s,m,n}(x,y) = 2^s\,\psi^t(x - 2^s m,\; y - 2^s n) \qquad (2.166)$$
4 A QMF filter allows splitting a signal into two subsampled signals that can then be rebuilt without aliasing. A pair of QMF filters is used as a filter bank to split an input signal into two bands. The resulting low-pass and high-pass signals, derived from the original signal, are normally reduced by a factor of 2.
where s ≥ 0, the superscript t = 1, 2, 3 indicates the basis functions (2.162), and m, n are positive integers that indicate the discrete coordinates of the image. With the basis functions (2.166), the image can be expressed in terms of the 2D DWT. Each of the four sub-images is obtained as an inner product with one of the wavelet basis images, followed by a subsampling by a factor of two in the horizontal and vertical directions. At level s = 1 the subsampled images at the scale $2^s = 2^1 = 2$ are obtained, which are denoted by $I_2^t(m,n)$ and are given by

$$I_2^0(m,n) = \langle I_1(x,y),\ \phi(x - 2m,\ y - 2n)\rangle$$
$$I_2^1(m,n) = \langle I_1(x,y),\ \psi^1(x - 2m,\ y - 2n)\rangle$$
$$I_2^2(m,n) = \langle I_1(x,y),\ \psi^2(x - 2m,\ y - 2n)\rangle \qquad (2.167)$$
$$I_2^3(m,n) = \langle I_1(x,y),\ \psi^3(x - 2m,\ y - 2n)\rangle$$

In the subsequent levels, with s > 1, the approximation image $I^0_{2^s}(x,y)$ is further decomposed into four smaller sub-images of scale $2^{s+1}$. The DWT2 is normally implemented as one-dimensional convolutions in the spatial domain, thanks to the separability of the functions φ and ψ. It follows that (2.167), which represents the first stage of the DWT2, can be expressed both in terms of inner products and as convolutions between the image $I^0_{2^s}(x,y)$ and the functions φ and ψ, thus having the following relationships at the sth stage:

$$I^0_{2^{s+1}}(m,n) = [I^0_{2^s}(x,y) * \phi(-x,-y)](2m, 2n) \qquad (2.168)$$
$$I^1_{2^{s+1}}(m,n) = [I^0_{2^s}(x,y) * \psi^1(-x,-y)](2m, 2n) \qquad (2.169)$$
$$I^2_{2^{s+1}}(m,n) = [I^0_{2^s}(x,y) * \psi^2(-x,-y)](2m, 2n) \qquad (2.170)$$
$$I^3_{2^{s+1}}(m,n) = [I^0_{2^s}(x,y) * \psi^3(-x,-y)](2m, 2n) \qquad (2.171)$$
Remember that the symbol "∗" indicates the convolution operator. Equation (2.168) performs a low-pass filtering that enhances the low frequencies in the image; the other three convolutions perform a high-pass filtering, exalting the high frequencies, that is, highlighting the details present in the image, respectively those in the vertical, horizontal, and oblique directions. Figure 2.37 shows a graphical representation of the filtering using one-dimensional convolutions in the first decomposition level of the DWT2. It operates by rows and columns on the image $I^t_{2^s}(x,y)$, at the different scale levels $2^s$, with the scaling function φ(x, y) and with the wavelet basis functions $\psi^t(x,y)$, both separable. In the first level, corresponding to the scale $2^s = 2^0 = 1$, we have the 1D convolution of the rows of the original image $I_{2^s}(x,y) = I_1(x,y)$ with the impulse responses $h_0(-x)$ and $h_1(-x)$, respectively of the low-pass filter and the high-pass filter (for simplicity indicated with $h_L$ and $h_H$ in the figure). The intermediate filtering results, related to the convolution operations, are two image matrices of N × N dimensions, from which the odd-numbered columns are discarded, resulting in N × N/2 dimensions (indicated with $I_L$ and $I_H$ in the figure). The columns of the two images thus obtained are, in turn, convolved with the low-pass filter $h_L$ and the high-pass filter $h_H$, respectively.
Fig. 2.37 First level of decomposition of the 2D DWT transform applied for images. Together with the decomposition diagram, the results of the first level filtering are visible in the four quadrants in which the original image (reduced by a factor of 4) has been decomposed and represented with I L L the mean value or approximation, with I L H the details in the horizontal direction, with I H L the vertical ones, and with I H H the diagonal details
Fig. 2.38 The 2D DWT transform applied for the decomposition of an image up to the third level
From the result of these convolutions the odd rows are discarded, thus obtaining the result of the first level of decomposition, consisting of four sub-images of size N/2 × N/2. In the figure, these sub-images are indicated with $I_{LL}$, representing the average value or approximation coefficients of the original image, while the others represent the detail coefficients: in particular, $I_{LH}$ the details in the horizontal direction, $I_{HL}$ the vertical ones, and $I_{HH}$ the diagonal details. In analogy with the 1D context, the decomposition process is iterated, always considering, of the 4 sub-images, only the approximation $I_{LL}$, up to the desired level or to the maximum $s \le \log_2(N)$ (i.e., to the scale $2^s$), where N × N is the size of the input image I(x, y). The results of the decomposition at the second and third levels are shown in Fig. 2.38. This result is similar to that obtained with the Haar transform (see Sect. 2.9.4, Fig. 2.11). If the DWT2 is calculated with good numerical resolution, for example with real numbers, by executing the inverse transform IDWT the original image can be rebuilt, possibly with some degradation. Figure 2.39 graphically shows how the inverse process of the DWT2 works, which is conceptually analogous to the 1D context.
Fig. 2.39 Reconstruction of the image starting from the first decomposition level (as in Fig. 2.37) with the 2D DWT transform
Starting from the last reached level of decomposition (the figure shows the reconstruction relative to the decomposition of Fig. 2.37), each sub-image $I_{XX}$ of approximation or detail is oversampled by a factor of 2, adding a column of zeros to the left of each column; then follows the convolution of each row with the symmetric low-pass filter $h_L$ and high-pass filter $h_H$. The sub-images resulting from the convolutions are first summed and then oversampled, adding a row of zeros for each of the N/2 rows, thus obtaining images of size N × N. The columns of these two matrices are convolved with $h_L$ and with $h_H$, and the two images that result from the convolutions are summed to reconstruct the original image I(x, y). The DWT2 is implemented using different filters; Haar and Daubechies [24] filters are commonly used. The latter constitute a family of orthonormal wavelets with compact support. For a given integer r, the orthonormal wavelet family of Daubechies ${}_r\psi(x)$ is given by

$${}_r\psi_{s,l}(x) = 2^{s/2}\,{}_r\psi(2^s x - l) \qquad (2.172)$$

with s and l integers. ${}_r\psi(x)$ is zero for values of x outside the interval [0, 2r − 1]. An example of an orthonormal Daubechies wavelet family is obtained with the following high-pass filter:

$$h_1(l) = 4\sqrt{2}\,h_0(l) \qquad (2.173)$$

where $h_0(l)$ is the ideal low-pass filter (sinc function). From (2.173) it is possible to derive four nonzero elements, corresponding to l = 0, 1, 2, 3, which are

$$1 + \sqrt{3} \qquad 3 + \sqrt{3} \qquad 3 - \sqrt{3} \qquad 1 - \sqrt{3} \qquad (2.174)$$

For the convolutions, the following 1D masks must be used:

(a) Low-pass filter
$$\frac{1}{4\sqrt{2}}\,[\,1 + \sqrt{3}\quad 3 + \sqrt{3}\quad 3 - \sqrt{3}\quad 1 - \sqrt{3}\,] \qquad (2.175)$$

(b) High-pass filter
$$\frac{1}{4\sqrt{2}}\,[\,1 - \sqrt{3}\quad 3 - \sqrt{3}\quad 3 + \sqrt{3}\quad -1 - \sqrt{3}\,] \qquad (2.176)$$
It is observed that for r = 1 from the Daubechies wavelet family we obtain the wavelets ${}_1\psi(x)$, which correspond to those of the Haar transform, obtained with

$$h_0(l) = \begin{cases} \frac{1}{\sqrt{2}} & l = 0, 1 \\ 0 & \text{otherwise} \end{cases} \qquad h_1(l) = \begin{cases} \frac{1}{\sqrt{2}} & l = 0 \\ -\frac{1}{\sqrt{2}} & l = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.177)$$

The Haar wavelet masks are as follows:

$$\text{Low-pass filter:}\quad \frac{1}{\sqrt{2}}\,[\,1 \quad 1\,] \qquad \text{High-pass filter:}\quad \frac{1}{\sqrt{2}}\,[\,1 \quad -1\,] \qquad (2.178)$$

The dimensions of the filters may be larger; for example, to apply a DWT2 to an image by decomposing it into 4 × 4 blocks, using Haar's wavelet bases, two zero elements can be added to the masks (2.178), which become

$$\text{Low-pass filter:}\quad \frac{1}{\sqrt{2}}\,[\,1 \quad 1 \quad 0 \quad 0\,] \qquad \text{High-pass filter:}\quad \frac{1}{\sqrt{2}}\,[\,1 \quad -1 \quad 0 \quad 0\,] \qquad (2.179)$$

Similarly, for the Daubechies wavelet families, masks with 8 elements can be created by adding 4 elements to the masks (2.175) and (2.176), in order to decompose the DWT2 into 8 × 8 blocks. The inverse DWT2 is implemented using the inverse filters, which for Haar are identical to those of the direct DWT2, while for Daubechies the inverse filters are as follows:

$$\text{Low-pass filter:}\quad \frac{1}{4\sqrt{2}}\,[\,3 - \sqrt{3}\quad 3 + \sqrt{3}\quad 1 + \sqrt{3}\quad 1 - \sqrt{3}\,] \qquad \text{High-pass filter:}\quad \frac{1}{4\sqrt{2}}\,[\,1 - \sqrt{3}\quad -1 - \sqrt{3}\quad 3 + \sqrt{3}\quad -3 + \sqrt{3}\,] \qquad (2.180)$$

In summary, Daubechies realized the first wavelet families of scaling functions that are both orthogonal and compact. This property ensures that the number of nonzero coefficients in the associated filter is finite. In Fig. 2.40 some wavelets of the Daubechies family are displayed, with different supports controlled by r. One notices that they are not strictly positive and that their regularity grows slowly with the increase of the support 2r − 1.
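One level of the DWT2 of Fig. 2.37 can be sketched with the Haar masks of Eq. (2.178) as follows; the separable row/column filtering is written directly with array slicing, and the image side is assumed to be even.

```python
# Minimal sketch: single-level 2D Haar DWT2 and its inverse (Fig. 2.37 / Fig. 2.39).
import numpy as np

def haar_dwt2_level(img):
    img = np.asarray(img, dtype=float)
    s = 1.0 / np.sqrt(2)
    # Filter along rows (axis 1) and discard odd columns.
    L = (img[:, 0::2] + img[:, 1::2]) * s        # low-pass  on rows
    H = (img[:, 0::2] - img[:, 1::2]) * s        # high-pass on rows
    # Filter along columns (axis 0) and discard odd rows.
    LL = (L[0::2, :] + L[1::2, :]) * s           # approximation
    LH = (L[0::2, :] - L[1::2, :]) * s           # horizontal details
    HL = (H[0::2, :] + H[1::2, :]) * s           # vertical details
    HH = (H[0::2, :] - H[1::2, :]) * s           # diagonal details
    return LL, LH, HL, HH

def haar_idwt2_level(LL, LH, HL, HH):
    s = 1.0 / np.sqrt(2)
    L = np.empty((2 * LL.shape[0], LL.shape[1]))
    H = np.empty_like(L)
    L[0::2, :], L[1::2, :] = (LL + LH) * s, (LL - LH) * s
    H[0::2, :], H[1::2, :] = (HL + HH) * s, (HL - HH) * s
    img = np.empty((L.shape[0], 2 * L.shape[1]))
    img[:, 0::2], img[:, 1::2] = (L + H) * s, (L - H) * s
    return img

img = np.random.rand(8, 8)
print(np.allclose(img, haar_idwt2_level(*haar_dwt2_level(img))))   # True: exact reconstruction
```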
Fig. 2.40 Mother wavelet functions (in the first row) and the associated scaling functions (in the second row), respectively, of db1 (Haar), db2, db3, db5, and db7
2.12.7 Biorthogonal Wavelet Transform

A family of wavelets (Daubechies) with the property of orthogonality and compact support loses the characteristic of symmetry, with the exception of the first Daubechies wavelet, corresponding to the Haar wavelet basis. This latter property is useful because it guarantees the symmetry of the filter coefficients and consequently the linearity of the transfer function. The biorthogonal wavelet families constitute a large class of basis functions that take advantage of the symmetry property with compact support. The symmetry property can be recovered [23,25,26] without losing the compact support, but giving up orthogonality. For a biorthogonal wavelet, instead of having one scaling function and one wavelet function, two scaling functions $\phi(x), \tilde{\phi}(x)$ and two wavelet functions $\psi(x), \tilde{\psi}(x)$ are used. One wavelet basis is used for the decomposition (analysis phase) and the other for the reconstruction (synthesis phase). Furthermore, the scaling functions $\phi(x), \tilde{\phi}(x)$ and the wavelet functions $\psi(x), \tilde{\psi}(x)$ are dual, and the wavelet families $\psi_{s,l}(x), \tilde{\psi}_{s,l}(x)$ are biorthogonal, that is

$$\langle \psi_{s,l}(x),\ \tilde{\psi}_{m,n}(x)\rangle = \delta_{s,m}\,\delta_{l,n} \qquad (2.181)$$
where δ is the Kronecker delta function of two variables. The coefficients of the biorthogonal wavelet transform for a one-dimensional signal f (x) are obtained from FBW 1 (s, l) =< f (x), ψ˜ s,l (x) >
and FBW 2 (s, l) =< f (x), ψs,l >
while the reconstruction of the signal f (x) is obtained as follows: FBW 1 ψs,l (x) = FBW 2 ψ˜ s,l (x) f (x) = s,l
(2.182)
(2.183)
s,l
It is observed that both wavelets can be used for the analysis, on the condition that the other one is used for the synthesis. The one-dimensional biorthogonal wavelet transform requires two discrete low-pass filters (scale vectors) h 0 (n) and h˜ 0 (n) with the characteristic of being symmetric and the relative transfer function must satisfy the condition H0 (0) = H˜ 0 (0) = 1 and take zero in the interval outside the cutoff frequency (with H uppercase is indicated the Fourier transform). Once the two low-pass filters have been defined, the band-pass filters are then generated, which proves to correspond to the following: h 1 (n) = (−1)n h 0 (1 − n)
h˜ 1 (n) = (−1)n h˜ 0 (1 − n)
(2.184)
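As a minimal numerical illustration of the rule in Eq. (2.184) (not code from the authors), the following Python sketch builds the band-pass filter from a given low-pass filter using the alternating-sign, index-reversed construction; the modulo wrapping of the index 1 − n is an assumption made only to keep the example finite, and the Haar filter is used because its result is easy to verify by hand.

import numpy as np

def highpass_from_lowpass(h0):
    """Band-pass construction h1(n) = (-1)^n * h0(1 - n).
    The index 1 - n is wrapped modulo the filter length (an assumption
    for this finite illustration)."""
    h0 = np.asarray(h0, dtype=float)
    n = np.arange(len(h0))
    return ((-1.0) ** n) * h0[(1 - n) % len(h0)]

# Haar low-pass filter: the derived band-pass filter has the expected taps.
h0 = np.array([1.0, 1.0]) / np.sqrt(2)
print(highpass_from_lowpass(h0))   # [ 0.7071 -0.7071]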
Figure 2.42 shows the functional scheme of the biorthogonal wavelet transform, for a one-dimensional signal f(x), based on the four filters that perform the analysis and synthesis phases. The construction of biorthogonal wavelet families has been proposed in [27] by defining some properties assigned to the ψ̃ function to guarantee the regularity in the synthesis. The B-Spline functions (also proposed in [28]) are used as scaling functions φ(x), which have the following properties: compact support, symmetry, good localization, smoothness, and efficient implementation. Figure 2.41 shows some biorthogonal wavelet functions. It is useful to name them as BiorNr.Nd, where Nr is the order of the wavelet (or scaling function) used for the reconstruction, while Nd indicates the order of the function used in the decomposition phase.
Fig. 2.41 Biorthogonal wavelet analysis functions (in the first row) and the associated synthesis wavelets (in the second row), respectively, of Bior 1.3, Bior 2.2, and Bior 4.4
Fig. 2.42 Diagram of the phases of Analysis and Synthesis for the transformation with biorthogonal wavelets
The support for the reconstruction extends to 2Nr + 1 while for the decomposition it is 2Nd + 1. The filter support has a maximum extension of max(2Nr, Nd) + 2. The direct 2D transform based on the biorthogonal wavelets is computed with the wavelet basis functions given by Eq. (2.162), while, for the inverse transform, the biorthogonal wavelets are given by the following:

\tilde{\psi}^1(x, y) = \tilde{\phi}(x)\tilde{\psi}(y) \qquad \tilde{\psi}^2(x, y) = \tilde{\psi}(x)\tilde{\phi}(y) \qquad \tilde{\psi}^3(x, y) = \tilde{\psi}(x)\tilde{\psi}(y)
(2.185)
The 2D biorthogonal FWT transform is implemented with the same scheme described for the 2D orthonormal case.
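As a hedged illustration (the book itself provides no code), the following Python sketch assumes the PyWavelets library, imported as pywt, to perform a single-level 2D decomposition and reconstruction with the biorthogonal wavelet bior2.2; the dec_/rec_ filter pairs of the wavelet object correspond to the analysis and synthesis banks of the scheme in Fig. 2.42.

import numpy as np
import pywt

# Single-level 2D decomposition and reconstruction with a biorthogonal wavelet.
image = np.random.rand(128, 128)

LL, (LH, HL, HH) = pywt.dwt2(image, 'bior2.2')          # analysis (decomposition) filters
restored = pywt.idwt2((LL, (LH, HL, HH)), 'bior2.2')    # synthesis (dual) filters

print(np.allclose(restored, image))   # perfect reconstruction up to numerical error

# The four filters of the analysis/synthesis scheme: low- and high-pass pairs.
w = pywt.Wavelet('bior2.2')
print(len(w.dec_lo), len(w.dec_hi), len(w.rec_lo), len(w.rec_hi))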
2.12.8 Applications of the Discrete Wavelet Transform

The wavelet theory has made a significant contribution to the development of new algorithms for the analysis and synthesis of signals and images. This has allowed its diffusion in various scientific disciplines (telecommunications, computer vision, geophysics, astrophysics, medicine, …), with a consequent impact on various industrial applications (compression, encoding, and transmission of signals and images; noise removal; and several other applications).
Fig. 2.43 Application of the wavelet transform for image compression. The level of decomposition used is equal to 4 and several wavelets were considered: Haar (db1), db5, bior2.2, db45. The reconstructed images report the thresholds and the compression results obtained for each wavelet
The basic strategy in the use of the wavelet transform, in analogy with the other transformations described, consists in manipulating the results of the analysis phase, that is, in adequately modifying the coefficients, exploiting their local nature, according to the result to be achieved (compression, noise removal, …); the modified coefficients are then reprocessed, during the synthesis phase, for the reconstruction of the signal or image.
2.12.8.1 Image Compression

The wavelet decomposition applied to an image makes it possible to analyze its informative content at different levels (calculation of the entropy) and to determine the optimal decomposition level. In essence, from the analysis of the DWT coefficients, those associated with the high-frequency bands and of little informative content (i.e., lower than a certain threshold) are filtered out, while the most significant ones are quantized. Figure 2.43 shows the results of the DWT2 transform for compressing the image, applying a decomposition level of up to 4 and using a bank of low-pass and high-pass filters associated with different types of wavelets. The thresholds, the level of decomposition, and the type of wavelet to be adopted are chosen as a compromise between the quality to be reached and the desired level of compression. These aspects are also considered in relation to the computational complexity and the quality of the original image. Previously, it was shown that the discrete cosine transform (DCT) was adopted as the JPEG image compression standard in 1991.
It is also known that, for high compression factors, a JPEG image, processed in 8×8 blocks, begins to exhibit the mosaic effect of the blocks as well as noticeable chromatic variations. A new standard, JPEG 2000, has been adopted, based on the DWT transform. In particular, for lossy data compression the biorthogonal Daubechies 9/7 wavelets are adopted, while for lossless compression the Daubechies 5/3 wavelets are used. At low compression levels, the performances of JPEG and JPEG 2000 are comparable. At high compression factors, the performance of JPEG 2000 is better, without the mosaic effect of the blocks mentioned above (see for comparison the results of the DCT in Fig. 2.4). This new compression standard is based on the wavelet theory, discussed in the previous paragraphs, which maintains most of the information content related to the coefficients of the LL approximation bands. The JPEG 2000 standard requires greater computational complexity at the same compression factor.
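The compression strategy of Fig. 2.43 can be sketched as follows in Python with PyWavelets (an assumed library, not the implementation used for the figure): decompose to level 4, hard-threshold the detail coefficients, and measure the fraction of discarded coefficients and the RMS error. The threshold value, the wavelet, and the synthetic test image are arbitrary illustrative choices.

import numpy as np
import pywt

def dwt_compress(image, wavelet='db5', level=4, thresh=10.0):
    """Zero the small detail coefficients; keep the approximation band untouched."""
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=level)
    new_coeffs = [coeffs[0]]                      # LL approximation at the coarsest level
    for detail in coeffs[1:]:                     # (LH, HL, HH) at each level
        new_coeffs.append(tuple(pywt.threshold(c, thresh, mode='hard') for c in detail))
    rebuilt = pywt.waverec2(new_coeffs, wavelet)[:image.shape[0], :image.shape[1]]

    kept = sum(np.count_nonzero(c) for band in new_coeffs[1:] for c in band)
    total = sum(c.size for band in coeffs[1:] for c in band)
    rms = np.sqrt(np.mean((rebuilt - image) ** 2))
    return rebuilt, 100.0 * (1 - kept / total), rms

img = np.random.rand(256, 256) * 255              # synthetic test image
_, discarded_pct, rms = dwt_compress(img)
print(f"discarded detail coefficients: {discarded_pct:.1f}%, RMS error: {rms:.2f}")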
2.12.8.2 Noise Attenuation

Another effective application of the DWT concerns the attenuation of the noise present in an image (or signal). Unlike the filtering operations described above, which are global and can generate undesirable changes to the signal, with the DWT transform this problem is attenuated thanks to its effective local analysis characteristics. The procedure in this case first involves the wavelet decomposition of the image up to a certain level; the coefficients are then manipulated by applying two types of thresholds. Figure 2.44 shows how a 1D signal is modified by applying to the detail coefficients (which are normally influenced by noise) the thresholds according to the following criterion:

(a) Hard Threshold: Set to zero all the coefficients g_{in}(x) whose magnitude is less than a certain threshold τ, that is

g_{out}(x) = \begin{cases} g_{in}(x) & \text{if } |g_{in}(x)| > \tau \\ 0 & \text{if } |g_{in}(x)| \le \tau \end{cases} \qquad (2.186)
Fig. 2.44 Application of the thresholding (hard and soft) to the coefficients of the DWT
Fig. 2.45 Application of the wavelet transform to attenuate 0.1% Gaussian noise in an image. The level of decomposition used was equal to 4 and different wavelets were considered: db2, db3, db5, bior2.2, db45, bior4.4, and bior6.8
(b) Soft Threshold: In addition to the hard threshold, the soft threshold is applied, which consists in attenuating the coefficients g_{in}(x) by the threshold value τ if they exceed the threshold in absolute value, that is

g_{out}(x) = \begin{cases} g_{in}(x) - \tau & \text{if } |g_{in}(x)| > \tau \\ 0 & \text{if } |g_{in}(x)| \le \tau \end{cases} \qquad (2.187)

Once the coefficient modification has been applied, the image can be rebuilt. The effect of the soft threshold also accentuates the compression of the image. Basically, with the application of the thresholds, only the coefficients with a high absolute value are involved in the synthesis phase. Figure 2.45 shows the effectiveness of the DWT transform for the attenuation of the Gaussian noise (0.1%) present in the test image. In these noise conditions and for the test image used, the best result is obtained with the soft threshold and with a value of τ = 0.1. The different wavelets applied introduce noticeable variations also in the level of sharpness of the reconstructed image. With the DWT it is possible to attenuate any type of noise, and obviously the results depend on the signal-to-noise ratio intrinsic to the image.
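A minimal sketch of the denoising procedure of Eqs. (2.186)–(2.187), again assuming PyWavelets, and interpreting the 0.1% Gaussian noise as a variance of 0.001 on an image normalized in [0, 1]; the synthetic test image, wavelet, and decomposition level are illustrative, with τ = 0.1 as in the text.

import numpy as np
import pywt

def wavelet_denoise(noisy, wavelet='db5', level=4, tau=0.1):
    """Soft-threshold the detail coefficients, keep the approximation, reconstruct."""
    coeffs = pywt.wavedec2(noisy, wavelet, level=level)
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(c, tau, mode='soft') for c in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet)[:noisy.shape[0], :noisy.shape[1]]

clean = np.tile(np.linspace(0.0, 1.0, 256), (256, 1))                 # smooth test image
noisy = clean + np.random.normal(0.0, np.sqrt(0.001), clean.shape)    # 0.1% Gaussian noise (assumed as variance 0.001)
out = wavelet_denoise(noisy)
print(np.mean((noisy - clean) ** 2), np.mean((out - clean) ** 2))     # error before/after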
2.12.8.3 Edge Extraction

The peculiarity of the wavelet transform of operating by analyzing the image (or signal) at different scales is also useful for highlighting local variations in gray level, due to the edges, discriminating them from noise. This is achieved by decomposing the image at various scales, during the analysis phase, through the low-pass and high-pass filtering operations, choosing the appropriate filters.
Fig. 2.46 Application of the wavelet transform for edge detection in an image. The level of decomposition used was equal to 2 and several wavelets were considered: Haar (db1), db2, db5, db45, bior6.7, coif2, and sym2
The extraction of the contours is done by zeroing the coefficients of the approximation band. To extract edges oriented in certain directions (horizontal, vertical, or oblique), the coefficients associated with the directional detail bands can be zeroed out instead. Figure 2.46 shows some edge extraction results (zero crossing) using the DWT transform with different types of wavelets (including the orthogonal symmetric wavelets). The best results were obtained by decomposing the image up to level 2.
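The edge-extraction idea (zeroing the approximation band and reconstructing from the detail bands) can be sketched as follows, again assuming PyWavelets; the synthetic step image is used only so that the behavior can be checked without an external file.

import numpy as np
import pywt

def wavelet_edges(image, wavelet='db2', level=2):
    """Zero the coarsest approximation band so that only detail (edge) information remains."""
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=level)
    coeffs[0] = np.zeros_like(coeffs[0])            # remove the LL approximation band
    edges = pywt.waverec2(coeffs, wavelet)
    return np.abs(edges[:image.shape[0], :image.shape[1]])

# A vertical step edge: the response concentrates along the discontinuity.
img = np.zeros((128, 128))
img[:, 64:] = 1.0
resp = wavelet_edges(img)
print(resp[:, 60:68].max() > resp[:, :32].max())    # True: strongest response at the edge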
2.12.8.4 Various Applications of the DWT

The DWT transform has been used, in recent years, in several other areas of computer vision and automation, which we report concisely.

(a) For the classification of different textures, the algorithms previously developed were based on Gabor filters. The advantage obtained with the DWT transform is in terms of computational cost and especially in the completeness of the frequency domain.
(b) Fingerprint recognition. The DWT transform brought to this sector benefits in terms of data compression and fast recognition, considering the large volume of data (millions of fingerprints to be managed) and the near real-time responsiveness required in these applications.
(c) Motion analysis and control. Several algorithms were based on optical flow. DWT-based algorithms have improved the accuracy of motion parameters for vehicle tracking. In particular, Mallat's wavelet-based DWT was effective for
the automatic detection of objects, while the flow of motion is detected with the Gabor wavelet.
(d) General applications of CWT and DWT: overcoming the applicability limits of the Fourier transform, which analyzes only the global frequencies present in a signal and loses local information. The CWT instead allows the multiresolution analysis of the local frequencies contained in a signal in the various decomposition bands. The filter bank contains the wavelet filters with which to extract the frequency content of the signal in the various bands. Important results have been obtained for the analysis of NMR (Nuclear Magnetic Resonance) and ECG (Electrocardiography) signals, to solve the problem of locating the peaks filtered by noise and to reconstruct the signal starting from the elementary waveforms.
2.13 Summary of the Chapter

In this chapter, several fundamental linear transformations have been described. In the field of image processing, they are used to improve visual quality, to extract significant information, to reduce noise, to reduce data dimensionality, for image compression, etc. All the transformations considered project the input data into a new space with the aim of extracting the most significant characteristics from a signal or image and reducing the information content as much as possible. The transform is characterized by the transformation matrix (or more matrices) that generates the set of new data (the transform coefficients) from the inner product between the input data and the rows of the transformation matrix itself. The one of greatest interest is the unitary transform, that is, a linear operator with the characteristic of being invertible, with the kernel (the transformation matrix) satisfying the orthogonality conditions. It follows that the inverse transform is also realized as an inner product between the coefficients and the rows of the inverse matrix of the transform. The direct transformation is also called the analysis or decomposition phase of the input data, while the inverse transformation is called the synthesis or reconstruction phase of the input data, starting from the manipulated coefficients. Two-dimensional unitary transforms are expressed through the basis images obtained as an outer product of the rows of the transformation matrix. In other words, a transform can be considered as an operator that decomposes the input data (signal or image) into a generalized domain, also known as the spectral domain of the transform. The spectral components in this transform domain represent the energy or the information content of the input data. In this context, the concept of frequency is considered generalized and not exclusive to the basis functions of the sine and cosine. Some transformations can be characterized by the effect of multidimensional rotation introduced by the new coordinates. Each transformation is characterized by its transformation matrix. The desired effects on an image can be obtained by operating directly in the spectral domain of the transforms and then reconstructing the processed image with the inverse transform,
thus observing the results of the transformations. This way of operating on the image has already been considered in Chap. 9 Vol. I with digital filtering operators. In particular, the classical filtering of an image was considered with the convolution operator of two images, which, when performed in the Fourier space, corresponds to the product of the transforms of the two images: the first is the input image to be processed, and the second is the filter or kernel matrix representing how to modify the input image. With the unitary transformations, this concept of digital filtering is generalized by operating in the spectral domain of the generic transformation, such as the Cosine, Haar, Walsh, etc. transforms. In this way, the filtering operation is influenced by the properties of the transforms themselves, i.e., the characteristics of the spectrum, in terms of symmetry and compactness. For some of these transformations (DCT, Hadamard, Walsh, Haar) the transformation matrix does not depend on the input data. It has been seen instead that for the PCA transform, which has the characteristic of optimal energy compactness and of complete decorrelation of the components, the transformation matrix is calculated from the statistics of the input data. In the first case, with the transformation matrix independent of the input data, it is possible to implement fast image compression algorithms, strategic especially for transmission. With the PCA, on the other hand, depending on the statistics of the input data, the conditions for reducing the dimensionality are realized, strategic when there are large volumes of data with high dimensionality due to the need to acquire, for example, multispectral images (dozens of sensors for the various spectral bands) and multitemporal images (acquired from satellite or airplane). Finally, the wavelet transform has been described, which is characterized with respect to the others by its ability to accurately detect the frequency content together with spatial or temporal localization information, thus overcoming the limits of the Fourier transform, with which local information is lost.
References

1. K.R. Rao, N. Ahmed, T. Natarajan, Discrete cosine transform. IEEE Trans. Comput. C-23(1), 90–93 (1974)
2. K.R. Castleman, Digital Image Processing, 1st edn. (Prentice Hall, Upper Saddle River, 1996). ISBN 0-13-211467-4
3. A.K. Jain, Fundamentals of Digital Image Processing, 1st edn. (Prentice Hall, Upper Saddle River, 1989). ISBN 0133361659
4. W.K. Pratt, Digital Image Processing (Wiley, New York, 1991)
5. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
6. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8), 1010–1018 (1984)
7. R.E. Woods, R.C. Gonzalez, Digital Image Processing, 2nd edn. (Prentice Hall, Upper Saddle River, 2002). ISBN 0201180758
8. J.L. Walsh, A closed set of normal orthogonal functions. Amer. J. Math. 45(1), 5–24 (1923)
9. H.C. Andrews, B.R. Hunt, Digital Image Restoration (Prentice Hall, Upper Saddle River, 1977)
10. K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(11), 559–572 (1901)
11. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441 and 498–520 (1933)
12. K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys. 37, 1–79 (1947)
13. M. Loève, Probability Theory I, vol. I, 4th edn. (Springer, Berlin, 1977)
14. I.T. Jolliffe, Principal Component Analysis, 2nd edn. (Springer, New York, 2002)
15. Q. Zhao, C. Lv, A universal PCA for image compression, in Lecture Notes in Computer Science, vol. 3824 (2005), pp. 910–919
16. W. Ray, R. Driver, Further decomposition of the Karhunen-Loève series representation of a stationary random process. IEEE Trans. 16(6), 663–668 (1970)
17. D. Qian, J.E. Fowler, Low-complexity principal component analysis for hyperspectral image compression. Int. J. High Perform. Comput. Appl. 22, 438–448 (2008)
18. M. Turk, A. Pentland, Eigenfaces for recognition. J. Cogn. Neurosci. 3(1), 71–86 (1991)
19. G. Strang, Linear Algebra and Its Applications, 4th edn. (Brooks Cole, Pacific Grove, 2006)
20. D. Gabor, Theory of communication. IEEE Proc. 93(26), 429–441 (1946)
21. J. Morlet, A. Grossmann, Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J. Math. Anal. 15(4), 723–736 (1984)
22. S.G. Mallat, A theory of multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
23. Y. Meyer, Wavelets: Algorithms and Applications (Society for Industrial and Applied Mathematics, Philadelphia, 1993), pp. 13–31
24. I. Daubechies, Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math. 41, 909–996 (1988)
25. R. Ryan, A. Cohen, Wavelets and Multiscale Signal Processing, 1st edn. (Chapman and Hall, London, 1995)
26. M. Vetterli, C. Herley, Wavelets and filter banks: theory and design. IEEE Trans. Signal Process. 40, 2207–2232 (1992)
27. I. Daubechies, J.-C. Feauveau, A. Cohen, Biorthogonal bases of compactly supported wavelets. Commun. Pure Appl. Math. 45(5), 485–560 (1992)
28. C.K. Chui, J.-Z. Wang, On compactly supported spline wavelets and a duality principle. Trans. Amer. Math. Soc. 330(2), 903–915 (1992)
3
Geometric Transformations
3.1 Introduction

The geometric transformations are necessary in different applications, both to correct geometric distortions introduced during image acquisition (for example, images acquired while the objects or sensors are in motion, as in the case of satellite and/or aerial acquisitions) and to introduce desired visual geometric effects. In both cases, the geometric operator must reproduce the image with the same radiometric information as faithfully as possible. A geometric transformation modifies the location of pixels within an image (see Fig. 3.1). This is accomplished by choosing an appropriate mathematical relation that links the spatial coordinates (x, y) of the pixels of the input image f(x, y) with the new coordinates (x', y') of the output image g. In the context of digital images, a geometric transformation consists of two operators:

(a) Geometric operator, which defines the spatial transformation, pixel by pixel, between the coordinates (x, y) of the input image f and the coordinates (x', y') of the output image g.
(b) Interpolation operator, which assigns an appropriate gray level to each pixel of the output image g by interpolating the pixel values in the vicinity of the homologous pixel of the input image f.
3.2 Homogeneous Coordinates

Below we introduce some notations concerning the geometric analysis of 2D and 3D shapes. Usually, we refer to 2D points to indicate the position of the pixels in the image by the vector x = (x, y)^t ∈ ℝ² (in Euclidean coordinates).
Fig. 3.1 Example of a generic geometric transformation
These points can also be expressed in three homogeneous coordinates, x̃ = (x̃, ỹ, w)^t ∈ P², where P² is called the projective plane. In general, a point in an N-dimensional Euclidean space is represented as a point in an (N + 1)-dimensional projective space. To transform a point from the projective plane into Euclidean coordinates, we simply divide by the third homogeneous coordinate: (x, y) = (x̃/w, ỹ/w) with w ≠ 0. The relation that binds a point in homogeneous coordinates to its Euclidean (inhomogeneous) coordinates is

\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, w)^t = w\left(\frac{\tilde{x}}{w}, \frac{\tilde{y}}{w}, \frac{w}{w}\right)^t = w(x, y, 1)^t = w\bar{\mathbf{x}} \qquad (3.1)

where x̄ = (x, y, 1)^t, called the augmented vector, represents in the projective plane (third coordinate equal to 1) the same point (x, y) of the Euclidean plane. From Eq. (3.1) we have that the point (x, y, 1) is the same point w x̄ = (wx, wy, w) for any value of w ≠ 0. It follows that the homogeneous coordinates are not unique, and we can state that the same point can be represented for different values of a constant α ≠ 0, i.e., (x̃, ỹ, w) = (α x̃, α ỹ, αw). Therefore, with α = 0 the point in homogeneous coordinates (0, 0, 0) is not allowed. The constant α actually introduces an unimportant scale factor on the coordinates (x̃, ỹ, w) of a point, which is why they are called homogeneous coordinates. Let us now look at the peculiarity of homogeneous coordinates. We have seen that the point (0, 0, 0) is not allowed in the projective plane, but the points x̃₀ = (a, b, 0), i.e., points with zero in the third coordinate (the scaling homogeneous coordinate), are possible with a, b ≠ 0. The point x̃₀ of the projective plane does not have a corresponding point (a/0, b/0) in the Euclidean plane. To understand the meaning of the point x̃₀, we consider a geometric interpretation of the homogeneous coordinates, imagining the third coordinate with w = 1, which generates a plane (π ⇒ z = w = 1) in 3D space. We can consider the planar coordinates as the intersection with this plane of a projection line that crosses the origin of the 3D reference system. If we now consider how a point in homogeneous coordinates x̂ = (a, b, w) varies together with its correspondent (a/w, b/w) in Euclidean coordinates, we have that if w tends to ∞ the point tends to the origin of the Euclidean plane, while if w tends to zero, the point moves in the direction defined by (a, b) to represent a point at infinity (also called ideal point). The set of all points at infinity represents the line at infinity. The Euclidean plane augmented
with all the points at infinity is called the projective plane P², and its points are called projective points. For homogeneous coordinates in 3D space the same approach can be applied. A point x = (x, y, z)^t ∈ ℝ³ is represented by x̃ = (x̃, ỹ, z̃, w)^t ∈ P³ in homogeneous coordinates, where P³ is called the projective space. If w ≠ 0 we obtain the associated point in the Euclidean space given by x = (x̃/w, ỹ/w, z̃/w)^t. In analogy with the definition of the projective plane P², the set of points at infinity of P³ is called the plane at infinity. Homogeneous coordinates have numerous advantages. As we will see below, they are simple and efficient tools to perform geometric transformations by matrix multiplication. The entire set of geometric transformations can be combined into a 4 × 4 matrix (in the case of 3D space) or into a 3 × 3 matrix in the case of two-dimensional Euclidean coordinates.
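A minimal NumPy sketch of Eq. (3.1): points are augmented with a third coordinate equal to 1, transformed by a 3 × 3 matrix, and brought back to Euclidean coordinates by dividing by the third component. The helper names are illustrative, not part of any library.

import numpy as np

def to_homogeneous(pts):
    """(N,2) Euclidean points -> (N,3) augmented vectors (x, y, 1)."""
    return np.hstack([pts, np.ones((len(pts), 1))])

def to_euclidean(pts_h):
    """Divide by the third homogeneous coordinate (assumed nonzero)."""
    return pts_h[:, :2] / pts_h[:, 2:3]

pts = np.array([[4.0, 3.0]])
T = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])        # a translation expressed as a 3x3 matrix
moved = to_euclidean((T @ to_homogeneous(pts).T).T)
print(moved)                            # [[6. 2.]]

# Scaling the homogeneous vector by any nonzero factor leaves the point unchanged.
print(to_euclidean(3.0 * to_homogeneous(pts)))   # [[4. 3.]]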
3.3 Geometric Operator

The mathematical relation linking the pixel coordinates of the input image f with those of the output image can be defined in general form by the following equation:

g(x, y) = f(x', y') = f[T_x(x, y), T_y(x, y)]
(3.2)
where (x', y') are the new transformed coordinates of a pixel that in the input image is located at the coordinates (x, y), while the functions T_x(x, y) and T_y(x, y) uniquely specify the spatial transformations applied, respectively, to the Cartesian components x and y of the pixels of the input image f(x, y). The spatial transformation functions are given by the following general equations:

\text{Direct:}\; x' = T_x(x, y), \;\; y' = T_y(x, y) \qquad \text{Inverse:}\; x = T_x^{-1}(x', y'), \;\; y = T_y^{-1}(x', y') \qquad (3.3)

The direct equations (3.3) determine the new location (x', y') of the pixel in the output image (forward projection or mapping), while the inverse equations recalculate the position of the pixel in the input image before the transformation (backward projection). Geometric transformations can be linear or nonlinear:

(a) Linear (spatial) transformations include simple transformations of translation, rotation, and magnification or reduction.
(b) Nonlinear (spatial) transformations, also called warping transformations, correct the strong curvature deformations introduced in the image acquisition phase.

The type of transformation to be used depends on the application context. For example, in the reconstruction of an image (for example, in the restoration context), to
Fig. 3.2 Direct geometric transformation
eliminate simple optical aberrations or geometric distortions introduced during the acquisition phase, linear spatial transformations can be used. Nonlinear transformations are required in the case of the geometric registration of multiple images acquired at different times and in conditions of nonlinear instability of the sensors, or for nonlinear optical aberrations introduced by the optical system. The instability of the sensor is typical in the multi-temporal acquisition of satellite or aircraft images for monitoring the same territory at different times. Now let us see how the gray levels are associated with the output image g during the process of direct spatial transformation (3.2). A first approach may be to apply the spatial transformation functions pixel by pixel to the coordinates (x_i, y_i) of the input image, obtain the new coordinates (x'_i, y'_i), and then associate them with the gray level of origin f(x_i, y_i). More generally, the equation would be

f(x, y) \xrightarrow{\;T\;} g(x', y') \qquad (3.4)
In other words, each input pixel with coordinates (x, y) is transformed to the new coordinates (x', y') by the transformation functions T = (T_x, T_y) and is associated with the original gray level f(x, y). As shown in Fig. 3.2, the generic geometric transformation T can project the input pixel between four pixels of the output image (case b), which may depend on the numerical resolution of the transform producing coordinates with non-integer values, or exactly onto an output pixel (case a). Basically, we work in the discrete domain with the coordinates of the pixels normally expressed as integers, whereas, in the direct or inverse transformation, we work with real numbers at sub-pixel resolution. These aspects of the resolution and accuracy of the geometric transformation will be considered later in the context of the interpolation operation. In case (b), the gray level of the input pixel influences the gray values to be attributed to the four neighboring pixels in the output image, according to an established interpolation model. This last operation will be analyzed in detail below. Returning to the type of transformation adopted (Eq. 3.4), we have that every pixel of f(x, y) is projected, by means of T, to produce a new image g(x', y').
Fig. 3.3 Example of inverse geometric transformation
This transformation does not seem to operate properly; in fact, depending on the type of transformation T (especially when an image is magnified, or with warping transformations), the output image g(x', y') may contain empty regions where no input pixels were projected. For example, the image of Fig. 3.11a has been rotated 15◦ counterclockwise, and in the black area (the empty area) of the output image in Fig. 3.11b no pixels of the input image were projected, as they were not available. In fact, it can be verified with the inverse transform (see Fig. 3.3), applied to the pixels of the empty area, that they fall in areas outside the original image. An alternative approach, to eliminate the drawback of empty regions, is to associate the gray levels with the output image by first applying the inverse geometric transform T^{-1} as follows:

g(x, y) \xleftarrow{\;T^{-1}\;} f(x', y') \qquad (3.5)

with

x = T_x^{-1}(x', y'), \qquad y = T_y^{-1}(x', y') \qquad (3.6)
from which it follows that every pixel of the output image is re-transformed (projected backward) into the input image to inherit the corresponding gray level. In other words, each output pixel (x', y') is geometrically transformed by the inverse function T^{-1} in order to relocate its position (x, y) in the input image and then associate the corresponding gray level. This approach has the advantage of avoiding voids in the output image, provided the corresponding pixels in the input image are available. Normally this occurs by choosing appropriate scales in the geometric transformation. With this approach, we use the input image as a look-up table for the association of gray levels between the output pixels (x', y') and the input pixels (x, y). It can be verified, even with the inverse transform, that an output pixel, reprojected backward, can be located between four pixels of the input image, as highlighted for case (b) in Fig. 3.3. In this case, the gray level of the output pixel would be influenced by the gray levels of the four closest pixels according to the interpolation operation, to be defined. It can be concluded that:
(a) Direct transformation, Eq. (3.4), can be used for simple linear transformations where it is certain that no empty regions appear in the output image;
(b) Inverse transformation, Eq. (3.5), is instead suggested for all nonlinear transformations, also taking into account the scale factor.
3.3.1 Translation

The translation operator is a linear transformation that produces the image g(x', y') by translating all the pixels of an image f(x, y) by a quantity t_x along the x coordinate axis and by a quantity t_y along the y coordinate axis, with respect to the origin of the input coordinate axes. The direct transformation is given by the following equations:

x' = x + t_x, \qquad y' = y + t_y \qquad (3.7)

where (x', y') are the output coordinates, (x, y) are the input ones, and t_x and t_y are the translation constants (offsets) with respect to the x and y axes, respectively. The inverse transformation is given by

x = x' - t_x, \qquad y = y' - t_y \qquad (3.8)
If the translations t_x and t_y are not integer values, then the transformed coordinates (x', y'), or the inverse coordinates (x, y), do not fall on a grid point but between four adjacent pixels. In this case, it is necessary to use interpolation to estimate the pixel value in the output image.
3.3.2 Magnification or Reduction

The coordinate transformations to enlarge or reduce an image are given by the equations

x' = x \cdot S_x, \qquad y' = y \cdot S_y \qquad (3.9)

with
S_x > 1 and S_y > 1 ⇒ magnification
0 < S_x < 1 and 0 < S_y < 1 ⇒ reduction
S_x = 1 and S_y = 1 ⇒ no effect

where (x, y) are the coordinates of the input image, (x', y') are the coordinates of the transformed image, and S_x and S_y indicate the magnification or reduction factors in the direction of the x and y axes, respectively. If S_x and S_y are real (non-integer), interpolation is required to assign the pixel value. The inverse transformation is given by the equations

x = x'/S_x, \qquad y = y'/S_y \qquad (3.10)
Fig. 3.4 Rotation of an image by an angle θ. A point P in the input coordinate system (x, y) is located at P' in the output coordinate system (x', y')
3.3.3 Rotation

An image can be rotated by an angle θ around the origin of the Cartesian axes by transforming the input coordinates (x, y) of each pixel, considering them as points that rotate with respect to an axis. With reference to Fig. 3.4, let us consider the point P with coordinates (x, y), represented by the vector v_P, which forms an angle α with respect to the x axis. After a rotation by an angle θ, the point P moves to P' with new coordinates (x', y'). With the known trigonometric formulas, the relationships between the coordinates of P and P' are given by

x = \rho\cos\alpha, \;\; y = \rho\sin\alpha \qquad x' = \rho\cos(\alpha + \theta), \;\; y' = \rho\sin(\alpha + \theta) \qquad (3.11)

Applying the trigonometric sum formulas and substituting, we obtain the transformation equations for the rotation:

x' = x\cos\theta - y\sin\theta, \qquad y' = x\sin\theta + y\cos\theta \qquad (3.12)
where (x', y') are the output coordinates after the rotation by an angle θ (positive for counterclockwise rotations) with respect to the x axis of the input image. From these last equations, it is also possible to derive the inverse functions to locate the transformed pixel (x', y') back in the input image at (x, y). However, it is more convenient to operate in matrix terms, as will be shown below.
3.3.4 Skew or Shear

This geometric transformation introduces a translation of the pixels, pushing them by an angle θ in one direction only with respect to an axis and leaving the coordinates unchanged along the other axis. The skew transformation equations in the direction of the x axis and of the y axis are, respectively,

x' = x + y\tan\theta, \;\; y' = y \qquad \text{and} \qquad x' = x, \;\; y' = y + x\tan\theta
As shown above, the amount of shearing can also be expressed by a factor sh_x if the push is in the x direction or a factor sh_y if the push is along the y axis.
3.3.5 Specular

This geometric transformation produces the same effect as the image reflected by a mirror. It is obtained by inverting the order of the pixels of each row of the input image. The transformation equations are

x' = L - x + 1, \qquad y' = y

where L is the size (in pixels) of the image in the direction of the rows. A vertical mirror transformation is obtained by inverting the order of the pixels of each column of the input image:

x' = x, \qquad y' = L - y + 1

where L is the vertical dimension (in pixels) of the image.
3.3.6 Transposed

This transformation is obtained, as in the transposition of matrices, by exchanging the rows with the columns. The first row of the input image becomes the first column of the output image (Fig. 3.5). If the image is not square, the horizontal and vertical dimensions are exchanged in the output image. This transformation is often used when working with separable convolution masks (see Sect. 9.9.1 Vol. I).
3.3.7 Coordinate Systems and Homogeneous Coordinates

Consider an orthonormal coordinate system (O, i, j, k), with O a point of the Euclidean space ℝ³ representing the origin, and i, j, and k the three orthonormal base vectors (called unit vectors or versors).
Fig. 3.5 Example of transposed geometric transformation
The coordinates expressed with respect to an orthonormal basis are called Cartesian coordinates. The three mutually perpendicular axes are normally denoted by x, y, z. Each point P(x, y, z) in 3D space can be represented by a vector v as a linear combination of the three basic versors in the vector form v = xi + yj + zk. An image is normally represented in the Euclidean plane, and all transformations will be considered in a two-dimensional plane. The homogeneous coordinates have already been introduced in Sect. 3.2. In this context they offer an alternative way to represent translation and rotation transformations in a simpler form, allowing both to be expressed as a matrix multiplication. More generally, homogeneous coordinates provide a useful mathematical form for the representation of points in perspective geometry and in projective geometry. The latter is widely used in geometric transformations for digital images. The homogeneous coordinates in the Euclidean plane are expressed with triples of real numbers. For them, the relationships of Sect. 3.2 are valid, and it is useful to recall the following properties:

(a) The homogeneous coordinates of a point are defined up to a proportionality factor. For example, the homogeneous coordinates (4, 3, 2, 1), (8, 6, 4, 2), i.e., the generic quadruple (4k, 3k, 2k, k) with k ≠ 0, all represent the same physical 3D point, the one that has the Cartesian coordinates (4, 3, 2) in the Euclidean space.
(b) The points with homogeneous coordinates with w = 0 are defined as points that represent only the direction of the vector defined by (X, Y, Z) and have no geometrical meaning in the Euclidean space.

Homogeneous coordinates, therefore, have the disadvantage of a non-unique representation of the points. They have the peculiarity of defining the improper points that allow the so-called point at infinity (also called ideal point) to be represented. For example, in 2D a point at infinity P∞ is identified when a line is fixed, and P∞ is said to be the improper point of this line. In the context of geometric transformations, homogeneous coordinates have the utility of allowing even more general geometric transformations to be represented in matrix form.
3.3.8 Elementary Homogeneous Geometric Transformations

A geometric transformation is called homogeneous when the input and output coordinates are expressed in homogeneous coordinates. The elementary operators described above can be expressed in matrix form as

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \mathbf{T} \cdot \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (3.13)

where T indicates the 3 × 3 matrix that characterizes the type of geometric operator, while the points of the transformation are expressed in homogeneous coordinates.
3.3.8.1 Translation in Homogeneous Coordinates

Suppose we want to translate all the pixels of an image by adding a constant translation value (t_x, t_y). This transformation cannot be obtained with a 2 × 2 matrix, so we add to the coordinates (x, y) of each pixel a third coordinate of value 1, expressing them in homogeneous coordinates (x, y, 1). For this purpose, the 2D translation matrix T is constructed by considering in the first two columns the versors of the horizontal and vertical homogeneous coordinate axes, which are, respectively, (1, 0, 0)^t and (0, 1, 0)^t, and in the third column the origin of the translation expressed in homogeneous coordinates (t_x, t_y, 1)^t, thus obtaining the translation operator given by

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ 1 \end{bmatrix} \qquad (3.14)

where the last column indicates the output point translated by (t_x, t_y) in homogeneous coordinates. The inverse translation transform is given by

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & -t_x \\ 0 & 1 & -t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \qquad (3.15)
3.3.8.2 Rotation in Homogeneous Coordinates

In Sect. 3.3.3 we defined the counterclockwise rotation by an angle θ in standard Cartesian coordinates, given by the matrix \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. If θ is negative we have the clockwise rotation matrix \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}. We can perform rotations using a 3 × 3 matrix by placing the 2 × 2 rotation matrix in its upper left corner. The direct and inverse homogeneous transformations for the rotation by an angle θ in 3 × 3 matrix form are given by the following:
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}; \qquad \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
(3.16)
3.3.8.3 Rotation and Translation Combined in Homogeneous Coordinates

Normally the translation by (t_x, t_y, 1)^t and the rotation by an angle θ with respect to the origin can be combined (see Fig. 3.6) into a single roto-translation matrix. Be careful, however, that the multiplication of transformation matrices is, in general, not commutative. Therefore, combining the rotation followed by the translation gives a different result from the combination of translation followed by rotation. That said, performing the roto-translation geometric transformation in Cartesian coordinates we would obtain as a result

x' = x\cos\theta - y\sin\theta + t_x, \qquad y' = x\sin\theta + y\cos\theta + t_y
Fig. 3.6 Combined rigid transformation of roto-translation
The same result is achieved with the matrix product, performing the combined roto-translation transformation in homogeneous coordinates and obtaining the following concatenated transformation:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad \Longleftrightarrow \quad \mathbf{x}' = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^t & 1 \end{bmatrix} \cdot \mathbf{x} \qquad (3.17)
where R is the rotation matrix and t the translation vector. The transformation has 3 degrees of freedom, 2 for the translation given by the vector components (t_x, t_y) and one for the rotation, expressed by the angle θ. The 3 × 3 homogeneous transformation matrix T reported in (3.17) is obtained by first applying the rotation and then the translation. These operations are not commutative. The inverse transformation of the roto-translation results in

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & -t_x\cos\theta - t_y\sin\theta \\ -\sin\theta & \cos\theta & t_x\sin\theta - t_y\cos\theta \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \qquad (3.18)

From the analysis of the transformation matrices we have that, in pure rotation, the first two columns (three in the 3D case) of the matrix correspond to the transformed versors of the input reference axes, thus indicating the new directions. The last column, (0, 0, 1)^t, on the other hand, represents the transformation of the origin, which obviously has no effect in the simple rotation. In simple translation, instead, the last column represents the translated origin (t_x, t_y, 1)^t. The elementary transformations of translation and rotation, and the combined transformations of roto-translation, constitute rigid transformations (also called isometric transformations), keeping the lengths and angles of the objects unchanged. Basically, the shape and geometry of the object do not change. In other words, lines are transformed into lines, planes into planes, curves into curves. Only the position and orientation of an object are changed. From the algebraic point of view, this means that, being Euclidean transformations, the bases of the transformation, represented by the versors that give the direction of the coordinate axes, satisfy the condition of orthogonality between them and of orthonormality, imposing unit length. It follows that for the geometric transformation matrix (3.13) the condition T^{-1} T = I must hold, i.e., the transformation must be unitary.
If the translation with respect to the origin is applied first, followed by the rotation by an angle θ, we will have the following equations for the forward and inverse translation–rotation transforms:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x\cos\theta - t_y\sin\theta \\ \sin\theta & \cos\theta & t_x\sin\theta + t_y\cos\theta \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad \Longleftrightarrow \quad \mathbf{x}' = \begin{bmatrix} \mathbf{R} & \mathbf{t}_{TR} \\ \mathbf{0}^T & 1 \end{bmatrix} \cdot \mathbf{x} \qquad (3.19)

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & -t_x \\ -\sin\theta & \cos\theta & -t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \qquad (3.20)

where R is the rotation matrix and

\mathbf{t}_{TR} = [\,t_x\cos\theta - t_y\sin\theta \quad t_x\sin\theta + t_y\cos\theta\,]^t

is the translation vector.
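The following NumPy sketch (with illustrative helper names) builds the elementary homogeneous matrices and verifies that rotation followed by translation, Eq. (3.17), differs from translation followed by rotation, Eq. (3.19), since matrix multiplication is not commutative.

import numpy as np

def translation(tx, ty):
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

theta, tx, ty = np.deg2rad(30), 5.0, 2.0
T_rot_then_transl = translation(tx, ty) @ rotation(theta)   # Eq. (3.17): rotate first, then translate
T_transl_then_rot = rotation(theta) @ translation(tx, ty)   # Eq. (3.19): translate first, then rotate
print(np.allclose(T_rot_then_transl, T_transl_then_rot))    # False: not commutative

p = np.array([1.0, 0.0, 1.0])   # a point in homogeneous coordinates
print(T_rot_then_transl @ p)    # rotated by 30 degrees, then shifted by (5, 2)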
3.4 Geometric Affine Transformations

In general, a geometric linear transformation expressed by Eq. (3.2) is known as an affine transformation and also includes the rigid transformations described above. By definition, an affine transformation establishes a one-to-one (or bijective) correspondence (an affinity) between the pixels of the input image and those of the output (transformed) image. An affine transformation is described by the following equations:

x' = a_{11}x + a_{12}y + t_x, \qquad y' = a_{21}x + a_{22}y + t_y
(3.21)
where the output coordinates (x', y') are given by the linear combination of the input coordinates (x, y), weighted according to the parameters a_{ij}, plus the two translation terms (t_x, t_y). Being a bijective transformation, it also admits an inverse affine transformation. It has six degrees of freedom associated with translation, rotation, scale (uniform and nonuniform), shearing, and reflection. These degrees of freedom are associated with the 4 parameters a_{ij} and the two translation parameters (t_x, t_y). The determination of these six parameters establishes the effect of the affine transformation (Fig. 3.7). The main properties of affinity are:

(a) It preserves the collinearity of points;
(b) Parallel lines are projected as parallel lines;
(c) If two lines intersect at a point Q, the lines obtained from the transformation intersect at the transformed point, i.e., Q' = T(Q);
(d) The shape and the angles are normally not preserved; in fact, a rectangle normally becomes a parallelogram and perpendicular lines are not always transformed into perpendicular lines;
(e) The ratio between parallel segments is preserved. In particular, the midpoint of a segment corresponds to the midpoint of the transformed segment.
Fig. 3.7 Geometric affine transformations
3.4.1 Affine Transformation of Similarity

A class of geometric affine transformations is that of the so-called similarity transformations, which maintain the ratios between dimensions and the angles. In particular, these transformations preserve the shape (for example, a circumference remains such), the perpendicularity, the parallelism, and the amplitude of the angles. The similarity transformation is described by (3.21), which can be restricted to represent only the changes of scale, rotation, and translation. We can, therefore, formulate the affinity equation to describe these last elementary transformations. In matrix form and in homogeneous coordinates we can define the similarity transformation as follows:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} S\cos\theta & -S\sin\theta & t_x \\ S\sin\theta & S\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad \Longleftrightarrow \quad \mathbf{x}' = \begin{bmatrix} S\cdot\mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \cdot \mathbf{x} \qquad (3.22)
where S indicates the scale factor (magnification or reduction) and the other parameters are the same as (3.17) (rigid transformation) which describe the rotation and translation. The similarity transformation has four degrees of freedom, two for translation, one for rotation and one for the scale factor (see Fig. 3.8).
3.4.2 Generalized Affine Transformation

From Eqs. (3.21) of the affine transformation we can derive its general formula, that is, represent it in terms of homogeneous coordinates (increasing the dimension by adding a third equation 1 = 0 · x + 0 · y + 1) for efficient computation:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad \Longleftrightarrow \quad \mathbf{x}' = \begin{bmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \cdot \mathbf{x} \qquad (3.23)

where in this case the matrix A represents the general affinity matrix. If A is an orthonormal matrix it represents a simple rotation and the transformation is reduced
Fig. 3.8 Geometric transformation of similarity. Example of an image rotated by 30◦ and magnified by a factor of 2, while the second transformation has rotated the image by −15◦, reduced it by a factor of 2, and translated it by (t_x, t_y). The relative transformation matrices are reported
to a rigid geometric transformation already described (see Fig. 3.6). There is a regular affinity if the determinant of A is different from zero, that is,

\det(\mathbf{A}) = a_{11}a_{22} - a_{12}a_{21} \neq 0

This condition guarantees the existence of the inverse matrix, in accordance with the definition of a bijective correspondence between the input and output points of the transformation. The inverse transform of (3.23) is still an affine transformation, which in the general formulation is given by

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & K_1 \\ C_{21} & C_{22} & K_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}

where C_{ij} and K_i are the coefficients associated, respectively, with the inverse transformation matrix A^{-1} and with the inverse translation vector t. In general, the inverse matrices of complex transformations are obtained using numerical methods. A generic affine transformation is achievable by computing the six coefficients. This is possible by detecting in the input image and in the output image at least 3 corresponding non-collinear points (x_i, y_i), (x'_i, y'_i), i = 1, …, 3, which, substituted in the general Eq. (3.23), give rise to the following solvable system with six unknowns:

\begin{bmatrix} x'_1 & x'_2 & x'_3 \\ y'_1 & y'_2 & y'_3 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ 1 & 1 & 1 \end{bmatrix} \qquad (3.24)

Figure 3.9 shows an example of a geometric affine transformation to correct a (simulated) deformation of an image. Three corresponding points between the deformed image and the reference image are sufficient to determine the affine transformation matrix that corrects the image deformation.
Fig. 3.9 Geometric affine transformation based on 3-point matching to correct a simulated image deformation
The system would be overdetermined if more corresponding points were available; in this case the transformation would be modeled with polynomials of a higher order, which we will describe in the following paragraphs. At this point, in order not to generate confusion, it is important to analyze the difference between the affine transformation matrix and that of the rigid transformation of roto-translation. Both matrices have the same dimensions, 3 × 3, but the rigid transformation matrix, Eq. (3.17), has 3 degrees of freedom, 2 relative to the translation and 1 to the rotation. The affine matrix A instead has six degrees of freedom that are associated with the elementary affine transformations: 2 for the translation, one parallel to the x axis and one to the y axis; 1 for the rotation around the origin; 2 for the scaling, one in the direction of the x axis and one in the direction of the y axis; 2 for the shearing, one in the direction of the x axis and one in the direction of the y axis. These elementary affine transformations are actually 7, while the degrees of freedom are 6, corresponding to the six parameters of the affine matrix A that can be modified to characterize the different geometric transformations. This apparent inconsistency is explained by the twofold elementary transformation function performed by the coefficients a_{ij} of the affine matrix. In fact, if two elementary transformations are combined, one of shearing in the direction of the x axis and one of rotation by 90◦, the shearing transformation in the y direction is generated. For example, with reference to Fig. 3.7 we can observe from the first affine matrix (it generates a rotation and reflection of the image) that the coefficient a_{11} = −0.3 performs a double function, both as a rotational degree of freedom (including the value of the cosine) and, being negative, as a reflection of the image. It is also observed that any non-elementary affinity is obtained as the composition of a linear transformation and a translation. In the following paragraphs, we will analyze in detail the various elementary affine transformations, individually and in combination.
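A possible NumPy sketch of the estimation of the six affine parameters from point correspondences as in Eq. (3.24). The pairing of the coordinates quoted in Fig. 3.9 is assumed here only for illustration; with more than three pairs the same call returns the least-squares solution of the overdetermined system.

import numpy as np

def estimate_affine(src, dst):
    """Solve Eq. (3.24): find the 3x3 affine matrix mapping src (N>=3, 2) onto dst."""
    src_h = np.hstack([src, np.ones((len(src), 1))])       # augmented input points
    # Least-squares fit of [a11 a12 tx; a21 a22 ty] (exact for 3 non-collinear points).
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    A = np.eye(3)
    A[:2, :] = params.T
    return A

src = np.array([[1.0, 1.0], [512.0, 1.0], [512.0, 512.0]])
dst = np.array([[1.0, 1.0], [452.0, 60.0], [467.0, 469.0]])   # assumed correspondence pairing
A = estimate_affine(src, dst)
print(np.round(A, 3))
print(np.allclose((A @ np.append(src[2], 1))[:2], dst[2]))    # maps the third point exactly: True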
3.4.3 Elementary Affine Transformations

Previously we have already considered the rigid transformations of translation, rotation, and combined roto-translation. Elementary affine transformations can be defined by assigning appropriate values to the coefficients a_{ij} in the affinity matrix.
3.4.3.1 Affine Scaling

It is obtained directly by assigning the values (s_x, s_y) (of magnification or reduction) to the coefficients (a_{11}, a_{22}) of the affine matrix A, which, substituted in the general Eq. (3.23), give rise to the following direct and inverse scaling transformations:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} 1/s_x & 0 & 0 \\ 0 & 1/s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \qquad (3.25)
3.4.3.2 Affine Shearing Shearing (also referred to as skewing) can occur along a single coordinate axis by setting the coefficient of the other axis to zero. In this case, the direct and inverse transformations take the following expressions: ⎡ ⎢ ⎢ ⎢ ⎣
⎤
⎡
⎤⎡
⎤
x ⎥ ⎢ 1 sh x 0 ⎥ ⎢ x ⎥ y ⎥⎥⎦ = ⎢⎢⎣ sh y 1 0 ⎥⎥⎦·⎢⎢⎣ y ⎥⎥⎦ 1 0 0 1 1
⎡ ⎢ ⎢ ⎢ ⎣
⎤
⎡
⎤⎡
⎤
x ⎥ ⎢1 −sh x 0 ⎥ ⎢ x ⎥ ⎥ ⎢ y ⎥⎦ = ⎢⎣ −sh y 1 0 ⎥⎥⎦·⎢⎢⎣ y ⎥⎥⎦ 0 0 1 1 1
(3.26)
The shear can also be generalized to 3D, in which planes are translated instead of lines. In a similar way, one could continue with the other elementary transformations. For example, the matrix for horizontal reflection is obtained by setting a_{11} = −1, a_{22} = 1, t_x = x_c and setting the other coefficients to zero. For the vertical reflection the nonnull coefficients are a_{11} = 1, a_{22} = −1, t_y = y_c. The translation coefficients x_c, y_c indicate the center of the image.
3.4.3.3 Compound Affine Transformations

An affine matrix can be realized as the combination of several elementary affine transformations. The relative composite matrix is obtained through the concatenation (product) of the matrices associated with several elementary geometric transformations. For example, if we want to perform the transformations in the sequence translation T, rotation R, scaling S, and shearing SH, we get the following composite transformation matrix:

A = T × R × S × SH
Fig. 3.10 Result of the compound affine transformation carried out by the concatenated matrix of the elementary transformation matrices in the sequence of shearing, scaling, rotation, and translation
where A represents the affine matrix that includes the effects of this sequence of transformations. The inverse transform is given by

A^{-1} = T^{-1} × R^{-1} × S^{-1} × SH^{-1}

The order in concatenating transformations is important because the multiplication between matrices does not satisfy the commutative property. Therefore, in order to obtain the affine transformation matrix A of this sequence, the concatenation must be as follows:

A = SH × S × R × T

For example, given an image of M × N, the related compound matrix, for the following sequence of elementary transformations, is obtained as follows:

\mathbf{A} = \mathbf{SH}(15^\circ, 20^\circ) \times \mathbf{S}(0.6, 0.6) \times \mathbf{R}(30^\circ) \times \mathbf{T}(-M/2, N/2) = \begin{bmatrix} 0.41 & -0.16 & -M/2 \\ 0.48 & 0.6 & -N/2 \\ 0 & 0 & 1 \end{bmatrix}
Figure 3.10 shows the result of this composite matrix, with the translation chosen so that the origin of the reference system is centered in the image. The explicit formula of a compound transformation, in homogeneous coordinates, for the sequence of translation, scaling, and rotation transformations, is given by the following:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} s_x\cos\theta & -s_y\sin\theta & s_x t_x\cos\theta - s_y t_y\sin\theta \\ s_x\sin\theta & s_y\cos\theta & s_x t_x\sin\theta + s_y t_y\cos\theta \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
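The concatenation A = SH × S × R × T and its inverse can be sketched as follows in NumPy; the elementary builders are illustrative, and the exact numerical entries of the result depend on the shear and rotation conventions adopted, so they may not coincide digit for digit with the matrix printed above.

import numpy as np

def T(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], float)

def R(deg):
    c, s = np.cos(np.deg2rad(deg)), np.sin(np.deg2rad(deg))
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def S(sx, sy):
    return np.diag([sx, sy, 1.0])

def SH(degx, degy):
    return np.array([[1, np.tan(np.deg2rad(degx)), 0],
                     [np.tan(np.deg2rad(degy)), 1, 0],
                     [0, 0, 1]])

M, N = 512, 512
A = SH(15, 20) @ S(0.6, 0.6) @ R(30) @ T(-M / 2, N / 2)   # applied right to left: T first
print(np.round(A, 2))

# The inverse of the compound matrix equals the reversed product of the inverses.
A_inv_chain = (np.linalg.inv(T(-M / 2, N / 2)) @ np.linalg.inv(R(30)) @
               np.linalg.inv(S(0.6, 0.6)) @ np.linalg.inv(SH(15, 20)))
print(np.allclose(np.linalg.inv(A), A_inv_chain))          # True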
3.5 Separability of Transformations

The translation and scaling operations can be separated into their horizontal and vertical components. From Eq. (3.7) we can see that x' depends only on x and y' depends only on y. Therefore, the transformation can be performed in two steps: the first step provides an intermediate image in which only the rows are processed (x component); in the second step the columns (y component) of the intermediate image are processed to provide the desired output. For the rotation (Eq. 3.12) we can see that both x' and y' depend on x and y. Therefore, it is not possible at first sight to separate the row–column processing into two independent operations. In fact, rewriting the rotation operation as

\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (3.27)

we have that if we keep y constant and vary x we transform our image by analyzing the lines (scanlines). This is not very effective because it does not prevent aliasing errors. Catmull and Smith [4] have proposed an alternative approach. In the first step x' is evaluated, and then y' in the second step. Keeping y constant we have

\begin{bmatrix} x' \\ y \end{bmatrix} = \begin{bmatrix} x\cos\theta + y\sin\theta \\ y \end{bmatrix} \qquad (3.28)

In this way, we obtain an image that has been distorted and scaled in the x direction, but each pixel retains its original value of y. Subsequently we transform the intermediate image keeping x' constant and calculating y'. Unfortunately, the equation y' = −x sin θ + y cos θ cannot be used directly, because the x values for the vertical scanline are not the correct ones for the equation. Therefore, the values of x' must be inverted to obtain the correct x values. Rewriting x as a function of x' and considering that x' = x cos θ + y sin θ, we obtain

x = \frac{x' - y\sin\theta}{\cos\theta} \qquad (3.29)

Inserting (3.29) into y' = −x sin θ + y cos θ we have

y' = \frac{-x'\sin\theta + y}{\cos\theta} \qquad (3.30)

which provides the rotated version of the image by an angle θ. It must be remembered that a rotation of 90◦ would collapse the intermediate image onto a line. It would, therefore, be preferable to read the image lines horizontally and place the result vertically (columns). In vector notation we can express the transformation of Catmull and Smith as follows:

\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \tan\theta & \frac{1}{\cos\theta} \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (3.31)
Although there are advantages with this two-step procedure, there is the disadvantage of losing high frequency spatial details, due to the intermediate scaling step (Fig. 3.11b). There is also a potential increase in aliasing effect, as we will see later.
3.5 Separability of Transformations
a)
167
b)
c)
Fig. 3.11 Example of rotation of 15◦ of the image in a obtained in two steps. In the first pass shown in b, the first part of the algorithm is applied only to horizontal lines. So with the intermediate image in b the second part of the algorithm is applied to the columns of b to get c
An alternative technique consists of a three-step procedure, in which there is no scaling, and therefore, no loss of high frequency detail occurs. The vector representation of this procedure is given by the following relation: x 1 − tan(θ/2) 1 0 1 0 1 − tan(θ/2) x = (3.32) y 0 1 sin θ 1 sin θ 1 0 1 y which represents a sequence of shearing operations without performing any scaling operation.
3.6 Homography Transformation Figure 3.12 summarizes the geometric transformations described up to now starting from the Euclidean one (with 3 degrees of freedom - 3 do f ), of similarity (4 do f ), of affinity (6 do f ) and the homography transformation (8 do f ) that we describe in this paragraph. We rewrite the matrix formula of the generalized affine transformation into homogeneous coordinates that establishes a correspondence between points of different planes. ⎤⎡ ⎤ ⎡ ⎤ ⎡ a11 a12 tx x x ⎣ y ⎦ = ⎣ a21 a22 t y ⎦ ⎣ y ⎦ (3.33) 1 0 0 1 1 It is called homography transformation [5], also known as projective transformation or collineation, an invertible projectivity h : P2 −→ P2 , that is, a correspondence that associates to the point P(x, y, 1) of the plane 1 the point P (x , y , 1) of the plane 2 (see Fig. 3.13), such that ρ · x = h(x)
(3.34)
⎤⎡ ⎤ ⎤ ⎡ x x h 11 h 12 h 13 ρ · x = ρ ⎣ y ⎦ ∼ = ⎣ h 21 h 22 h 23 ⎦ ⎣ y ⎦ = Hx 1 h 31 h 32 h 33 1
(3.35)
or in matrix form
⎡
168
3 Geometric Transformations
Geometric transformation
Invariance R
Area Length Angle
Euclidean (or Rigid)
t 1
0
tx Angle Length ratio
Similarity
0
sr
ty
0
1 tx
Parallelism Ratio of Areas Length ratio
(6 dof)
ty 0
Homography (or Projective) (8 dof)
0
1
Collinearity Angle Ratio (Cross-Ratios)
Fig. 3.12 Summary of geometric transformations, reported in increasing hierarchical order in relation to degrees of freedom, invariant properties and transformation matrix Y
Perspective transformation through homography
p’ (x
,y
)
p(x,y)
0 y’
x’
y
Z x
X
Fig. 3.13 Perspective plans of the same point of view related through homography transformation
where the coordinates of the points are expressed in homogeneous coordinates, H is an invertible matrix 3 × 3 defined less than a scale factor ρ = 0. The presence of the symbol “∼ =” has the following explanation: two homographic matrices that differ only in the scale are equivalent [5]. In this context we highlight some properties of projective geometry (a) Two lines always meet in a single point (to differences in what happens for the similar transformation in which if two straight lines are parallel they always remain such after the projection). (b) 3 aligned points are always projected as aligned (from which descend the name collineation or homography).
3.6 Homography Transformation
169
(c) Points and lines of the plan 1 always correspond to the other projection plane 2 . (d) A projectivity is a simple linear transformation of rays. (e) A homography projection does not preserve either distances or ratios of distances. However, the cross-ratio, which is a ratio of ratios of distances, is preserved. From this property is defined the concept of a very useful of cross-ratio in the projective geometry. Given 4 collinear points P1 , P2 , P3 , P4 in P2 , given the distance Euclidean i j between two points Pi and P j , the crossed ratio C R is defined as follows: 13 24 C R(P1 , P2 ; P3 , P4 ) = 14 23 The ratio of these ratios is invariant to the homography transformations. From the matrix form (3.34) the homographic equations can be obtained as follows: ⎧ ⎨ ρx = h 11 x + h 12 y + h 13 ρy = h 21 x + h 22 y + h 23 (3.36) ⎩ ρ = h 31 x + h 32 y + h 33 from which it is possible to derive the homographic equations in Cartesian coordinates (x, y) for points in plane 1 and (x , y ) for points in plane 2 . The homographic equations in Cartesian coordinates derived from (3.36) are ⎧ h 11 x + h 12 y + h 13 ⎪ ⎪ ⎨x = h x +h y +1 31 32 (3.37) x + h y + h 23 h ⎪ 21 22 ⎪ ⎩y = h 31 x + h 32 y + 1 It is pointed out that in Cartesian coordinates the homography transformation is no longer linear. Moreover, we already know that any multiple of the parameters of the homography matrix H achieves the same transformation because any multiple of the input or output coordinates satisfies the homographic equations (3.36). This implies that the degrees of freedom of homography are not 9, as predicted by a generalized affine transformation, but is reduced to 8 by imposing a further constraint on the elements of the homography matrix. Normally the coefficient h 33 = 1 is set. We can also consider the homography matrix as the highest hierarchical level matrix from which we can derive all the other transformations (see Fig. 3.12) described so far. It can be observed, for example, that if h 31 = h 32 = 0 the homographic equations (3.37) are reduced to the Eqs. (3.21) of the generalized affine transformation. By definition, the inverse homogeneous matrix is still a homography transformation. In other words, determined the homography matrix that projects the points from the 1 plane to the 2 plane, calculated the inverse matrix H−1 it is possible to project points from the 2 plane to 1 plan. Let’s now look at how to calculate the homography matrix and its inverse. Starting from the correspondence of points between two distinct planes, a simple linear transformation will be used to determine
170
3 Geometric Transformations
Removing perspective distortions with homography 1
2
1 2
3 4 Reference image
3
4 Distorted perspective image
Result of the Homographic transformation with 4 points
Fig. 3.14 Result of the homography projection to remove the perspective distortion of Fig. 3.13 known the 4 corresponding points in the two perspective planes
the homography matrix known in the literature as DLT-Direct Linear Transformation [5]. Considering the 8 coefficients h i j incognito in the homographic equations (3.37) it would be necessary to know the Cartesian coordinates of at least 4 corresponding points, of which at least 3 are not collinear, in the two planes to define the homography transformation (see Fig. 3.14). In fact, for each correspondence of points (xi , yi ), (xi , yi ), i = 1.4, developing the homographic equations (3.37) we have (h 31 xi + h 32 yi + 1)xi = h 11 xi + h 12 yi + h 13 (3.38) (h 31 xi + h 32 yi + 1)yi = h 21 xi + h 22 yi + h 23 that it is useful to put in the following form: xi = h 11 xi + h 12 yi + h 13 − h 31 xi xi − h 32 xi yi yi = h 21 xi + h 22 yi + h 23 − h 31 yi xi − h 32 yi y
(3.39)
In this way we have a system of 8 equations in 8 unknowns and the homography matrix is defined less than a scale factor. A solution to the unknown scale factor is to define a homogeneous matrix. Having placed h 33 = 1 the eight unknown coefficients h i j can be calculated for n > 4 points by setting the solution as an over-determined equations system through the strategy of least squares and pseudo-inverse matrix for find an optimal solution. The Eqs. (3.39) can be reorganized in matrix form considering n pairs of points, obtaining the following relation: ⎤ ⎡h ⎤ ⎡ 11 x1 y1 1 0 0 0 −x1 · x1 −x1 · y1 ⎥ ⎡ ⎤ h ⎢ 0 0 0 x1 y1 1 −y · x1 −y · y1 ⎥ ⎢ 12 x1 ⎥ 1 1 ⎥ ⎢ ⎢ ⎢ ⎢ ⎥ ⎢ h 13 ⎥ ⎢ ⎥ ⎢ y1 ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ h 21 ⎥ ⎢ .. .. .. .. .. .. .. .. .. ⎥ ⎥=⎢ · (3.40) ⎢ ⎥ ⎢. . . . . . . . . ⎥ h 22 ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢ ⎣ ⎦ xn ⎥ ⎢ h 23 ⎥ ⎢ ⎥ ⎣xn yn 1 0 0 0 −xn · xn −xn · yn ⎦ ⎢ yn ⎣ h 31 ⎦ 0 0 0 xn yn 1 −yn · xn −yn · yn h 32
3.6 Homography Transformation
171
or in a compact matrix form: A·h=b
(3.41)
where we have indicated the matrix to the left of the (3.40) with A of size 2n × 8, with h the vector of the coefficients of dimension 8, and with b the output coordinate vector with dimensions 2n × 1. An overdetermined system can be solved using the pseudo-inverse matrix method proposed by Moore–Penrose [1]. Multiplying from the left both members of the (3.41) for At we get At A · h = At b
(3.42)
It is observed that At A is a square matrix, therefore, invertible. Therefore, if we multiply again from the left, both members of the last equation for the product (At A)−1 we have (3.43) (At A)−1 (At A) · h = (At A)−1 At b from which we note that the product of the two expressions in brackets of the first member corresponds to the identity matrix and so the previous equation represents the solution for the unknown vector h obtaining h = (At A)−1 At b
(3.44)
Direct application of (3.44) may not provide satisfactory results. In fact, the inverse function can be singular or very close to it (determinant close to zero). An alternative method is that based on eigenvalues and eigenvectors that solves the homogeneous system obtaining the h matrix less than a multiplicative factor. For this reason we use the method known as Singular Value Decomposition (SVD) described in Sect. 2.11. The SVD theorem states that given a A matrix of m × n, with m greater than or equal to n, it can be decomposed as follows: t Am×n = Um×n · Wn×n · Vn×n
(3.45)
= I) of dimensions m × n, where U is an array with orthonormal columns while V is an orthonormal matrix (Vt V = I) of dimensions m × m, and W is a diagonal matrix of dimensions n × n consisting of elements greater than or equal to zero called singular values. Returning to the calculation of the homography matrix the (3.38) is useful to write it in the following form: A·h=0 (3.46) (Ut U
obtaining ⎡
x1 ⎢0 ⎢ ⎢ ⎢ ⎢ .. ⎢. ⎢ ⎢ ⎢ ⎣xn 0
y1 1 0 0 0 0 0 x1 y1 1 .. .. .. .. .. . . . . . yn 1 0 0 0 0 0 xn yn 1
⎡
⎤ h 11 · · ⎢ h 12 ⎥ ⎢ ⎥ · · ⎢ h 13 ⎥ ⎥ ⎥ ⎢ h 21 ⎥ ⎥ ⎢ ⎢ ⎥ ⎥ .. .. .. · ⎢ h 22 ⎥ . . . ⎥ ⎥=0 ⎥ ⎢ ⎥ h ⎥ ⎢ 23 ⎥ ⎥ ⎢ ⎢ ⎥ −xn · xn −xn · yn −xn ⎦ ⎢ h 31 ⎥ ⎣ h 32 ⎦ −yn · xn −yn · yn −yn h 33 −x1 −y1
x1 −x1 x1 −y1
⎤
y1 −x1 y1 −y1 ⎥ ⎥
(3.47)
172
1
3 Geometric Transformations
3
2
Homographic projection by assigning the coordinates of the 4 corresponding points
2
4
3
1
Original image
4
Fig. 3.15 Result of a complex geometric transformation of an image by means of a homography projection, arbitrarily assigning the coordinates of 4 vertices of the homologous points
For a number of points n ≥ 4, the matrix A has dimensions 2n × 9 (all the 9 coefficients of the homography are considered) and ||Ah|| is minimized by the SVD procedure. With the SVD solution the coefficients are obtained from the following formula: n U(i) · b V(i) (3.48) h= W(i,i) i=1
where U(i) , i = 1, . . . , n indicate the columns of U; V(i) i = 1, . . . , n indicate the columns of V and Wi,i denote the diagonal elements of W corresponding to the singular values of the A matrix as indicated in the SVD decomposition equation.
3.6.1 Applications of the Homography Transformation There are several applications of homography, especially in the fields of computational vision, robotics, and image processing. We list some of the homography applications (not feasible through geometric transformations of a lower hierarchical level): (a) Transformations for the removal of perspective distortions. Figure 3.15 shows the results of the homography transformation that removes the perspective distortions of the 1 plane projected in the 2 plane as shown in Fig. 3.13. The 1 plane can be considered as the focal plane of an image acquisition system with a horizontal perspective error whose distortion can be corrected by homography projection on the 2 plane perpendicular to the Z axis coinciding with the optical axis and perspective observation. The homography transformation is sufficient given 4 corresponding fiducial points, whose coordinates are known in the two perspective planes. The homography transformation (i.e., the homography matrix H) is independent of the reference systems used in the two planes. The only imposed condition is the co-planarity of the fiducial points in the respective plans and the noncollinearity of at least 3 points of the 4.
3.6 Homography Transformation
173
(b) Image warping. Particular and complex geometric transformations of the images (see example in Fig. 3.15) leaving the most unaltered color or gray level information. (c) Image morphing. Gradual deformation of one image on the other. The easiest way to transform one image into another is based on their cross-dissolving. The idea is to obtain a sequence of intermediate images, obtained through continuous deformations (image warping) starting from the original image to the destination image. In this method, the value (color or gray level) of each pixel is interpolated over time from the first image value to the corresponding next image value. (d) Image registration. Geometric and radiometric registration of images acquired at different times and from slightly different points of view (remote sensing, medicine, ...). (e) Image mosaicing. It concerns the alignment of images that represent each a partial view of the scene of which fiducial points are known both in the partial image and in the global image to be composed (i.e., the mosaic). For example, aligning images acquired by aircraft or satellite on a cartographic map on which the fiducial points are identified with which to align the partial images acquired of a territory. (f) Panoramic mosaic. Compared to the mosaic described in the previous point, in this case, the homographies occur between images acquired from slightly different points of view with at least a 30% overlap of the same scene observed to ensure the presence of corresponding fiducial points in both images to be aligned. In this case, the fiducial points do not have a priori knowledge, but must be calculated automatically with ad hoc algorithms to ensure a robust alignment. The interpolation phase is fundamental for the image mosaicing, which must make the appearance of the scene uniform, particularly in the overlapping areas (image overlays) of the scene, observed from slightly different positions. These topics will be discussed in the following paragraphs (interpolation and automatic search for fiducial points). (g) Auto positioning of a vehicle through the homography projection of world points and image points, known as the intrinsic parameters of the image acquisition system (e.g., automatic tracking horizontal road lines for automatic guidance). (h) Camera calibration using the homography projections, for example, using the DLT-Direct Linear Transformation approach. (i) 3D reconstruction in the multi-view context based on homography projections. Coplanar points of the world are projected homographically on different image planes (direct and inverse perspective) deriving from different points of view. This makes it possible to project points of the world between cameras placed in different positions for the reconstruction of 3D points through triangulation. (j) Image corrections from geometric distortions introduced by optical systems or remote sensing imaging systems from mobile platforms (satellite or aircraft).
174
3 Geometric Transformations
3.7 Perspective Transformation In Sects. 5.6 and 5.7 Vol. I have described, respectively, the physical and geometrical model of image formation of a bright spot, generated by a lens. Let’s now face, from an algebraic-geometric point of view, the process of forming the image of a set of points that constitute a 3D object. The perspective transformation realizes the projection of a 3D object forming its image with the observation center placed at a finite distance. A perspective transformation is a central projection to distinguish it from a parallel or orthographic projection. The simplest central geometric model of image formation that relates points of the world with their position in the image plane is given by the projective model called Pinhole Camera already described in Sect. 5.6 Vol. I. Since an image is a 2D representation of a three-dimensional world, it is appropriate to change the perspective point of view of the scene to have different representations of a 3D object. The variation of the point of view of the acquired scene generates a perspective transformation. Suppose we find ourselves in the one-dimensional case for simplicity (see Fig. 3.16). An observed point B is projected in the image plane in E. The center of the coordinate system is represented by the point O which is at distance f (focal) from the image plane. All the rays coming from the object pass from the same point O which is the focal point (pinhole). The distance from O to B is measured along the optical axis OA which is z c . The projection of B, that is, the point E, is at a distance xi from the optical axis (distance OF). Using the similitude of the E F O and O AB triangles we get xi xc = f zc
=⇒
xi =
f xc . zc
(3.49)
It is useful to think of the frontal image plane, rather than the real one so that the image is not reversed in direction (see Fig. 3.16). The frontal image plane is abstract and is always f from the origin O. The objects on this plane maintain the same
Fig. 3.16 1D perspective model
3.7 Perspective Transformation
175 Zc=Zw
Fig. 3.17 2D perspective model
Yi
yc
xc P=(xw,yw,zw)
Yc=Yw
xi
Xi yi P’=(xi,yi,f)
f Xc=Xw
proportions and are used for convenience because the direction of orientation is in agreement with that of the world. It is emphasized that the perspective transformation is not linear, it is not affine, and it is not reciprocal (if an object is projected from the same point of view on planes at different distances from it, we have similar images at different scales). A point P of coordinates (xw , yw , z w ) in 3D space (see Fig. 3.17) is projected in the image plane in P = (xi , yi , f ). Taking advantage of the similarity of the triangles we get xi xc =⇒ xi = zfc xc f = zc (3.50) yi yc f = =⇒ y = y i f zc zc c Note that all points in the 3D world that are on the same projection rays of P will be projected on the same location (xi , yi ) in the image plane. This explains the nonreversibility of the transformation, i.e., the z c component cannot be determined. The relationship between the two coordinate systems can be better explicated with the help of homogeneous coordinates. Let v be the column vector ⎡ ⎤ xc v = ⎣ yc ⎦ (3.51) zc that contains the coordinates of the object point. The homogeneous coordinate vector associated with it is given by ⎡ ⎤ sxc ⎢ syc ⎥ ⎥ v˜ = ⎢ (3.52) ⎣ sz c ⎦ s where s is a constant indicating the scale, that is, the distance from the image plane. The Cartesian coordinates of the P point can be obtained from the homogeneous coordinates by dividing the first three coordinates by the s scale constant. Suppose we have a perspective transformation matrix centered in O, given by ⎡ ⎤ 10 0 0 ⎢0 1 0 0⎥ ⎥ M=⎢ (3.53) ⎣0 0 1 0⎦ 0 0 −1/ f 1
176
3 Geometric Transformations
which transforms the homogeneous coordinate vector v˜ into another vector w, ˜ given by ⎡ ⎤⎡ ⎤ ⎡ ⎤ 10 0 0 sX sX ⎢ 0 1 0 0 ⎥ ⎢ sY ⎥ ⎢ ⎥ sY ⎥⎢ ⎥ ⎢ ⎥. ˜ = M˜v = ⎢ w (3.54) ⎣0 0 1 0⎦⎣ sZ ⎦ = ⎣ ⎦ sZ 0 0 −1/ f 1 s s − sZ/f The corresponding coordinates in the image plane are obtained by normalizing w ˜ as follows: ⎤ ⎡ ⎢ w=⎣
fX f −Z fY f −Z fZ f −Z
⎥ ⎦.
(3.55)
in which the first two elements correspond to the relation of the pinhole model (Eqs. 3.50). Similarly, a point in the image plane can be projected in 3D space. In this case, it will be necessary to define an inverse matrix of M and to project backward the image point represented by w. ˜ The inverse perspective transform relative to the (3.54) is given by ˜ (3.56) v˜ = M−1 w ⎡
where M−1
1 ⎢0 =⎢ ⎣0 0
0 1 0 0
0 0 1 1/ f
⎤ 0 0⎥ ⎥ 0⎦ 1
(3.57)
and considering the P point in the image plane with homogeneous coordinates: ˜ i = (sxi , syi , sz i , s) w where z i is undefined. The inverse perspective transformation calculated with the (3.56) in homogeneous coordinates would result ⎤ ⎡ sxi ⎢ syi ⎥ ⎥ (3.58) v˜ = ⎢ ⎣ sz i ⎦ sz i s+ f with the corresponding Cartesian coordinates given by ⎡ f xi ⎤ Xc = f −z ⎢ f yi i ⎥ Yc = v˜ = ⎣ f −zi ⎦ =⇒ f zi Zc = f −z i
f xi f −z i f yi f −z i f zi f −z i
(3.59)
where (X c , Yc , Z c ) are the coordinates of the projection of the image point P reprojected in 3D space. Solving the Eqs. (3.59) with respect to the z i coordinate are obtained X c = xfi ( f − Z c ) (3.60) Yc = yfi ( f − Z c )
3.7 Perspective Transformation
177
from which it is evident that the perspective transform is not invertible since the point P in the image plane corresponds to all the points of the straight line passing through the points P and the projection center O just as the coordinate z is arbitrary. It follows that in order to uniquely determine a point on the straight line one of the 3 coordinates must be known. Normally we tend to determine Z c with other methods of computational vision (stereovision, motion analysis, optical flow, ...).
3.8 Geometric Transformations for Image Registration Of the transformation functions Tx (x, y) and Ty (x, y), their analytic representation is not always known. In many applications, the transformation function that models a particular distortion introduced by the acquisition system cannot always be defined. This normally occurs during the acquisition of images from mobile platforms (e.g., from satellite, airplane, vehicle or robot). In these contexts, although knowing the kinematics of the mobile platform, its attitude may vary even slightly unexpectedly, introducing unpredictable distortions in the acquired image. These deformations, not easily described analytically, are normally nonlinear and cannot be corrected with first-order geometric transformations described in the previous paragraphs. One way to solve the problem is to identify certain fiducial points in the scene (also called control points) that are clearly visible in the input image (the one with distortions) and in the reference image (the same geometrically correct image) in order to locate them in both images with good accuracy. Previously, in fact, we have already used for affine and homography transformation, respectively, three points and four fiducial points, (xi , yi ) in the input image and (xi , yi ) in the output image, for calculate the 6 and 8 coefficients of the respective transformation matrices (Figs. 3.9 and 3.14). The limits of the geometric transformations of the first order (affine and homographic) consist in being able to eliminate only the errors of rotation and change of scale (deformations caused by the variation of the height and the orientation of the mobile platform in the different multi-temporal acquisitions of the same zone) while the more complex deformations caused by the change of attitude of the mobile platform can be solved or attenuated with nonlinear transformations. In this context, it is necessary to use different fiducial points well distributed in the image. The determination of geometric transformation functions, to align the distorted image with respect to the reference image (or basic, error-free), is known as the problem of image registration. As a problem, the search for fiducial points between the two images emerges. In fact, the search for fiducial points becomes important because their accuracy depends on the quality of feasible geometric registration. These points can be defined directly by the user if you have precise references easily identifiable in the two images. For different applications (automatic navigation, mosaics, stereo vision, . . .) it is necessary to determine them automatically considering also the high number of fiducial points requested. Various algorithms have been developed in the literature (see
178
3 Geometric Transformations
Chap. 6 Detectors and Descriptors of Interest Points) that automatically determine fiducial points. The most known are the so-called feature-based methods of Moravec and Harris, and the most performing is the SIFT (scale invariant feature transform) algorithm, particularly in the context of registration between images with nonlinear distortions and strong variations of lighting. For limited deformations, it may be sufficient to determine the checkpoints with methods based on the image similitude of windows of images (between reference images and those searched in the image to be registered, normally of size 32× 32 or higher) evaluating their correlation.
3.9 Nonlinear Geometric Transformations The geometric transformations of the first order produce simple effects on the images summarized in Fig. 3.12. The most complex transformation is the homographic one that transforms a quadrilateral into another quadrilateral modifying dimensions and angles. Figure 3.18a schematizes how a transformation of the first order of a higher hierarchical level can correct a geometric distortion. In the figure, the transformation is modeled considering the effect of four elastic bands, which applied to the four control points in the deformed image produce the effect of restoring the geometry of the original image. Each elastic ideally exercises a different force, appropriately, to modify the image, until the geometric conditions of the correct image are restored. The deformation functions idealized by the four elastics are made from the transformation matrix H calculated on the basis of at least 4 fiducial points. For a linear transformation, defined as the transformation matrix, the relative coefficients remain constant during the transformation process. In a nonlinear transformation, we can imagine that the 4 elastic bands can modify their effect at various points in the spatial domain, generating complex distortions. From the mathematical point of view a nonlinear deformation is realized using as geometrical transformation functions Tx (x, y) and Ty (x, y) polynomials of order N . This entails a considerable increase in the calculation time. In Fig. 3.18b is shown the result of a nonlinear trans-
1’
2’ 1 2
4 3 4’ Linear deformation
a)
3’ Non-Linear deformation
b)
Barrel deformation
c)
Pincushion deformation
d)
Fig. 3.18 Linear and nonlinear image deformation: a linear deformation solvable with homography transformation; b complex nonlinear deformation; c radial deformation to barrel e d radial deformation to pincushion
3.9 Nonlinear Geometric Transformations
179
formation applied to the test Houses image that distorts it in the vortex shape with respect to the center with the following equations: x = Tx (x, y) = (x − xc )cos(θ ) + (y − yc )sin(θ ) + xc y = Ty (x, y) = −(y − yc )sin(θ ) + (y − yc )cos(θ ) + yc
(3.61)
where (xc , yc ) is the center of the image, θ = πr/k, k > 1, and with r = (x − xc )2 + (y − yc )2 In physical reality, in relation to the application context, an image (the ideal image is represented with a regular grid in the figure) can undergo irregular deformations in different areas of the image, generally caused by the acquisition system (for example, irregular camera movement). In particular, the combined effect of the acquisition system (optical component and sensor) and the dynamics of the mobile platform normally generates nonlinear deformations (typical curves and rotations) that are generally not analytically modeled for the appropriate correction. For example, the acquisition of images from aircraft in unstable situations may produce distortions similar to those of Fig. 3.18b. The intrinsic deformations of the optical-sensor system can be defined and corrected in the calibration phase of the acquisition system. In Sect. 4.5 Vol. I have been indicated the optical aberrations that produce radial deformations (they consist of moving the point along the radius that connects it to the center of the image) and tangential (perpendicular to the direction of distortion and due to the misalignment of the lens centers). Normally, the radial deformations (see Fig. 3.18c, d), due to the curvature of the lens, are the dominant ones with respect to the tangential distortions due to the decentralization of the optical centers. These nonlinear optical deformations can be modeled [2,11] when calibrating the optical system. The nonlinear deformations of the images caused by the dynamics of the acquisition system are not modelable and the approach of the image registration deformed with respect to a reference image with the use of fiducial points is appropriate. Nonlinear deformations with the presence of curvatures in the image, when not described analytically, can be corrected with nonlinear geometric transformations starting from the following bilinear equations: x = Tx (x, y) = a0 x + a1 y + a2 x y + a3 y = Ty (x, y) = b0 x + b1 y + b2 x y + b3
(3.62)
In this first level of nonlinear transformation, the coefficients are 8, and therefore, at least 4 control points are required (xi , yi ) in the input image and 4 points (xi , yi ) in the output image, to generate a system of 8 equations in 8 unknowns. Several quaterne of control points can be selected if available. Even more complex distortions (as in the case of remote sensing applications) can be attenuated using describable transformation functions (in the absence of functions that directly describe the physical phenomenon) with polynomials of equal or greater than two degrees. x = a0 x + a1 y + a2 x y + a3 x 2 + a4 y 2 + · · · + an y = b0 x + b1 y + b2 x y + b3 x 2 + b4 y 2 + · · · + bn
180
3 Geometric Transformations
which in vector form results
⎤ x ⎢ y ⎥ ⎢ ⎥ ⎥ ⎢ ⎢ xy ⎥ x a0 a1 a2 a3 a4 . . . an ⎢ x 2 ⎥ = ⎢ ⎥ y b0 b1 b2 b3 b4 . . . bn ⎢ y 2 ⎥ ⎢ ⎥ ⎢ .. ⎥ ⎣ . ⎦ ⎡
(3.63)
1 The number of control points (xi , yi ) and (xi , yi ) to choose sufficiently larger than the number of ai and bi coefficients to
in this case, must be be estimated to minimize the mean square error ξ between the control point coordinates of the deformed image and the desired one. Assuming we have M points (x1 , y1 ), . . . , (x M , y M ) of the original image, also called control points and those of the image output , y ). The mean square error is, therefore, estimated as follows: (x1 , y1 ), . . . , (x M M ξ = (x − Pa)t (x − Pa) + (y − Pb)t (y − Pb) where
a = [a0 , a1 , . . . , an ] b = [b0 , b1 , . . . , bn ] x = x1 , x2 , . . . , x M y = y1 , y2 , . . . , yn ⎤ ⎡ x1 y1 x12 x1 y1 y11 1 ⎢ x2 y2 x 2 x2 y2 y 2 1 ⎥ 2 2 ⎥ ⎢ P=⎢ . . . . . .⎥ ⎣ .. .. .. .. .. .. ⎦ xn yn xn2 xn yn yn2 1
(3.64)
(3.65)
in which you will have minimal error if
where
a = P † x b = P † y
(3.66)
−1 P† = Pt P P
(3.67)
which represents the pseudo-inverse of P that can be found by numerical methods (e.g., singular value decomposition SVD). Figure 3.19 shows a polynomial deformation of the second order with the starting and destination points shown with different patterns (asterisk, square, circle, etc.). Figure 3.20 shows the results for the correction of a more complex image deformation and the relative registration with respect to the reference image. The nonlinear deformation introduced simulates the distortion effects caused by the acquisition from the mobile platform. Compared to the reference image, the image to be corrected is rotated by 5◦ , reduced to scale, with a vertical perspective inclination of 10◦ and with radial barrel distortion. Given the complexity of the deformation, 79 fiducial points were used to calculate the coefficients of the polynomial transformation of order 3 that would require at least 10 fiducial points.
3.10 Geometric Transformation and Resampling
181
Fig. 3.19 Polynomial deformation of the second order. The various colored symbols indicate the starting and destination points shown with different patterns (asterisk, square, circle, etc.) Image registration
Reference image
Deformed image
Correct and registered image
Fig.3.20 Results of the polynomial geometric transformation of the third order to correct and record the deformed image with respect to the reference image. For the calculation of the transformation function, 79 fiducial points were used to correct the simulated nonlinear deformations of rotation, scale change, vertical perspective inclination, and radial barrel distortion
3.10 Geometric Transformation and Resampling A geometric transformation, whether linear or nonlinear, changes the position of the pixels with the possibility of decreasing the number of the same (in the case of image reduction producing a subsampling) or increasing them (in the case of image enlargement producing a oversampling). This operation essentially changes the geometric and radiometric characteristics with which the image was originally acquired. In analogy to a hardware acquisition device that normally produces a discrete image with constant resolution, a geometric transformation can be considered as a process of reacquiring an image with variable resolution through a software pseudodevice. The latter needs to interpolate the pixels of the input image after redefining the position of the pixels in the new output image. At the beginning of the chapter, we have already anticipated the need to interpolate the image that then leads to the concept of resampling which is the process of regaining a discrete image from a set of pixels geometrically transformed into another discrete set of pixel coordinates.
182
3 Geometric Transformations x
n
Digitization
Geometric transformation
y’ 0
0
0
fI(x,y)
+ f(x,y) y
age l Im sica Phy
m
Ori
gi
sa nal
mp
fR(x’,y’)
led
ima
ge
x’
Resampled image
fS(m,n)=f(mΔx,nΔy)
Fig. 3.21 Geometric transformation and the resampling problem. The light energy of the scene through the optical system produces the analogical physical image f (x, y). The digitizing system (represented by Vidicon) produces the sampled image f s (m, n) which can be resampled producing the digital image f R (x , y )
In other words, conceptually the resampling can be thought of as constituted by two processes, the first is the process of re-capture of the continuous image through the interpolation process and the second is the sampling process of the interpolated continuous image. We have described this latter process in Sect. 5.10 Vol. I, in the context of image discretization obtained through an electronic scanning device. In analogy to Fig. 5.6 Vol. I can schematize it in Fig. 3.21 a sort of software scan that performs the above two processes of interpolation and resampling in order to reconstruct the image (apparently we can consider it reacquired) with the sampling characteristics of the original image (obtained electronically) although modified with the desired geometric transformation. In the figure, we observe the discrete image f S (m, n) that we assume corresponds to the sampled image derived electronically from the continuous image f (x, y). It is assumed that the digitization process has created a discrete image of samples on a regular grid, that is, f S (m, n) = f (xx, yy) where (x, y) is the uniform spacing of the original sampling. The goal of interpolation is to reproduce the original continuous image values f (x, y) at each arbitrary coordinate point (x, y). With resampling we want to estimate the pixel value in arbitrary points f (x, y) that are different from the original ones sampled at the coordinates (mx, ny). Reverse geometric transformations define the (x, y) input coordinates for interpolation. Generally, the interpolation problem is tractable as an ill-posed problem: we only have discrete f S (m, n) samples and we want to try to calculate the value of the original continuous function at undefined arbitrary points of the 2D real space (x, y ∈ R). In fact, there are endless possibilities to reconstruct the original continuous function f (x, y) starting from the discrete values available f S (m, n). Normally when a problem is ill-posed we try to find an optimal evaluated solution that in this case should tend to approximate f (x, y) starting from the discrete f S (m, n) through an interpolation process that as we will see it will be based on different methodologies.
3.10 Geometric Transformation and Resampling
183
The type of applied geometrical transformation will condition the interpolation methodology to be selected. A geometric transformation involving the subsampling of the image, reducing the number of pixels, can substantially reduce the spatial frequencies present in the image. This can generate artifacts in the transformed image for the aliasing phenomenon that replaces the spatial frequencies under sampled with spurious frequencies and then introduces distortions in the reconstructed image. From the sampling theory, it is known that the spatial structures of the image are not altered if the sampling step is twice that of the highest spatial frequency present in the image. To limit this problem, you can first apply a low-pass filter to the image to attenuate the high frequencies and then apply the geometric transformation. A geometric transformation that instead introduces an oversampling of the image, increasing the number of pixels, involves a more accurate interpolation to avoid the addition of artifacts (mosaic effect) in the image that are the counterpart of aliasing. As we will see, oversampling tends to amplify the high frequencies. The existing interpolation methodologies are different and depend on the level of accuracy to be achieved by using one or more pixels in the vicinity of the pixel to be interpolated. The interpolation process, that is, the process that converts the discrete image f S (m, n) into interpolated image continues f I (x, y), can be achieved through the convolution operation between the discrete image f S (mn) (input image) and the interpolation function (interpolation impulse response) h ∞ ∞ f S (m, n) · h(x − mx, y − ny) (3.68) f I (x, y) = m=−∞ n=−∞
where x, y, respectively, represent the resampling grid step1 along the coordinate axes x and y. Hardly the interpolated value f I (x, y) can coincide with the original value f (x, y), if not in correspondence with the samples to the coordinates (m, n), while in general we will try to approximate as much as possible to have f I (x, y) = f (x, y) at least in the range 0 ≤ x ≤ Mx and 0 ≤ y ≤ N y, where M × N indicates the size of the image to be interpolated. From the sampling theory (Sect. 5.10 Vol. I) it is known that the ideal interpolation function x sinπ y sinπ x y (3.69) h I (x, y) = y x π x π y provides the exact reconstruction of the image by interpolating the entire sync definition range. In real applications, simpler interpolation functions are used on finite intervals and not necessarily the reconstruction is done with convolution. An estimate of the reconstruction error can be evaluated by considering in the frequency domain how much the interpolation function used is different from the ideal one which is known to be the rectangle function. The resampling process will then be carried out by multiplying the continuous interpolated function f I (x, y) with the sampling function (that is the pulse train function) thus obtaining the discrete function f R (x, y) resampled to the new points geometrically transformed. 1 From
now on we will assume x = y = 1.
184
3 Geometric Transformations FI(u)
Δx
x
fR(mΔx)
fI(x)
F
fS(x)
Resampling
-1/Δx -um
0
um 1/Δx u
FS(u)=F(u)*III(uΔx)
-um
0
um
fR(mΔx)=fI(x)III(xΔx)/Δx
FI(u)=FS(u)II(u/2ul)
III(xΔx)/Δx
Δx
III(uΔx)
f(x,y)
1/Δx
Δx
Fig. 3.22 Interpolation and resampling of images in the context of geometric transformations
3.10.1 Ideal Interpolation In Sect. 5.10 Vol. I, we have demonstrated with the sampling theorem that it is possible to completely reconstruct from the sampled image f S (m, n) the original continuous image f (x, y) without loss of information in the conditions of Nyquist. In this case, the complete reconstruction of the interpolated image f I (x, y), using the sampled image f S and the ideal interpolation function h I (x, y), in the one-dimensional case, can be expressed in terms of convolution as follows: f I (x) = f S (m) ∗ h I = f S (m) ∗
x 1 sinπ x x x π x
(3.70)
where x indicates the sampling spacing of the original image in the spatial domain. Remember that, if f (x) was the original function,2 in the hypothesis of limited bandwidth function, it must satisfy the following condition: F(u) = 0
if
|u| ≥ u m
where u m is the highest frequency in the image. We also know from the sampling theory that the calculation of the sampled function is obtained by multiplying the continuous function by the sampling function. In this context the function to be resampled is the continuous interpolated function f I (x). It follows that the calculation of the resampled function f R (x) can be modeled as the product between the continuous interpolated function f I (x) and the sampling function δ(x/x). The latter represents a train of pulses spaced at intervals x. The resampling function f I (x) leads to a loss of information in the x intervals of the sampled function f R (x) which keeps the information only at the points of sampling ix weighted by the values of the function f I (x), while it results null for all the points inside the sampling intervals. Figure 3.22 shows the limited bandwidth function f I (x) and the resampled function f R (x) in the spatial domain and in the frequency domain.
2 For
simplicity, we will use the interchangeable term of image and function in the 2D case.
3.10 Geometric Transformation and Resampling
185
For the convolution theorem (Sect. 9.11.3 Vol. I), when a function f (x) is multiplied by another function δ(x/x), it is possible to convolve the corresponding 1 δ(x · u), the latter cortransforms of Fourier which in this case are F(u) and x responding to the Fourier transform of the pulse train function which in this case is still a train of pulses spaced 1/x in the frequency domain. We point out that the latter assertion is motivated by the definition of the Shah pulse train function (also known as the Comb function) given by I I I (x) =
∞
δ(x − n)
(3.71)
n=−∞
where the infinite unitary impulses are spaced with unitary value along the x axis. For the similarity theorem the pulse has the following scaling property 1 δ(x) (3.72) |a| From the latter equation we can derive the scaled formula of the Shah function in the expansion or compression conditions of the pulse train given by δ(ax) =
I I I (ax) =
∞ n=−∞
δ(ax − n) =
∞ 1 n n = δ a x− δ x− a |a| n=−∞ a n=−∞ ∞
Setting the scale factor a = 1/x with x > 0 the (3.73) becomes ∞ x = x III δ(x − nx) x n=−∞
(3.73)
(3.74)
If both members of the latter equation are divided by x we get the Shah function consisting of a train of unitary pulses and uniformly spaced by x ∞ x /x = δ(x − nx) (3.75) III x n=−∞ It is convenient to have the Fourier transform of the Shah function (that is, the 3.71) which is as follows: F {I I I (x)} = I I I (u) (3.76) from which it turns out that also in the Fourier domain it is still a train of impulses. Consider the similarity theorem 1 u (3.77) F {δ(ax)} = F |a| a To derive the formula of the Fourier transform of the Shah function composed of a train of unitary pulses and spaced by x, considering the (3.75) we obtain F {I I I (x/x)/x} = I I I (xu) where I I I (xu) =
∞ 1 n δ u− x n=−∞ x
(3.78)
(3.79)
186
3 Geometric Transformations
thus obtaining the Shah function composed of a train of pulses of module 1/x spaced of 1/x. The process of convolution of a function f (x) with a pulse produces a copy of the function itself (as shown in Fig. 3.22). Thus the sampled version f S (x) of a continuous function f (x) results f S (x) =
∞
f (x)δ(x − nx)
n=−∞
= f (x)
∞
= f (x) · I I I
(3.80)
δ(x − nx)
n=−∞
x /x x
where the last step is motivated by the (3.75) that is the pulse train function. The initial question is to retrieve the original function f I (x) from the sampled function f S (x). In reality, it will be more convenient to retrieve this function via FI (u) from the spectrum of the sampled function FS (u). Applying the Fourier transform and the convolution theorem to the first and last member of the last Eq. (3.80) we obtain the following: x /x} (3.81) F { f S (x)} = F { f I (x)} ∗ F {I I I x Indicating with capital letters the Fourier transform, with reference to the (3.73) with the scale factor a = 1/x and to the (3.77), the previous equation becomes F S (u) = F I (u) ∗ III(x · u)
∞ 1 = F I (u) ∗ δ u− x n=−∞ ∞ 1 = F I (u) ∗ δ u − x n=−∞ ∞ 1 n = FI u − x n=−∞ x
n x n x
(3.82)
In essence, the latter Eq. (3.82) states that the convolution process of FI (u) with the pulse train I I I (xu) produces a replica of the F(u) spectrum for each 1/x sampled point, similar to the convolution in the spatial domain between function and impulse that reproduces a copy of the function itself. It is also observed that the spectrum FS (u) of the sampled function is periodic with a frequency of 1/x in the frequency domain. The idea of recovering the interpolated function FI (u) from the spectrum of FS (u) can be satisfied by operating in the frequency domain by eliminating the replicas of FI (u) leaving unchanged only the copy of the spectrum centered at the origin of the
3.10 Geometric Transformation and Resampling
187
u axis and multiplying the spectrum FS (u) with a rectangular function (u/2u l ) that satisfies the following relation: 1 u m ≤ ul ≤ (3.83) − um x where u m is the maximum frequency present in f (x) and u l is the width of the rectangular function . To retrieve F I (u) from the spectrum of the sampled function f S (x), we execute the following product for both the last members of the (3.82) with the rectangle function that, in fact, acts as a low-pass filter obtaining the following relation: u = F I (u) (3.84) F S (u) · 2u l The interpolated function f I (x) is recovered by applying the inverse Fourier transform to both members of the last equation as follows: u = F −1 {F I (u)} = f I (x) (3.85) F −1 F S (u) · 2u l and for the convolution theorem applied to the first member of this last equation, we obtain the interpolated function sought x sin π x f I (x) = f S (x) ∗ x (3.86) π x obtained through the convolution of the sampled function f S (x) with the interpolating function known as sinc function in the form sinc(x) = sin(x)/x. In summary, the complete reconstruction of the interpolated function f I (x) from the discrete input function f S (x) was possible under the following conditions: 1. f I (x) must be of band type limited to the maximum frequency u m , i.e., FI (u) = 0 if |u| ≥ u m . 2. The relation between the sampling interval x in the spatial domain and the maximum frequency u m (also called cutoff frequency) present in the function f I (x) must satisfy, for the sampling theorem, the following expression: 1 x ≤ (3.87) 2u m according to the condition of N yquist. The convolution process with the sampled function f S (x), essentially replicates the interpolating function sin(x)/x in each sample of f S . In Fig. 5.32 of Sect. 5.10 Vol. I was shown the effect of the convolution process to realize the reconstruction of the original continuous function (in this context the interpolated function f I ) through the replication of the sinc interpolation function convoluted with the discrete function f S or through the linear combination between the values of the discrete function f S and the sinc functions weighed by the same discrete values of f S . As shown in the figure, the interpolation process given by the (3.86) can be thought of as a superposition of translated and scaled sinc functions. In particular, a sinc
188
3 Geometric Transformations
function is translated into every discrete value of f S to be interpolated and scaled with the same value of f S , all sinc are added together. The zero crossing only occur in integer values except for the integer corresponding to the central lobe. This means that at the discrete value x = nx the only contribution of the sum will concern only the single sample to be interpolated x. In this case, interpolation is achieved through existing samples as it should be. For the image the ideal interpolation function is indicated with sinc2 and in discrete form the interpolated image is obtained by the following convolution: ∞ ∞ x − mx y − ny (3.88) f S (m, n)sinc2 f I (x, y) = , x y m=−∞ n=−∞ Although the ideal interpolating function sinc would provide the complete reconstruction of the original continuous function is not easily realized considering that in the spatial domain it has infinite extension and it would be impractical to implement the convolution of a discrete function with a function with infinite definition interval. Intuitively you could think of truncating the sinc in the spatial domain but you would have in the Fourier domain (see Fig. 3.23) no longer a perfect rectangle (the base of the rectangle would have distortions, i.e., curved lines enlarged at the central peak). Basically, it would no longer act as an ideal low-pass and stop-band filter (on the sides of the rectangle where the distortions are formed). In the following paragraphs, the alternative interpolating functions are described starting from the very simple linear ones up to the more complex ones that tend to those of the ideal filter even if they require more calculation. The interpolated image with a linear interpolating function h(x, y) is given by the following convolution: f I (x, y) =
∞
∞
f S (m, n)h(x − mx, y − ny)
(3.89)
m=−∞ n=−∞
where h(x, y) is defined by a bilinear combination of the f S (m, n) pixels in the vicinity of the (x, y) pixel being processed. The quality of the interpolating functions is assessed in the Fourier domain by analyzing the band-pass and stop-band components, respectively, the zones of the
Windowed Sinc
FT windowed sinc − Magnitudo FT
1.2
1
0.8
1
0
0.6
0.8
−1
0.4 0.2
0.6 0.4
0
0.2
-0.2
0
-0.4 −10
(a)
−5
0
x
5
10
H(u)
H(u)
sinc(x)
1
FT windowed sinc − Magnitudo FT log
−2 −3 −4 −5 −6
-0.2 −10
(b)
−5
0
u
5
10
−7 −10
−5
(c)
0
u
5
10
Fig. 3.23 Ideal interpolation function sinc: a representation of the sinc tr uncated in the spatial domain, b its spectrum no longer with a rectangular profile, and c the spectrum in logarithmic scale for highlight the distortions with respect to the ideal rectangular profile
3.10 Geometric Transformation and Resampling
189
interpolating filter that passes the low frequencies and the eliminated frequency region. In other words, the deviation occurs (through the value of the module) of the constant gain within the bandpass that will be small, as much as possible (to avoid the blurring effect in the image), and the verification of the lateral lobes of the stop-band region that they must be as small as possible to avoid the effect of aliasing (distortion in the image). In the following paragraphs, we will describe some interpolating functions based on piecewise polynomials starting from the zero order up to the order 3, and an example of a non-polynomial interpolator.
3.10.2 Zero-Order Interpolation (Nearest-Neighbor) The simplest interpolation function is that of zero order (that is, interpolating function with zero-order polynomial) also called interpolation with the nearest pixel (nearestneighbor interpolation). When case (b) occurs in Fig. 3.3 we proceed to choose the four pixels closest to the one that has the shortest distance, choosing as pixel P0 (x , y ) to assign to the output image. The interpolation function is given by x y 1 if − 21 ≤ x ≤ 21 ; − 21 ≤ y ≤ 21 = (3.90) h 0 (x, y) = , x y 0 otherwise where in the graph x = y = 1. Instead of executing the convolution, in this case, the interpolation is obtained as follows: f I (x, y) = f s [I N T (x + .5), I N T (y + .5)]
(3.91)
where I N T (z) indicates the largest integer less than the real number z. Considering Fig. 3.24 we can write f I (P0 ) = f S (Pk )
k : i = 1, 4
find
dk = min di i
(3.92)
where Pk indicates the pixel in the input image closer to P0 , f S indicates the values of the input pixels, and di indicates the distance of each of the 4 pixels closer to the pixel P0 to be interpolated. This type of interpolation is better to use when we do not want to change the radiometric information of the image much (for example, when you want to preserve the original noise) or to reduce the calculation time in the elaboration of large images considering the minimum required calculation time with
Fig. 3.24 Zero-order interpolation
di fI(P )
y
fS(Pi) 0
x
190
3 Geometric Transformations
Fig. 3.25 Nearest-Neighbor Interpolation: a Interpolation function; b Fourier transform
h0(x)
5
1
H(u)
4 0.8
3
0.6
2
0.4
1
0.2
0
0
−1
−3
x
−2 −1 -.5 0 .5 1
2
3
−2 −5
0
5
zero-order interpolation. It also highlights that enlarging the image sensibly with this type of interpolation, the mosaic effect appears in the transformed image. The Fourier transform of the rectangular impulse h 0 is equivalent to a sinc function whose gain in the bandpass component falls rapidly and immediately afterward has prominent lobes on the sides of the central lobe (see Fig. 3.25). This evident deviation from the ideal filter introduces blurring and aliasing in the interpolated image as we will see in detail in the paragraph that compares the various interpolation methods considered. A further effect occurs in terms of displacement in the position of the interpolated pixels with respect to the initial coordinates. The computational complexity for the zero-order interpolation is O(N 2 ) with the image size of N × N pixels.
3.10.3 Linear Interpolation of the First Order A linear interpolation function is also called the first order or bilinear in the twodimensional case (see Fig. 3.26). The interpolation function is triangular in the spatial domain (see Fig. 3.27) and is defined by the following kernel function: x y (1 − |x|)(1 − |y|), i f |x| < 2, |y| < 2; = h 1 (x, y) =
, (3.93) 0 other wise x y Bilinear interpolation in practice performs the weighted average of the four pixels close to the pixel of interest of coordinates (x, y) and the interpolating function is defined in the following polynomial form of the first order h 1 (x, y) = ax + by + cx y + d
(3.94)
where the 4 coefficients are known to calculate the 4 closest interpolating points which, without losing the generality, we can indicate them with f S (0, 0), f S (1, 0), f S (0, 1), f S (1, 1) These pixels are identified by the coordinates (x, y) in the input image with the inverse geometrical transform (see the case (b) of Fig. 3.3). Figure 3.26 graphically reproduces this situation with the pixel to be interpolated located at point (a, b) where a and b, considering the image discretized on a square grid, take values between 0 and 1 referenced with respect to the coordinate pixel (0, 0). The pixel to be interpolated (and resampled) is indicated with f R (a, b).
3.10 Geometric Transformation and Resampling
191
x’
y’
fS(x,y)
x
fS(0,0)
fI(0,b)
fI(a,0)
fS(1,0)
fR(a,b)
fS(0,1)
y
fI(a,1)
(x,y)
fI(1,b)
(0,b)
a
1-a
1fR(a,b)
(0,1)
(a,b)
(a,1)
(1,b)
(1,1)
1-b
y
(1,0)
x
b (1,0)
1-a
b
a
fI(a,0)
b
(0,0)
fS(1,1)
(a,0)
(0,0)
(0,1)
fI(a,1)
(1,1)
Fig. 3.26 Bilinear interpolation
For simplicity, instead of explaining the calculation of the 4 coefficients of the polynomial (3.94), we can derive the formula to directly calculate the interpolated value f R (a, b) from geometric considerations and with linear interpolation along the sides of the quadrilateral consisting of the 4 adjacent pixels indicated above. Referring to Fig. 3.26 you can perform the first linear interpolation along the x axis between the coordinate pixels (0, 0) and (1, 0) obtaining an intermediate value f I (a, 0) given by the following formula: f I (a, 0) = f S (0, 0)+a[ f S (1, 0)− f S (0, 0)] = (1−a)· f S (0, 0)+a· f S (1, 0) (3.95) Similarly, with interpolation on the opposite trait defined by the coordinates (0, 1) and (1, 1), another intermediate interpolation value f I (a, 1) is obtained, given by f I (a, 1) = f S (0, 1)+a[ f S (1, 1)− f S (0, 1)] = (1−a)· f S (0, 1)+a· f S (1, 1) (3.96) With these calculated intermediate interpolated values, we can interpolate between the points (a, 0) and (a, 1) and obtain the desired interpolation in the coordinate point (a, b) with the following relation: f R (a, b) = f I (a, 0) + b[ f I (a, 1) − f I (a, 0)]
(3.97)
Substituting in this last equation the (3.95) and (3.96), and reassembling, we obtain the following bilinear interpolation formula: f R (a, b) = (1 − a)(1 − b) f S (0, 0) + a(1 − b) f S (1, 0) + b(1 − a) f S (0, 1) + a · b · f S (1, 1) (3.98)
To generalize, consider an arbitrary pixel of coordinates (x', y') in the output image, which the inverse geometric transform maps to the position (x, y) in the input image, thus identifying the 4 neighboring pixels with coordinates (i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1), with

i = fix(x);   j = fix(y)
Fig. 3.27 Linear interpolation: a Triangular interpolation function; b its Fourier transform
where fix denotes the function that rounds a real number down to the nearest integer. In Fig. 3.26, the pixel of coordinates (0, 0), origin of the quadrilateral, corresponds to the generic pixel (i, j), while the pixel to be interpolated has coordinates (i + a, j + b). With these details, the bilinear interpolation formula for an arbitrary pixel is

f_R(x, y) = f_I(i + a, j + b) = (1 − a)(1 − b) f_S(i, j) + a(1 − b) f_S(i + 1, j) + b(1 − a) f_S(i, j + 1) + a·b·f_S(i + 1, j + 1)   (3.99)
From Fig. 3.27 we observe that the spectrum of the triangular interpolating function, although still far from the ideal one, is closer to that of the ideal rectangular filter than the nearest-neighbor spectrum. In fact, it has a low-pass profile better suited to preserving the passband frequencies. The lobes outside the central lobe are smaller but still present. Bilinear interpolation is computationally more expensive than zero-order interpolation. The computational complexity for bilinear interpolation is O(N²) for an image of N × N pixels.
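A minimal sketch of Eq. (3.99), not taken from the book: it assumes a grayscale image stored as a 2D NumPy array and computes the bilinearly interpolated value at a real-valued coordinate (x, y) of the input image (obtained, for instance, from the inverse geometric transform). Border handling by clamping is an assumption.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinear interpolation of a 2D image at the real coordinates (x, y),
    following Eq. (3.99): a weighted average of the 4 neighboring pixels."""
    h, w = img.shape
    i, j = int(np.floor(x)), int(np.floor(y))        # i = fix(x), j = fix(y)
    a, b = x - i, y - j                              # fractional offsets in [0, 1)
    i1, j1 = min(i + 1, w - 1), min(j + 1, h - 1)    # clamp at the image border
    return ((1 - a) * (1 - b) * img[j,  i ] +
            a       * (1 - b) * img[j,  i1] +
            (1 - a) * b       * img[j1, i ] +
            a       * b       * img[j1, i1])

if __name__ == "__main__":
    img = np.array([[0., 10.], [20., 30.]])
    print(bilinear_sample(img, 0.5, 0.5))   # 15.0, the average of the 4 pixels
```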
3.10.4 Biquadratic Interpolation

Biquadratic interpolation is the two-dimensional version of quadratic interpolation, which interpolates a surface on a uniformly spaced 2D grid. It can be achieved by interpolating first along the rows and then along the columns of the intermediate result. Figure 3.28 shows a 3 × 3 grid neighborhood with known intensity values f_S(i, j), i = 0, 1, 2; j = 0, 1, 2, used to estimate the pixel value at the coordinates (x, y). Therefore, with biquadratic interpolation, the intensity of the interpolated pixel is evaluated using the intensities of the 9 neighboring pixels. Quadratic interpolation is based on a second-order polynomial P_2(x), which in canonical form is given by

P_2(x) = a_0 + a_1 x + a_2 x^2   (3.100)
Operating initially by rows, with the 3 samples f_S(mΔx), m = 0, 1, 2 known for each row, the estimation of the 3 coefficients a_k, k = 0, 1, 2 reduces to solving the system of 3 equations in 3 unknowns
Fig. 3.28 Biquadratic interpolation performed with the 9 points identified near the point with coordinates (x, y) in the input image, computed from the inverse geometric transformation. Interpolation occurs first on the pixel triads along the x axis (3 intermediate interpolated values are computed) and then requires a single interpolation of these intermediate values along the y axis to obtain the interpolated value at (x, y)
obtained by imposing the 3 passage conditions through the considered pixels (method of undetermined coefficients):

\sum_{k=0}^{2} a_k x_m^k = f_S(m\Delta x), \quad m = 0, \ldots, 2   (3.101)
In analogy with the procedure used for bilinear interpolation, in this case too, given the intensity values f_S(m, n) of the nine neighbors (interpolation grid), we can compute the 3 interpolating polynomials related to the 3 rows of the 3 × 3 grid, as shown in the figure, thus obtaining the intermediate interpolation values f_I(x, m), m = 0, 1, 2, and then interpolate these values along the y coordinate to obtain the interpolated value at the desired point (x, y). The continuous interpolation process moves the interpolation grid along the rows and always interpolates on the 9 adjacent pixels. An implementation based on 1D convolution can be used with the interpolating kernel given by

h_2(x) = \begin{cases} 1 - 2x^2 & \text{if } |x| \le 1/2 \\ 2(|x| - 1)^2 & \text{if } 1/2 < |x| \le 1 \\ 0 & \text{otherwise} \end{cases}   (3.102)

with the definition interval (−1, 1), which guarantees the continuity and the non-negativity of the interpolating function. Figure 3.29 shows the 1D kernel of the quadratic interpolation described by (3.102), along with its Fourier transform. In geometric transformations applied to images, biquadratic interpolation proved to be not very effective because of the discontinuity of the parabolic interpolating function. For this reason, in image processing applications, in addition to
Fig. 3.29 Kernel function of quadratic interpolation: a Interpolation function in the spatial domain; b Fourier transform module; c Fourier transform with logarithmic representation of the module

Fig. 3.30 Bicubic interpolation performed with 16 points identified near the point with coordinates (x, y) in the input image, calculated from the inverse geometric transformation. Interpolation takes place by calculating 4 cubic polynomials p_k(x) on the pixel quads along the x axis (4 interpolated intermediate values are calculated) and then requires a single interpolation of these intermediate values along the y axis to have the value interpolated in (x, y)
bilinear interpolation, the bicubic and B-spline interpolations described in the following paragraphs are widely used, despite their greater computational complexity.
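A small sketch, not from the book, of the quadratic kernel of Eq. (3.102) and of a 1D resampling step; applying it along rows and then columns would reproduce the two-pass scheme of Fig. 3.28. Function names and the weight-normalization choice are assumptions.

```python
import numpy as np

def h2(x):
    """1D quadratic interpolation kernel of Eq. (3.102), with support (-1, 1)."""
    x = np.abs(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    out[x < 0.5] = 1.0 - 2.0 * x[x < 0.5] ** 2
    mid = (x >= 0.5) & (x <= 1.0)
    out[mid] = 2.0 * (x[mid] - 1.0) ** 2
    return out

def resample_1d(samples, t, kernel=h2, radius=1):
    """Interpolate uniformly spaced samples at the real position t by convolving
    with the given kernel (here the quadratic kernel h2)."""
    i0 = int(np.floor(t))
    ks = np.arange(i0 - radius, i0 + radius + 2)
    idx = np.clip(ks, 0, len(samples) - 1)
    w = kernel(t - ks)
    return float(np.dot(w, samples[idx]) / np.sum(w))  # normalize the weights

if __name__ == "__main__":
    row = np.array([0.0, 1.0, 4.0, 9.0])
    print(resample_1d(row, 1.5))   # midpoint between samples 1 and 4 -> 2.5
```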
3.10.5 Bicubic Interpolation

Bicubic interpolation is the two-dimensional version of cubic interpolation, which interpolates a surface on a uniformly spaced 2D grid. Bicubic interpolation [8] can be done by first interpolating along the rows and then along the columns of the intermediate results. Figure 3.30 shows a 4 × 4 grid neighborhood with known intensity values f_S(i, j), i = 0, ..., 3; j = 0, ..., 3, used to estimate the pixel value at the coordinates (x, y). Therefore, with bicubic interpolation, the
intensity of the interpolated pixel is evaluated using the intensities of the 16 neighboring pixels. Cubic interpolation is based on a third-order polynomial P_3(x), which in canonical form is given by

P_3(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3   (3.103)
Initially working by rows, with the 4 samples f_S(mΔx), m = 0, ..., 3 known for each row, the estimation of the 4 coefficients a_k, k = 0, ..., 3 reduces to solving the system of 4 equations in 4 unknowns obtained by imposing the 4 passage conditions through the considered pixels (method of undetermined coefficients):

\sum_{k=0}^{3} a_k x_m^k = f_S(m\Delta x), \quad m = 0, \ldots, 3   (3.104)
Writing this last equation in matrix form, with F_I = [f_S(m_0), ..., f_S(m_3)]^t, a = [a_0, ..., a_3]^t and the matrix V = {(1  x_k  x_k^2  x_k^3), k = 0, ..., 3}, called the Vandermonde matrix, the system to be solved is V · a = F_I. The interpolation polynomial can also be considered in the Lagrange form and solved by least-squares minimization to avoid possible instabilities of the Vandermonde matrix. In this case, since the pixels are distinct, the matrix is nonsingular and the solution is unique. As shown in the figure, to obtain the value of the interpolated pixel f_I(x_u, y_v), it is necessary to first compute the coefficients of the 4 cubic polynomials p_k(x), k = 0, ..., 3 along the rows (in the direction of the x axis). For each row, the cubic polynomial p_k is evaluated at x_u to obtain the interpolated values:

f_I(x_u, k) = p_k(x_u) = a_{0,k} + a_{1,k} x_u + a_{2,k} x_u^2 + a_{3,k} x_u^3,   k = 0, 1, 2, 3   (3.105)
where a_{i,k} denotes the ith coefficient of the row polynomial p_k(x). These interpolated values of p_k(x_u) constitute the intermediate results to be used afterwards to interpolate along the columns. Next, the coefficients of the single cubic polynomial q(y) = a_0 + a_1 y + a_2 y^2 + a_3 y^3 along the columns (in the direction of the y axis) are determined, and the corresponding interpolated value q(y_v) is evaluated. In this case, the system to be solved has the Vandermonde matrix V = {(1  y_k  y_k^2  y_k^3), k = 0, ..., 3}, with the 4 equations conditioned by the intermediate interpolation results p_k(x_u) as follows:

a_0 + a_1 y_k + a_2 y_k^2 + a_3 y_k^3 = p_k(x_u) = f_I(x_u, k),   k = 0, 1, 2, 3   (3.106)
where a_i are the coefficients, to be determined, of the cubic polynomial along the columns. Finally, the interpolated value of the pixel intensity f_I(x_u, y_v) is obtained as follows:

f_R(x_u, y_v) = q(y_v) = a_0 + a_1 y_v + a_2 y_v^2 + a_3 y_v^3   (3.107)

This bicubic interpolation process is applied to all the pixels to be resampled by translating the 4 × 4 interpolating grid for each interpolation point. In geometric transformations, the data to be interpolated are normally uniformly spaced and the interpolation process can be set up as a convolution characterized by an interpolation function h(x, y) operating on a bounded spatial domain. Rifman and McKinnon [9] proposed a convolution-based cubic interpolation algorithm that efficiently approximates the ideal sinc interpolation function and thus also has significant computational advantages. The kernel function used is composed of piecewise cubic polynomials, is defined on a spatial domain of (−2, 2) pixels, and produces a zero interpolated value outside this domain. Each interpolated pixel is centered in the definition interval of the interpolating function, which has the symmetry property typical of a spatially invariant filter, and has the following form:

h_3(x) = \begin{cases} (a + 2)|x|^3 - (a + 3)|x|^2 + 1 & \text{for } 0 \le |x| < 1 \\ a|x|^3 - 5a|x|^2 + 8a|x| - 4a & \text{for } 1 \le |x| < 2 \\ 0 & \text{for } 2 \le |x| \end{cases}   (3.108)

where the parameter a is controlled by the user to modify the profile of the interpolating function; in particular, it changes the depth of the negative external lobes. Choosing a between 0 and −3 gives a profile that resembles the sinc. The authors used a = −1 to match the slope of the sinc at x = 1; this choice amplifies the high frequencies, which is typical of an edge enhancement filter. This parameter should also be set considering the frequencies present in the image to be interpolated. Figure 3.31 shows the 1D kernel of the bicubic interpolator for different values of the parameter a in the spatial domain, and its Fourier
Fig. 3.31 Cubic Interpolation Kernel Function: a Interpolating functions in the spatial domain calculated with different values of the parameter a; b Fourier transform module for the kernel calculated with a = −0.5; c Fourier transform related to the figure b with logarithmic representation of the module
transform only for the kernel function calculated with a = −0.5. For a = −1, the one-dimensional cubic interpolation function is

h_4(x) = \begin{cases} 1 - 2|x|^2 + |x|^3 & \text{if } 0 \le |x| < 1 \\ 4 - 8|x| + 5|x|^2 - |x|^3 & \text{if } 1 < |x| < 2 \\ 0 & \text{otherwise} \end{cases}   (3.109)

For a = −0.5, we have the following kernel function:

h_5(x) = \begin{cases} 1 - \frac{5}{2}|x|^2 + \frac{3}{2}|x|^3 & \text{if } |x| \le 1 \\ 2 - 4|x| + \frac{5}{2}|x|^2 - \frac{1}{2}|x|^3 & \text{if } 1 < |x| < 2 \\ 0 & \text{otherwise} \end{cases}   (3.110)
Several other types of kernel functions based on bicubic functions are described in the literature. Bicubic interpolation eliminates the problem of smoothing generated by bilinear interpolation and is particularly used to enlarge the image. The computational complexity for cubic polynomial interpolation is O(N²) for an image of N × N pixels.
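A minimal sketch (not from the book) of the piecewise cubic kernel of Eq. (3.108), parameterized by a; a separable row/column application of this 1D routine gives the bicubic interpolation on the 4 × 4 neighborhood. Border replication and function names are assumptions.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Piecewise cubic convolution kernel of Eq. (3.108); a = -0.5 and a = -1
    give the kernels h5 and h4 of Eqs. (3.110) and (3.109), respectively."""
    x = np.abs(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    near = x < 1
    far = (x >= 1) & (x < 2)
    out[near] = (a + 2) * x[near] ** 3 - (a + 3) * x[near] ** 2 + 1
    out[far] = a * x[far] ** 3 - 5 * a * x[far] ** 2 + 8 * a * x[far] - 4 * a
    return out

def cubic_sample_1d(samples, t, a=-0.5):
    """Cubic convolution of the 4 consecutive samples around the real position t."""
    i0 = int(np.floor(t))
    ks = np.arange(i0 - 1, i0 + 3)                 # the 4 nearest samples
    idx = np.clip(ks, 0, len(samples) - 1)         # replicate at the border
    return float(np.dot(cubic_kernel(t - ks, a), samples[idx]))

if __name__ == "__main__":
    print(cubic_sample_1d(np.array([0.0, 1.0, 2.0, 3.0]), 1.5))  # -> 1.5 on linear data
```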
3.10.6 B-Spline Interpolation

Polynomial interpolation always tries to find a unique function that precisely matches the data, or approximates them as much as possible within an acceptable error. A more efficient method of interpolation is to define a piecewise interpolating function whose segments are defined with low-order polynomial functions. This interpolation method is called spline (Schoenberg [10]) and can be linear, quadratic, or cubic (the most used). A characteristic of splines is that the adjacent polynomial pieces join smoothly (presenting a slight curvature), that is, with continuity of the first derivatives. There are various solutions in the literature to eliminate the Gibbs phenomenon [7] (formation of steps) and the Runge phenomenon [7] (deformation at the ends of the interval) through smoothing splines, widely used in computational graphics. Basically, a smoothing spline models a series of known points without necessarily passing through them, trying to follow their progression smoothly. In the context of image interpolation (Hou and Andrews [6]), the cubic spline is used efficiently. On n equidistant points x_k, n − 1 spline segments are defined using cubic polynomials that pass through the n points, called control points (where the value of the function f(x) is known); the interpolating spline function f_k of each segment satisfies

f_k(x) = a_0 + a_1(x - x_k) + a_2(x - x_k)^2 + a_3(x - x_k)^3   (3.111)

where the coefficients a_k, k = 0, ..., 3 are calculated given the control points and the first and second derivatives (the latter being zero at the end points). Computing each segment separately often makes the spline cumbersome. In the various applications, the B-spline (basis spline) is used, given by a sum of splines that are zero outside their support. B-splines play a central role in kernel function design for image interpolation and resizing. The B-spline forms an invariant basis for the
space of polynomials of degree n with n − 1 continuous derivatives at the junction points. The B-spline of degree n can be obtained through n 1D convolutions of the rectangular basis function B_0 defined as follows:

B_0(x) = \begin{cases} 1 & \text{if } |x| \le \frac{1}{2} \\ 0 & \text{if } |x| > \frac{1}{2} \end{cases}   (3.112)

Consequently, a B-spline function of degree n is defined by the following expression:

B_n(x) = \underbrace{B_0(x) \ast B_0(x) \ast \cdots \ast B_0(x)}_{n+1 \text{ times}} = B_{n-1}(x) \ast B_0(x)   (3.113)

The related Fourier transform is given by

\hat{B}_n(2\pi u) = \left(\frac{\sin \pi u}{\pi u}\right)^{n+1}   (3.114)
From the latter it is possible to derive a more explicit expression in the spatial domain of a B-spline of order n, which is

B_n(x) = \frac{1}{n!} \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} \left(x + \frac{n+1}{2} - k\right)_+^n = \sum_{k=0}^{n+1} \frac{(-1)^k (n+1)}{(n+1-k)!\,k!} \left(x + \frac{n+1}{2} - k\right)_+^n   (3.115)

where the term x_+^n indicates the truncated power function max(0, x)^n. The B-spline functions represented by (3.115) are piecewise polynomials of order n with symmetry characteristics (spatially invariant, and therefore suitable for convolution processes) and are n − 1 times differentiable. For n = 0 we obtain (3.112), i.e., the B-spline B_0 of order zero. For n = 1 we obtain the B-spline of order one, B_1 = B_0 ∗ B_0, which corresponds to the kernel of the well-known triangular interpolating function. Basically, the first-order B-spline corresponds to linear interpolation (it joins the nodes with straight lines). The first two members B_0 and B_1 of the B-spline family are graphically identical to the nearest-neighbor (Fig. 3.25) and linear (Fig. 3.27) interpolation functions described above. In particular, B_0 is almost identical to the nearest-neighbor function except for its definition interval in Eq. (3.90). The B-spline B_1, instead, is exactly the same as the linear interpolating function (Fig. 3.27). For these reasons, the Fourier transforms of the first two members of the B-spline family are equivalent to those of the nearest-neighbor and triangular functions shown in Figs. 3.25b and 3.27b. For the first two B-spline members, the same considerations made for the rectangular and triangular interpolating functions (the interpolation kernel h(x) has a finite support) apply with regard to the results obtained for geometric transformations on images. In particular, due to the presence of the prominent lateral lobes (lower for the first order), visible in the frequency
domain, a significant deviation from the characteristics of an ideal low-pass reconstruction filter follows. In order to have a behavior that tends to the ideal one, that is, a kernel function that tends to the sinc function, the B-spline degree n must be very large. We will see in the following paragraphs that, already for n = 3, with the cubic B-spline, we obtain well-performing kernels with an impulse response very similar to the sinc.
3.10.6.1 Quadratic B-Spline

A second-order B-spline (also called a quadratic B-spline) is obtained with the following convolutions:

B_2(x) = B_0(x) \ast B_0(x) \ast B_0(x) = B_0(x) \ast B_1(x)   (3.116)

The curve segments joining adjacent nodes are in this case described by parabolas, i.e., functions of the type

f_k(x) = a_0 + a_1(x - x_k) + a_2(x - x_k)^2   (3.117)
where 3 coefficients are needed and the B-spline function requires 3 control points. From (3.115) we can derive, for n = 2, the explicit expression of the quadratic B-spline and consequently determine the relative kernel function, which is

B_2(x) = \begin{cases} \frac{3}{4} - |x|^2 & \text{if } |x| < \frac{1}{2} \\ \frac{1}{2}\left(|x| - \frac{3}{2}\right)^2 & \text{if } \frac{1}{2} \le |x| < \frac{3}{2} \\ 0 & \text{otherwise} \end{cases}   (3.118)

The quadratic B-spline is not preferred over the cubic one because it often generates oscillating results at the joints between segments. Figure 3.32a, b shows the quadratic B-spline function in the spatial and frequency domain, respectively. Note that the profile of B_2(x), compared with the nearest-neighbor and linear ones, comes closer to the pass-band and stop-band characteristics of the ideal filter. The function is non-negative throughout the interval [−2, 2] in the spatial domain.
Fig. 3.32 Quadratic B-spline interpolation: a Interpolation function in the spatial domain B2 (x); b Fourier transform; c Fourier transform with logarithmic representation of the module
Fig. 3.33 Cubic B-Spline interpolation: a Interpolation function in the spatial domain B3 (x); b Fourier transform; c Fourier transform with logarithmic representation of the module
3.10.6.2 Cubic B-Spline

The expectation for a piecewise interpolating function, although composed of segments interpolated with low-order polynomials, is that of having an overall curve that oscillates as little as possible. Mathematically, this means finding an interpolating function whose first derivative does not vary abruptly in any segment. It can be shown that there is only one solution to this problem, given by the cubic spline (3.111). A cubic B-spline B_3 is defined by the following iterated convolutions of the zero-order B-spline function:

B_3(x) = B_0(x) \ast B_0(x) \ast B_0(x) \ast B_0(x) = B_0(x) \ast B_2(x)   (3.119)

From (3.115) we can derive, for n = 3, the explicit expression of the cubic B-spline and consequently determine the relative kernel function, which is

B_3(x) = \begin{cases} \frac{2}{3} - |x|^2 + \frac{1}{2}|x|^3 & \text{if } |x| \le 1 \\ -\frac{1}{6}|x|^3 + |x|^2 - 2|x| + \frac{4}{3} & \text{if } 1 \le |x| < 2 \\ 0 & \text{otherwise} \end{cases}   (3.120)

Recall that the cubic B-spline function, unlike the cubic kernel function, does not satisfy the pure interpolator constraint, i.e., h(0) = 1, h(1) = h(2) = 0 (zero crossings in the side lobes, as for the sinc): it models the curve so as to approximate the data as closely as possible and remains positive (no negative values). This is an advantage in image processing applications, in particular for displaying results. Figure 3.33a–c shows the cubic B-spline function in the spatial and frequency domains. Also in this case, the profile of B_3(x), compared with the nearest-neighbor and linear ones, comes ever closer to the pass-band and stop-band characteristics of the ideal filter. The function is non-negative throughout the interval [−2, 2] in the spatial domain. The computational complexity for cubic B-spline interpolation is O(N^4) with an image of N × N pixels.
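For reference, a sketch (not from the book) of the cubic B-spline kernel of Eq. (3.120). Note that using it directly as a convolution kernel smooths the data (approximation), consistent with the lack of the pure interpolator constraint discussed above; exact B-spline interpolation would additionally require a prefiltering step that computes the B-spline coefficients, which is not shown here.

```python
import numpy as np

def bspline3(x):
    """Cubic B-spline kernel B3(x) of Eq. (3.120), with support [-2, 2]."""
    x = np.abs(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    near = x <= 1
    far = (x > 1) & (x < 2)
    out[near] = 2.0 / 3.0 - x[near] ** 2 + 0.5 * x[near] ** 3
    out[far] = ((2.0 - x[far]) ** 3) / 6.0       # equivalent to -(1/6)|x|^3 + |x|^2 - 2|x| + 4/3
    return out

if __name__ == "__main__":
    # Non-negative values; integer-shifted copies sum to 1 (partition of unity)
    print(bspline3([0.0, 0.5, 1.0, 1.5, 2.0]))   # [0.667, 0.479, 0.167, 0.021, 0.0]
    shifts = bspline3(0.3 - np.arange(-2, 3))
    print(shifts.sum())                          # ~1.0
```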
3.10.7 Interpolation by Least Squares Approximation

An interpolation function can, more generally, be defined by a polynomial of order higher than the third, interpolating n + 1 nodes. In this case, the coefficients
of the polynomial that approximates the interpolating function h(x, y) are calculated by the least squares method. Note that this does not mean that the polynomial must necessarily be of high degree. With this method it is not required that the polynomial P(x), normally of degree s < n, passes exactly through the control points, i.e., that it satisfies the equalities

P(x_k) = f_S(k),   k = 0, 1, ..., n

Instead, we want to minimize, as much as possible, the following functional:

[P(x_0) - f_S(0)]^2 + [P(x_1) - f_S(1)]^2 + \cdots + [P(x_n) - f_S(n)]^2   (3.121)
One can prove the existence and uniqueness of a polynomial of degree s < n that approximates the nodes in the least-squares sense. This method is used when the data are noisy. The least squares method leads to the well-known normal equations

A^t A\, x = A^t b   (3.122)

with the symmetric matrix A^t A of size s × s, which admits a unique solution when it is invertible.
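A minimal sketch of this least-squares approach (not from the book): it fits a low-degree polynomial to noisy samples by forming the Vandermonde design matrix and solving the normal equations of Eq. (3.122) with NumPy; the degree and the synthetic data are illustrative assumptions.

```python
import numpy as np

def lsq_poly_fit(xs, ys, degree):
    """Fit a polynomial of the given degree to the samples (xs, ys) in the
    least-squares sense by solving the normal equations A^t A c = A^t y."""
    A = np.vander(xs, degree + 1, increasing=True)   # Vandermonde design matrix
    AtA = A.T @ A                                     # symmetric (degree+1) x (degree+1) matrix
    Atb = A.T @ ys
    return np.linalg.solve(AtA, Atb)                  # coefficients c0, c1, ...

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = np.linspace(0, 1, 20)
    ys = 1.0 + 2.0 * xs + rng.normal(0, 0.05, xs.size)   # noisy linear samples
    print(lsq_poly_fit(xs, ys, degree=2))                 # close to [1, 2, 0]
```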
3.10.8 Non-polynomial Interpolation

A non-polynomial interpolation function was proposed by Cornelius Lanczos [3]. The interpolation is defined by a kernel function h_L(x) obtained as the product of the normalized sinc function and the Lanczos sinc window. The latter is itself a sinc, in the form sinc(x/a), restricted to −a ≤ x ≤ a, where the parameter a controls the horizontal scaling of the central lobe. The kernel is defined as follows:

h_L(x) = \begin{cases} \mathrm{sinc}(x)\,\mathrm{sinc}\!\left(\frac{x}{a}\right) & \text{if } -a \le x < a \\ 0 & \text{otherwise} \end{cases}   (3.123)

or equivalently

h_L(x) = \begin{cases} 1 & \text{if } x = 0 \\ \dfrac{a \sin(\pi x)\,\sin(\pi x / a)}{\pi^2 x^2} & \text{if } 0 < |x| < a \\ 0 & \text{otherwise} \end{cases}   (3.124)

The parameter a determines the size of the kernel function and the number of lateral lobes (equal to 2a − 1) and normally takes the typical values of 2 or 3 (see Fig. 3.34). The Lanczos interpolator offers a good compromise among aliasing reduction, ringing, and detail preservation, especially when the image is not heavily subsampled. It can be implemented as a convolution with the kernel function obtained from the product of two 1D kernel functions, and the computational complexity is O(N × 4a^2).
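A sketch of the Lanczos kernel of Eqs. (3.123)–(3.124), not from the book; np.sinc is the normalized sinc sin(πx)/(πx), so the product below matches the definition. The 1D sampling helper, the border handling, and the weight normalization are assumptions.

```python
import numpy as np

def lanczos_kernel(x, a=2):
    """Lanczos kernel h_L(x) = sinc(x) * sinc(x/a) for |x| < a, 0 elsewhere;
    a = 2 or 3 are the typical choices."""
    x = np.asarray(x, dtype=float)
    out = np.sinc(x) * np.sinc(x / a)      # np.sinc is the normalized sinc
    return np.where(np.abs(x) < a, out, 0.0)

def lanczos_sample_1d(samples, t, a=2):
    """Interpolate uniformly spaced samples at position t with a 2a-tap Lanczos kernel."""
    i0 = int(np.floor(t))
    ks = np.arange(i0 - a + 1, i0 + a + 1)             # the 2a nearest samples
    idx = np.clip(ks, 0, len(samples) - 1)
    w = lanczos_kernel(t - ks, a)
    return float(np.dot(w, samples[idx]) / np.sum(w))  # normalize the weights

if __name__ == "__main__":
    print(lanczos_sample_1d(np.array([0.0, 1.0, 2.0, 3.0]), 1.25, a=2))
```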
Fig. 3.34 Lanczos non-polynomial interpolation: a Interpolation function in the spatial domain h L (x) for values of the parameter a = 1, 2, 3; b Fourier transform of the Lanczos kernel function for a = 2 with logarithmic representation of the module
3.10.9 Comparing Interpolation Operators

The problem of interpolation in the context of signal and image processing concerns the reconstruction of the continuous signal at the missing points starting from its discrete samples. In the process of image formation of a natural scene, the inverse problem was addressed: the optical system, modeled as a spatially invariant linear low-pass filtering process, together with the digitization system operating at a sampling rate compatible with the spatial resolution of the sensor, produces a discrete image from the continuous analog image. From sampling theory, we know that it is possible to reconstruct the original continuous image from the sampled one with the sinc interpolating function. In reality, it is known that the optical system introduces a blurring effect in the image, and inadequate sampling can distort the continuous signal, producing the phenomenon of aliasing. An image acquisition system attempts to balance these two phenomena, reducing them as much as possible. The interpolation methods described in the previous paragraphs reconstruct the continuous signal through the convolution process, starting from the samples and using appropriate continuous interpolating kernel functions. With interpolation, it is not possible to recover exactly the original continuous scene, also because the acquisition characteristics (optical system, sampling, acquisition mode, etc.) are not always known. However, we can extract a continuous signal well modeled from the available discrete data and then resample the (locally) reconstructed continuous image into a new discrete image on a uniform grid. Obviously, this resampling process does not improve resolution, because it is not possible to apply an inverse process (an ill-posed inverse problem), such as deconvolution, even knowing the previous acquisition characteristics. In the context of the geometric transformations of images, the problem of interpolation becomes essential, in particular for transformations that require rotations,
enlargements or reductions, and in general complex geometric corrections. In the various application sectors, it may be necessary to give greater priority to the geometric correction of the image (when strong deformations were introduced during acquisition) with respect to a possible radiometric alteration introduced by the interpolation. This often happens in the remote sensing sector. On the contrary, in sectors where the visual quality of the image is important, and where large enlargements are often required, interpolation algorithms are strategic and it is necessary to choose those that minimize artifacts (ringing, mosaic effect, aliasing) as much as possible. To have a broad comparison scenario, the various interpolation methods are tested on different geometric transformations and on different types of images. In particular, the Lena test image was used for its dominance of low frequencies, the Houses image, which presents high frequencies with replicated rectilinear geometric structures, and the concentric rings image, which presents variable radial frequencies. For a quantitative evaluation, the values of the RMSE (Root Mean Square Error) and PSNR (Peak Signal-to-Noise Ratio) are reported for each geometric transformation and interpolation method, to evaluate the quality of the image produced by each interpolation method (for the same geometric transformation) with respect to the original image. The RMSE measures the discrepancy between the original and processed images.³ The PSNR measure (expressed on the logarithmic decibel scale, dB; see Sect. 6.11 Image quality, Vol. I) should be considered not as an absolute value but as a reference measure to quantitatively evaluate how the performance of the interpolation algorithms varies for the same geometric transformation.
3.10.9.1 Examples of Image Interpolation with Reductions and Enlargements

The 512 × 512 test images were first reduced by a factor of 8 and then enlarged back to their original dimensions. For this type of geometric transformation, the interpolation methods were applied from the simplest to the most complex ones, in order to reduce the mosaic effect as much as possible. Figure 3.35 shows the results of the various interpolation methods (Nearest-Neighbor, Linear, Cubic, Lanczos2, Lanczos3) for the Lena and Houses images. The best results are obtained with the two non-polynomial interpolating kernel functions, in particular in the images re-enlarged by a factor of 8 starting from the size of 64 × 64.
³ In essence, RMSE compares the predicted value (in this case the pixel intensities of the original image) and the observed value (in this case the pixel intensities of the processed image), and is defined as follows:

RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (predval_i - obsval_i)^2}

where N indicates the number of pixels in the image.
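A sketch of the two quality measures used in the comparison (not from the book). The PSNR expression below assumes 8-bit images with peak value 255, a common convention; the normalization behind the dB values reported in the figures may differ, so the absolute numbers are not directly comparable.

```python
import numpy as np

def rmse(original, processed):
    """Root mean square error between two images of the same size."""
    diff = original.astype(float) - processed.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(original, processed, peak=255.0):
    """Peak signal-to-noise ratio in dB (assumes an 8-bit peak value by default)."""
    e = rmse(original, processed)
    return float("inf") if e == 0 else 20.0 * np.log10(peak / e)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, (64, 64))
    noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255)
    print(rmse(img, noisy), psnr(img, noisy))
```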
[Fig. 3.35 panel measurements, reduction/enlargement ×8 — Lena: Nearest RMS 10.3, PSNR 68; Linear RMS 8.0, PSNR 70; Cubic RMS 6.7, PSNR 72; Lanczos2 RMS 6.7, PSNR 72; Lanczos3 RMS 6.5, PSNR 72. Houses: Nearest RMS 18.4, PSNR 62; Linear RMS 14.1, PSNR 62; Cubic RMS 13.6, PSNR 65; Lanczos2 RMS 13.6, PSNR 65; Lanczos3 RMS 13.2, PSNR 66.]
Fig. 3.35 Results of the image reduction (first and third rows) and image magnification (second and fourth rows) by a factor of 8. From left to right, interpolating kernel functions: nearest-neighbor, linear, cubic, Lanczos2 and Lanczos3. The RMSE and PSNR measures are calculated between the original 512 × 512 image and the image rebuilt after the magnification starting from the size of 64 × 64
At the quantitative level, the discrepancy between the re-enlarged image and the original one is evaluated by the RMSE measures which decrease, as expected, i.e., the quality of the results increases, for the interpolation kernel functions in the order: nearest-neighbor, linear, cubic, Lanczos2, and Lanczos3. The absolute measure of the PSNR is less significant, even if it confirms better values in the same order as the previous kernel functions. Considering the high reduction and enlargement factor, the mosaic effect was evident with the first two interpolators; as expected, the worst results occur on the Houses image, given the dominance of the high frequencies present. This is confirmed by the RMSE values, with an initial value (nearest-neighbor) of 18.4% for the Houses image and 10.3% for the Lena image, and a final value (Lanczos3) of 13.2% and 6.5%, respectively. Recall that this is due to the poor quality of the nearest-neighbor kernel function, which is not a good low-pass filter due to the presence of external lobes in its transfer function (sinc). Repeating the tests with a reduction and magnification factor of 4 significantly improves the visual quality
Fig. 3.36 Results of multiple image rotation. The first row shows the results for the Lena image rotated 10 times by 36°, with the application of 3 interpolation methods: nearest-neighbor, linear and cubic. The second row shows the results for the same image rotated 15 times by 24° with the same interpolating kernel functions. The RMSE and PSNR measures are calculated between the original 512 × 512 image and the image obtained after the multiple rotations totaling 360°. The third and fourth rows show the results for the Houses image operating under the same conditions
of the images and attenuates the RMSE from 7 to 5% for the Lena image, and from 14.6 to 10% for the Houses image.
3.10.9.2 Examples of Image Interpolation with Multiple Rotations

The tests performed in this case concerned the iterated rotation of the image several times to achieve a total rotation of 360°. The goal is to evaluate the error accumulated by the iterated interpolation with the different interpolation kernel functions (nearest-
Fig. 3.37 Multiple rotation results for the concentric rings image. The first row shows the results for the image rotated 10 times by 36°, with the application of 3 interpolation methods: nearest-neighbor, linear and cubic. The second row shows the results for the same image rotated 15 times by 24° with the same interpolating kernel functions. The RMSE and PSNR measures are calculated between the original 512 × 512 image and the image obtained after the multiple rotations totaling 360°
Fig. 3.38 Evaluation of the various interpolation methods for image magnification with high frequency dominance. The interpolation methods used: nearest-neighbor, linear, cubic matlab, cubic Hermite, cubic h 5 (Eq. 3.110), cubic spline
Fig. 3.39 Evaluation of the interpolation methods for the magnification of a factor of 4 and rotation (24◦ ) of the image with the dominance of high frequencies. The interpolation methods used are: nearest-neighbor (image on the left), linear (image in the middle), cubic h 5 (image on the right) (Eq. 3.110)
neighbor, linear, and cubic) for the different test images (Lena, Houses, concentric rings). The test images were rotated 10 and 15 times, by 36° and 24° respectively, for a total rotation of 360°, thus obtaining the image in its starting orientation. This also makes a quantitative assessment possible, by evaluating the discrepancy between the source image and the one obtained after the iterated interpolations. Figures 3.36 and 3.37 report the results of the multiple rotations applied to the 3 test images. At a qualitative level, by enlarging the processed images, one immediately observes the blurring effect on the images interpolated with the linear kernel and the mosaic effect (with a saw-tooth effect on the edges) on the images interpolated with the nearest-neighbor kernel function. These deformations are more accentuated on the images with 15 rotations. In particular, in the concentric rings image (see Fig. 3.37) one observes, at the high frequencies, a noticeable loss of detail with the first two interpolators, while the cubic one performs well. This is confirmed quantitatively by analyzing the RMSE values, which decrease significantly for the images with fewer rotations and for those with dominant low frequencies (Lena).
3.10.9.3 Examples of Image Interpolation by Combining Magnification and Rotation

Figures 3.38 and 3.39 show the effects of the interpolation methods for geometric transformations composed of rotation and magnification of the image. In this case, we want to verify how the details of the test image are altered in the combined geometric transformation of rotation and high magnification. The interpolation methods used are: nearest-neighbor, linear, cubic Matlab (available in the Matlab library), cubic Hermite (indicated with pchip in Matlab), cubic h_5 (Eq. (3.110)), and cubic spline. The image interpolated with the linear kernel function still shows the smoothing effect due to the attenuation of high frequencies, and the aliasing effect is still present due to the persistence of the lateral lobes of the corresponding transfer function. Compared
to the nearest-neighbor it still has a stepped effect on the edges, even if reduced by the smoothing effect. Figure 3.38 highlights the best results with the cubic B-spline interpolation for the image enlarged by a factor of 4 and rotated by 24°, where the high-frequency information is not altered and does not show significant blurring.
References

1. A. Ben-Israel, T.N.E. Greville, Generalized Inverses: Theory and Applications, 2nd edn. (Springer, New York, 2003). ISBN 0-387-00293-6
2. D.C. Brown, Decentering distortion of lenses. Photogramm. Eng. 32(3), 444–462 (1966)
3. M.J. Burge, W. Burger, Principles of Digital Image Processing: Core Algorithms, 1st edn. (Springer, Berlin, 2009). ISBN 978-1-84800-194-7
4. E. Catmull, A.R. Smith, 3-D transformations of images in scanline order. Comput. Graph. 14(3), 279–285 (1980)
5. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge University Press, Cambridge, 2003)
6. H.S. Hou, H.C. Andrews, Cubic splines for image interpolation and digital filtering. IEEE Trans. Acoust. Speech Signal Process. 26, 508–517 (1978)
7. A.J. Jerri, The Gibbs Phenomenon in Fourier Analysis, Splines, and Wavelet Applications, 1st edn. (Kluwer Academic, New York, 1998)
8. R. Keys, Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 29(6), 1153–1160 (1981)
9. S.S. Rifman, D.M. McKinnon, Evaluation of digital correction techniques for ERTS images. Technical Report TRW 20634-6003-TU-OO (TRW Corporation, 1974)
10. I.J. Schoenberg, Cardinal Spline Interpolation. Regional Conference Series in Applied Mathematics, vol. 12 (1973)
11. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
4
Reconstruction of the Degraded Image: Restoration
4.1 Introduction

The radiometric reconstruction of an image, also called restoration [1], indicates a set of techniques that perform quantitative corrections on the image to compensate for the degradations introduced during the acquisition and transmission process. These degradations are represented by the fog or blurring effect caused by the optical system and by the motion of the object or of the observer, by the noise caused by the optoelectronic system and by the nonlinear response of the sensors, by random noise due to atmospheric turbulence or, more generally, by the digitization and transmission process. Enhancement techniques tend to reduce the degradation present in the image in a qualitative way, improving its visual quality without any knowledge of the degradation model, often using heuristic techniques such as, for example, contrast manipulation. Restoration techniques are used instead to eliminate or attenuate in a quantitative way the degradations present in the image, starting also from hypothesized degradation models. Restoration techniques essentially recover the original image (without degradation) from the degraded image through an inverse process of the hypothesized degradation model (for example, de-blurring, additive noise, …). Compared to methods for improving the visual quality of an image, restoration methods are based on robust physical–mathematical models (also referred to as denoising) to eliminate or mitigate the degradation with an objective evaluation. That said, in the literature there is no rigorous classification that uniquely identifies all the methods that fall under restoration. In fact, some filtering-based enhancement algorithms can also be used for restoration, particularly those that operate in the frequency domain or, in the spatial domain, like the median filter. For clarity of discussion, we will consider the restoration problem as the problem of recovering a degraded image, assumed discretized, through an inverse process that relates the data of the observed image with those predicted by a model. The goal is to define an optimal restoration model that recovers from the observed image an image as close as possible to the ideal one.
4.2 Noise Model

To calculate the noise model, it is necessary to know, or hypothesize, the degradation models. These models can be expressed in an analytical form or in an empirical way. Analytical models derive from the mathematical description of the degradation process. Many degradations are modeled through the knowledge of their transfer function or of the impulse response of the acquisition system. Empirical models are used when there is not sufficient knowledge about the type of degradation. In this case, useful information can be obtained by observing the degraded image in the spatial domain or in the frequency domain. For example, in the previous chapter on Geometric Transformations, image degradations were corrected without knowing their analytical description, by first selecting some fiducial points in the degraded image and in the ideal image and then approximating the geometric deformation with a polynomial function. The same can be done in the frequency domain to obtain an approximate numerical form of the transfer function associated with the degradation to be removed. In the spatial domain, the degradation process introduced during the image acquisition phase can be modeled by the impulse response associated with a spatially invariant linear system and, consequently, the degradation process can be traced back to the convolution operation. Figure 4.1 schematizes the degradation process of an ideal image f_I, degraded by a spatially invariant linear system whose impulse response is given by h_D, and by an additive noise source η(m, n). The degraded image g_D obtained from the spatially invariant linear system can be described with the following convolution operation:

g_D(m, n) = f_I(m, n) ∗ h_D(m, n),   m = 0, 1, ..., M − 1;  n = 0, 1, ..., N − 1   (4.1)

and starting from Eq. (4.1) we can therefore think of estimating the original image f_I, as far as possible, from the observed (degraded) discrete image g_D. The image formation process modeled by Eq. (4.1) is itself, as a real physical system, affected by noise. In general, in the context of restoration, this noise is considered statistical and can be assumed to be an additive noise η(m, n), so Eq. (4.1) becomes

g_ND(m, n) = f_I(m, n) ∗ h_D(m, n) + η(m, n),   m = 0, 1, ..., M − 1;  n = 0, 1, ..., N − 1   (4.2)

in which the additive noise is assumed to have a zero-mean Gaussian distribution. If the impulse response h_D is not completely known, the restoration problem is referred to in the literature as blind restoration, an ongoing research challenge requiring ad hoc solutions to mitigate as much as possible the degradations of the observed image.
Fig. 4.1 Block diagram of the image degradation process (ideal image f, degradation filter h_D, additive noise η, degraded image g_D, restoration filter h_R)
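A sketch of the degradation model of Eq. (4.2) and Fig. 4.1 (not from the book): assuming SciPy is available, it uses a Gaussian blur as a stand-in for the impulse response h_D and adds zero-mean Gaussian noise; the blur width and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(f_ideal, blur_sigma=2.0, noise_sigma=10.0, seed=0):
    """Simulate g_ND = f_I * h_D + eta (Eq. 4.2): spatially invariant blur
    plus zero-mean additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(f_ideal.astype(float), sigma=blur_sigma)  # f_I * h_D
    eta = rng.normal(0.0, noise_sigma, f_ideal.shape)                   # additive noise
    return blurred + eta

if __name__ == "__main__":
    f = np.zeros((64, 64)); f[24:40, 24:40] = 200.0   # simple ideal test image
    g = degrade(f)
    print(g.shape, round(float(g.mean()), 2))
```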
For the purposes of restoration, the degraded image g_D can be given as input to a spatially invariant linear filter with impulse response h_R to try to recover the original image f_I. The corrected image g_R is reconstructed with the restoration filter h_R through the following convolution operation:

g_R(m, n) = g_D(m, n) ∗ h_R(m, n)   (4.3)
Substituting Eq. (4.2), i.e., the degraded image g_ND, into Eq. (4.3), we obtain the corrected image g_R given by

g_R(m, n) = [f_I(m, n) ∗ h_D(m, n) + η(m, n)] ∗ h_R(m, n)   (4.4)
Applying the Fourier transform to the latter and using the convolution theorem, it is useful to express the reconstructed image in the frequency domain:

G_R(u, v) = [F_I(u, v) · H_D(u, v) + ℵ(u, v)] · H_R(u, v) = [G_D(u, v) + ℵ(u, v)] · H_R(u, v) = G_ND(u, v) · H_R(u, v)   (4.5)

where G_R and G_D are the Fourier transforms of the images g_R and g_D, respectively; H_D and H_R are the Fourier transforms of the impulse responses h_D and h_R, respectively; ℵ(u, v) represents the Fourier transform of the noise η(m, n); and finally G_ND(u, v) indicates the Fourier transform of the degraded image that includes both the degradation related to blurring¹ and the additive noise, to distinguish it from G_D which includes only the blurring. If the degradation can be modeled by the convolution operator and the noise η can be neglected, the reconstruction process (image restoration) can be seen as the inverse of convolution, also called deconvolution. In other words, deconvolution reverses the effects of a previous convolution operation that models the degradation introduced by the components of the acquisition system. With this strategy, the restoration problem is analyzed and solved in the frequency domain. If the noise η is not negligible, the inverse convolution operation is solved as a system of overdetermined linear equations. In the latter case, statistical approaches are used, based on the minimization of the mean squared error between the observed degraded image and the ideal one (Wiener filter, Kalman filter). With the model imposed by (4.2), the restoration problem can be solved by deconvolution via (4.1) if the additive error is negligible. If the degradation caused by the additive error cannot be neglected, as often happens in practice, it is analyzed by assuming the impulse response function h_D = 1. It is therefore essential to study the various models of additive noise. Noise sources in an image arise mainly in the processes of image formation, acquisition, and transmission.
¹ Normally, blurring indicates the effect of blurring an image. In this case, it indicates the effect introduced by a Gaussian filter to reduce the noise in the image, as described in Sect. 9.12, Vol. I.
4.2.1 Gaussian Additive Noise

In the context of restoration, the noise that degrades the image is essentially due to the fluctuation of pixel intensities in 2D space. A simplified treatment of noise consists in considering the degradation of each pixel as a zero-mean stochastic process characterized by a normal (Gaussian) statistical distribution. Many natural processes are well represented (or approximated) by normal stochastic processes.
4.2.1.1 Normal Stationary Stochastic Process

If the stochastic process is stationary, its statistical characteristics do not vary across the image. Noise in the image can be characterized by correlation and stationarity properties. In essence, the additive noise model η(m, n) can be characterized by a correlation between pixels that depends not on their positions but on the distance between them. In this case, we have an autocorrelation function R_η(m, n; m', n') which depends only on the distance between the pixels, that is, R_η(m, n; m', n') = R_η(m − m', n − n'), and not on their positions. If the noise of the pixels in the image is completely uncorrelated, the autocorrelation is

R_η(m − m', n − n') = δ(m − m', n − n') · σ_η²   (4.6)

where δ is the Dirac pulse and σ_η² indicates the variance of the stationary stochastic process η(m, n). The power spectrum S_η, or power spectral density, is defined by the Fourier transform of the autocorrelation function and is given by

S_η(u, v) = F{R_η(m, n)}   (4.7)
4.2.1.2 White Noise

White noise is defined as a stationary normal stochastic process with mean and autocorrelation function given, respectively, by

μ_η = 0,    R_η(m, n) = δ(m, n)   (4.8)

that is, however we choose the positions of two pixels in the image, the process imposes their decorrelation; therefore, seen as two normal random variables with zero correlation, they are independent (image pixels are treated as independent and identically distributed random variables—IID). It follows immediately that the spectral power density is

S_η(u, v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} R_η(m, n)\, e^{-j2\pi(um+vn)}\, dm\, dn = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \delta(m, n)\, e^{-j2\pi(um+vn)}\, dm\, dn = 1   (4.9)
This implies that the spectrum contains all the frequencies of the image with the same amplitude. The name white noise recalls the analogy with white light which, as is known, is composed of a mixture of colors (electromagnetic waves) with a uniform spectrum containing all the visible frequencies (although in reality not all have the same power). Note that white noise is characterized by infinite power, due to the presence of power at all frequencies. We know instead that real physical systems are limited in frequency, and it is difficult to generate signals with a uniform spectrum in the range between zero and infinity. It is useful to note that white noise models well the thermal noise derived from the fluctuations of the molecules in the air and of the electrons in resistors, within certain frequency ranges.
4.2.1.3 Ergodic Stochastic Process

Another property to consider when modeling noise in the image is that of ergodicity. Normally a single acquisition (realization) of an image is available, and thus there is a single observation for each pixel (seen as a random variable). Rigorous statistical modeling would require a sufficient number of observations in order to construct a probability density function for each pixel. In this way, the noise characterized for each pixel would be properly corrected. In reality, only one image is available and we cannot evaluate ensemble averages. In these cases, the averages are evaluated on the whole image and the pixels are corrected on the basis of the global average. Basically, the equivalence between the spatial averages and the ensemble averages is assumed. This is the characteristic of an ergodic stochastic process and, based on this property together with stationarity, the noise model as a stochastic process can be characterized by calculating the mean μ and the spatial variance σ² of an image with the well-known formulas:

μ = \frac{1}{NM}\sum_{l=0}^{N-1}\sum_{k=0}^{M-1} f(l, k)   (4.10)

σ^2 = \frac{1}{NM}\sum_{l=0}^{N-1}\sum_{k=0}^{M-1} [f(l, k) - μ]^2   (4.11)
With the above assumptions on the noise model, i.e., a stationary and ergodic Gaussian stochastic process, the probability density function p(z) of the gray levels z of the pixels is defined as follows:

p(z) = \frac{1}{\sqrt{2\pi}\,σ}\, e^{-\frac{(z-μ)^2}{2σ^2}}   (4.12)

For a given image acquisition device, the parameters that characterize the statistical noise model, the mean μ and the variance σ², are estimated from the degraded image, or from a portion W thereof, as follows:

μ_W = \sum_{k=0}^{L-1} z_k \, p_W(z_k), \qquad σ^2 = \sum_{k=0}^{L-1} (z_k - μ_W)^2 \, p_W(z_k)   (4.13)
Fig. 4.2 Gaussian additive noise: a PDF curve with μ = 0, σ = 0.04; b test image with only 3 intensity levels (30, 128, 230); c Gaussian noise image; d test image with Gaussian noise. For each image, the relative histograms are shown
where L indicates the number of gray levels in the entire image and p_W(z_k) indicates the probability estimates calculated from the frequencies of the corresponding gray levels z_k in the examined window W. Normally, around 70% of the intensity level distribution (see Fig. 4.2) falls in the range [(μ − σ), (μ + σ)], while about 95% falls in the range [(μ − 2σ), (μ + 2σ)]. The Gaussian noise model reproduces reasonably well the electronic noise of an acquisition system and of the sensors. Figure 4.2 shows a test image with 3 intensity levels (30, 128, 230) to which Gaussian additive noise with μ = 0, σ = 0.4 was added. The image with only the Gaussian noise is also shown, together with the histograms of the test image, of the noise, and of the noisy test image. The test image is deliberately chosen to highlight, through the histogram, how the original image is modified.
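A sketch (not from the book) that adds Gaussian noise, as in Eq. (4.12), to a flat 8-bit test patch, analogous to one of the constant levels of Fig. 4.2; the patch value and the noise level are illustrative assumptions.

```python
import numpy as np

def add_gaussian_noise(img, mu=0.0, sigma=10.0, seed=0):
    """Add Gaussian noise with mean mu and standard deviation sigma to an 8-bit
    grayscale image, clipping the result to [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(float) + rng.normal(mu, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    patch = np.full((256, 256), 128, dtype=np.uint8)   # homogeneous test region
    noisy = add_gaussian_noise(patch, sigma=10.0)
    print(round(float(noisy.mean()), 2), round(float(noisy.std()), 2))  # ~128 and ~10
```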
4.2.2 Other Statistical Models of Noise

In addition to the Gaussian model, other probability density functions (PDF) are also used in the context of spatial gray-level noise.
4.2.2.1 Rayleigh Noise

This noise model is characterized by the following PDF:

p(z) = \begin{cases} \frac{2}{b}(z - a)\, e^{-(z-a)^2/b} & \text{if } z \ge a \\ 0 & \text{if } z < a \end{cases}   (4.14)

with the mean μ and the variance σ² calculated as follows:

μ = a + \sqrt{\pi b/4}, \qquad σ^2 = \frac{b(4 - \pi)}{4}   (4.15)

where the parameter a indicates the translation of the curve from the origin. As shown in Fig. 4.3, the shape of the PDF curve, compared to the typical Gaussian shape, is slightly skewed to the right. While for the Gaussian PDF,
the mean and variance calculated with Eqs. (4.13) are sufficient, for the Rayleigh PDF the parameters a and b must be estimated from the histogram, given the mean and variance. The Rayleigh noise model is more appropriate for radar images (range measurements) and for images with limited dynamics in the scene. Figure 4.3 shows a test image with 3 intensity levels (30, 128, 230) to which additive Rayleigh noise was added, with the PDF curve characterized by the parameters a = 0, b = 0.4. The noisy test image and its histogram are also shown; the histogram is characterized by the 3 peaks associated with the 3 intensity levels present in the test image, on which the typical Rayleigh noise distribution profile is visible.
4.2.2.2 Erlang Noise (Gamma)

The PDF of the Erlang noise is given by

p(z) = \begin{cases} \frac{a^b z^{b-1}}{(b-1)!}\, e^{-az} & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}   (4.16)

where the mean μ and the variance σ² are given by

μ = \frac{b}{a}, \qquad σ^2 = \frac{b}{a^2}   (4.17)

with the parameter a > 0 and b a positive integer, which together model the shape of the PDF curve (see Fig. 4.4). The Erlang distribution is a special case of the Gamma distribution (described in Sect. 9.2.4, Vol. I), in which the shape parameter b does not necessarily have to be an integer. This type of noise is considered in the treatment of laser images. Figure 4.4 shows a test image with 3 intensity levels (30, 128, 230) to which additive Erlang noise was added, with the PDF curve characterized by the parameters a = 2, b = 1. The noisy test image and its histogram are also shown; the histogram is characterized by the 3 peaks associated with the 3 intensity levels present in the test image, on which the typical Erlang noise distribution profile is visible.
Fig. 4.3 Rayleigh additive noise: a PDF curve with a = 0, b = 0.4; b test image with only 3 intensity levels (30, 128, 230) with Rayleigh additive noise and c relative histogram
Fig. 4.4 Erlang additive noise (Gamma): a PDF curve with a = 2, b = 1; b test image with only 3 intensity levels (30, 128, 230) with additive Erlang noise and c relative histogram
Fig. 4.5 Exponential additive noise: a PDF curve with a = 3; b test image with only 3 intensity levels (30, 128, 230) with exponential additive noise and c relative histogram
4.2.2.3 Exponential Noise

The PDF of the exponential noise (Fig. 4.5) is given by

p(z) = \begin{cases} a\, e^{-az} & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}   (4.18)

where the mean μ and the variance σ² are given by

μ = \frac{1}{a}, \qquad σ^2 = \frac{1}{a^2}   (4.19)

with the parameter a > 0. The exponential noise PDF coincides with that of the Erlang noise with parameter b = 1. Like the Erlang noise, the exponential noise is also considered in the treatment of laser images. Figure 4.5 presents, similarly, the application of exponential noise with parameter a = 3 to the same test image.
4.2.2.4 Uniform Noise

The PDF of the uniform noise (Fig. 4.6) is given by

p(z) = \begin{cases} \frac{1}{b-a} & \text{if } a \le z \le b \\ 0 & \text{otherwise} \end{cases}   (4.20)

where the mean μ and the variance σ² are given by

μ = \frac{a+b}{2}, \qquad σ^2 = \frac{(b-a)^2}{12}   (4.21)
4.2.3 Bipolar Impulse Noise
The PDF curve of the bipolar impulse noise (Fig. 4.7) is given by the following expression:

p(z) = Pa   if z = a
p(z) = Pb   if z = b
p(z) = 0    otherwise    (4.22)

This noise is also called salt and pepper, as it produces light and dark points scattered throughout the image. If b > a, pixels with intensity b will appear as very bright dots, while pixels with intensity a will appear as black dots. If either Pa or Pb is zero, the impulse noise is unipolar.
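As a minimal illustrative sketch of this model (the function name and parameter values below are our own choices, not taken from the text), the following Python snippet adds bipolar impulse noise to an image; setting one of the two probabilities to zero yields the unipolar case.

```python
import numpy as np

def add_salt_and_pepper(img, p_salt=0.05, p_pepper=0.05, salt=255, pepper=0, seed=None):
    """Add bipolar impulse noise: each pixel becomes 'salt' with probability
    p_salt, 'pepper' with probability p_pepper, and is left unchanged otherwise."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    r = rng.random(img.shape)
    out[r < p_pepper] = pepper                               # dark impulses
    out[(r >= p_pepper) & (r < p_pepper + p_salt)] = salt    # bright impulses
    return out

# Unipolar noise: set one probability to zero, e.g.
# add_salt_and_pepper(img, p_salt=0.1, p_pepper=0.0)  -> "salt" only
```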
4.2.4 Periodic and Multiplicative Noise
Periodic noise is generated by electronic and electromagnetic interference, both during the acquisition and during the transmission of images. It manifests itself over the whole image with regular geometric structures and is often of a wave-like type with different frequencies. This characteristic of periodicity is easily visible in the frequency domain.
Fig. 4.6 Uniform additive noise: a PDF curve with a = −1, b = 1; b test image with only 3 intensity levels (30, 128, 230) with uniform additive noise and c relative histogram
Fig. 4.7 Impulse additive noise: a test images with only 3 intensity levels (30, 128, 230) with bipolar additive noise 0.1 (salt and pepper); b unipolar salt; c unipolar pepper and related histograms
The removal of periodic noise is reported in Sect. 9.11, Vol. I with the description of the various filters in the frequency domain (see Chap. 9, Vol. I). If the frequencies associated with the periodic noise are well separable (especially those caused by spurious light flashes or by transmission), they can be removed selectively with the various band-stop filters: ideal, Butterworth, Gaussian, and Notch. The multiplicative noise, also known as speckle noise, occurs during the acquisition and transmission of images, particularly in contexts of high luminance with the presence of irregular surfaces. Since this noise is produced by interference phenomena, it is generated whenever images of complex objects are acquired using highly coherent waves (for example, in remote sensing acquisition systems with SAR—Synthetic Aperture Radar images). This type of noise depends on the signal itself and is difficult to remove with the traditional noise models presented previously. A solution is described in Sect. 1.22.5 with the homomorphic filter.
4.2.5 Estimation of the Noise Parameters The characteristics of the additive noise of an image caused by the acquisition device (sensor and control electronics) can be provided in the technical specifications of the device, but this generally does not take into account operating conditions that often alter its normal functionality. An alternative way is to directly calibrate the acquisition system in real operating conditions to estimate the characteristics of the intrinsic noise. For example, by using sample objects and operating in different lighting conditions, different images are acquired from which the characteristics of
4.2 Noise Model
219
the acquisition system are deduced, estimating the parameters useful for removing the generated noise. If, on the other hand, only the degraded images are available, the strategy is to estimate the noise parameters through the evaluation of the image statistics (mean and variance), from which the parameters a and b that characterize the specific PDF are then extracted. The statistical parameters are estimated with Eqs. (4.13) by calculating the histogram of the gray levels p_W(z_k) on selected areas of the image. These values represent an estimate of the probability of occurrence of the gray level z_k and together constitute a sufficient approximation of the PDF, as sketched in the example below. If this approach is not sufficient for noise removal, in analogy to the filtering methods used to improve the visual quality of the image, spatial filtering is also employed for restoration, as we will see in the following paragraph.
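The following sketch (our own helper names, under the assumption that a nominally flat image patch is available) estimates the mean and variance from such a patch and then inverts the Rayleigh relation (4.15) and the uniform relation (4.21) to recover the parameters a and b.

```python
import numpy as np

def estimate_noise_stats(region):
    """Estimate mean and variance of the noise from an image patch
    that is nominally of constant intensity (Eq. 4.13)."""
    z = region.astype(np.float64).ravel()
    return z.mean(), z.var()

def rayleigh_params(mu, var):
    """Invert Eq. (4.15): sigma^2 = b(4 - pi)/4 and mu = a + sqrt(pi*b/4)."""
    b = 4.0 * var / (4.0 - np.pi)
    a = mu - np.sqrt(np.pi * b / 4.0)
    return a, b

def uniform_params(mu, var):
    """Invert Eq. (4.21): mu = (a + b)/2, sigma^2 = (b - a)^2 / 12."""
    half_width = np.sqrt(3.0 * var)
    return mu - half_width, mu + half_width
```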
4.3 Spatial Filtering for Noise Removal
In the absence of blurring, when the degraded image contains only additive noise η(x, y), the degradation model is g(x, y) = f(x, y) + η(x, y)
(4.23)
An approximate estimate of the original image f(x, y) is obtained by implementing spatial filtering through the convolution operator. Basically, the kernel functions of the arithmetic mean filter and of the Gaussian filter for image smoothing, already described in Sect. 9.12, Vol. I, are applied. As is known, these filters do not alter the local average and are spatially invariant. They reduce uniform and Gaussian noise. In the following, we describe other types of filters that estimate from the degraded image g(x, y) an image f̂(x, y), i.e., an approximation of f(x, y).
4.3.1 Geometric Mean Filter The geometric mean filter produces a smoothing effect in the image similar to the arithmetic average filter, but has the advantage of less detail loss. If we indicate with Wx y the rectangular window of dimensions m × n, centered in the generic position (x, y) of the degraded image g(x, y), the filtered image fˆ(x, y) with the geometric mean is given by the following relation:
f̂(x, y) = [ ∏_{(j,k)∈W_xy} g(j, k) ]^(1/mn)    (4.24)
where each pixel is recalculated as the product of all the pixels included in the window W, raised to the power 1/(mn).
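A minimal sketch of this filter, assuming NumPy/SciPy are available (the function name is ours): the product raised to the power 1/(mn) is computed equivalently as the exponential of the local mean of the logarithms, with a small offset to avoid log(0).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def geometric_mean_filter(img, size=3, eps=1e-6):
    """Geometric mean filter (Eq. 4.24): product of the pixels in the m x n
    window raised to 1/(mn), computed via the local mean of the logarithms."""
    g = img.astype(np.float64) + eps          # avoid log(0)
    log_mean = uniform_filter(np.log(g), size=size)
    return np.exp(log_mean) - eps
```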
Fig. 4.8 Application of spatial mean filters for the removal of the only additive noise present in the simple test image with 3 levels of intensity. The type of noise applied is in the following order: Gaussian with mean zero and variance 0.12 (first row); Rayleigh a = 0, b = 0.4 (second row); uniform a = 0, b = 1 (third row); salt and pepper with probability 0.1. 3 × 3 filters are used in the order of columns: Arithmetic, Geometric, Harmonic, and Contraharmonic mean with Q = −1.5
4.3.2 Harmonic Mean Filter
The harmonic mean filter, operating in the same way as the previous one, is given by the following expression:

f̂(x, y) = mn / Σ_{(j,k)∈W_xy} (1/g(j, k))    (4.25)
This filter is adequate for reducing Gaussian noise and impulse noise consisting of white dots (salt noise); instead, it introduces artifacts, with black points, in the case of pepper impulse noise.
4.3.3 Contraharmonic Mean Filter The image recovered with the contraharmonic mean filter is obtained through the following expression:
f̂(x, y) = Σ_{(j,k)∈W_xy} g(j, k)^(Q+1) / Σ_{(j,k)∈W_xy} g(j, k)^Q    (4.26)
where Q indicates the order of the filter. It proves ineffective at simultaneously eliminating salt and pepper impulse noise: for values of Q > 0 it only eliminates the pepper noise, while with Q < 0 it eliminates the salt noise. Setting Q = −1 reduces it to the harmonic mean filter, while for Q = 0 it becomes the arithmetic mean filter. It follows that the application of this filter must be carefully evaluated in relation to the degradation present in the image.
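A brief sketch of (4.26), assuming NumPy/SciPy (the function name is ours): since the ratio of window sums equals the ratio of window means, a uniform filter can be used for both numerator and denominator.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contraharmonic_mean_filter(img, size=3, Q=1.5, eps=1e-6):
    """Contraharmonic mean filter (Eq. 4.26).
    Q > 0 removes pepper noise, Q < 0 removes salt noise;
    Q = 0 gives the arithmetic mean and Q = -1 the harmonic mean."""
    g = img.astype(np.float64) + eps          # avoid 0**Q issues for Q < 0
    num = uniform_filter(g ** (Q + 1), size=size)   # window means differ from
    den = uniform_filter(g ** Q, size=size)         # sums only by the factor mn,
    return num / den                                # which cancels in the ratio
```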
4.3.4 Order-Statistics Filters
These spatial filters are based on the ordering of the pixels included in the filter action window W, which determines the value of the pixel to be selected. This class of filters includes the median, maximum, and minimum filters, already described in Sects. 9.12.4 and 9.12.5 of Vol. I. The median filter has proved to be very effective for the removal of salt and pepper impulse noise (unipolar and bipolar) without altering the image details. The recovered image with the median filter is given by the following:

f̂(x, y) = median_{(j,k)∈W_xy} {g(j, k)}    (4.27)
4.3.4.1 Max and Min Filters
The recovered images with the maximum and the minimum filter are given, respectively, by the following expressions:

f̂(x, y) = max_{(j,k)∈W_xy} {g(j, k)}    (4.28)

f̂(x, y) = min_{(j,k)∈W_xy} {g(j, k)}    (4.29)
The minimum filter removes salt noise but darkens the image, while the maximum filter removes the pepper noise and lightens it.
4.3.4.2 Midpoint Filter
This filter evaluates, in the processing window W, the central value between the minimum and the maximum, as follows:

f̂(x, y) = (1/2) [ max_{(j,k)∈W_xy} {g(j, k)} + min_{(j,k)∈W_xy} {g(j, k)} ]    (4.30)
The effectiveness of this filter is best demonstrated by noise with a random distribution such as Gaussian or uniform.
4.3.4.3 Alpha-Trimmed Mean Filter
The alpha-trimmed mean filter belongs to the class of nonlinear filters based on order statistics. In situations where the image is degraded by combined Gaussian and impulse (salt-and-pepper) noise, the following filter, known as alpha-trimmed, is used:

f̂(x, y) = (1/(mn − d)) Σ_{(j,k)∈W_xy} g_r(j, k)    (4.31)
In this case, among the pixels g(j, k) of the m × n window W being processed, the d/2 lowest intensity values and the d/2 highest values are removed, and the average is calculated on the remaining pixels, indicated with g_r. The filter effect is controlled by the parameter d, which takes values in the range [0, mn − 1]. The alpha-trimmed mean filter varies between a median and a mean filter: indeed, (4.31) becomes the equation of the arithmetic mean filter if d = 0 and it becomes the equation of the median filter if d = mn − 1.
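A direct (non-optimized) sketch of (4.31), assuming NumPy and an even trimming parameter d (the function name is ours):

```python
import numpy as np

def alpha_trimmed_mean_filter(img, size=3, d=2):
    """Alpha-trimmed mean filter (Eq. 4.31): in each size x size window, discard
    the d/2 smallest and d/2 largest values and average the remaining mn - d."""
    assert d % 2 == 0 and d < size * size
    pad = size // 2
    g = np.pad(img.astype(np.float64), pad, mode='reflect')
    out = np.empty(img.shape, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            w = np.sort(g[i:i + size, j:j + size], axis=None)   # ordered window
            trimmed = w[d // 2: w.size - d // 2]                 # drop extremes
            out[i, j] = trimmed.mean()
    return out
```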
4.3.5 Application of Spatial Mean and Order-Statistics Filters for Image Restoration
Having analyzed the various models of purely additive noise and described the main linear and nonlinear spatial filters, we now demonstrate their applicability on a simple test image (with only 3 levels of intensity) and on a more complex image (a board of electronic components). Figure 4.8 shows the spatial filtering results for the removal of the additive and impulse noise present in the simple test image, using various mean filters; the noise types applied are, in row order, Gaussian, Rayleigh, uniform, and salt and pepper. From a qualitative analysis, it is highlighted how the mean filters tend to remove the additive noise in an almost identical way, with the peculiarity that the geometric, harmonic, and contraharmonic mean filters introduce less blurring in the contrast areas of the image. They are completely unsuitable for the removal of bipolar impulse noise. The functioning details of the filters are shown in Fig. 4.11. Given the relevance of impulse noise in restoration, the behavior of the filters for unipolar impulse noise removal was analyzed. Figure 4.9 shows the results of the application of the order filters Maximum (second column) and Minimum (third column), and of the Contraharmonic mean operating with parameter Q = −1.5 (fourth column) and Q = 1.5 (fifth column). These filters prove inadequate to remove bipolar impulse noise, but they are effective at removing unipolar impulse noise. In particular, the Maximum filter gives an acceptable result only in removing the pepper impulse noise; similarly, the Minimum filter only removes the salt noise well. The Contraharmonic filter eliminates the salt noise when used with the parameter Q = −1.5, while with the positive sign of Q it eliminates the pepper noise.
Fig. 4.9 Application of the spatial filters of Maximum, Minimum, and Contraharmonic for the removal of only the impulse noise present in the simple test image with 3 levels of intensity. The type of noise applied is, in row order, bipolar (Salt and Pepper), unipolar Pepper, and unipolar Salt, all with probability 0.1. 3×3 filters are used in the order of the columns: Maximum, Minimum, Contraharmonic with Q = −1.5 and Q = 1.5
Based on this analysis, all the mean and order-statistics filters that have shown a good capacity for removal (or attenuation) of bipolar and unipolar impulse noise were applied to the complex board test image. The results are shown in Fig. 4.10. The median filter shows the best results for the removal of salt and pepper noise, even though it was used with a 3 × 3 window and a single pass, as does the Alpha-Trimmed filter with parameter d = 6. For the removal of the unipolar salt or pepper noise, the good results of the maximum, minimum, harmonic, and contraharmonic mean filters are confirmed also on the complex board image. Figure 4.11 shows a further restoration application with the presence of several noise types, using spatial mean and order-statistics filters. The first row shows the results on the board image contaminated by uniform random noise with zero mean and variance 0.3, obtained with the harmonic, midpoint, and alpha-trimmed filters. The second row shows the restoration results of the image to which salt and pepper impulse noise with probability 0.1 has been added. As expected, in the presence of impulse noise the median filter shows good performance, together with the alpha-trimmed filter, which also provides moderate smoothing.
Fig. 4.10 Application of spatial mean and order-statistics filters for the removal of impulse noise present in the board test image. The type of noise applied is, in row order, bipolar (Salt and Pepper), unipolar Pepper, and unipolar Salt, all with probability 0.1. The filters used are those that have shown the best performance for bipolar and unipolar impulse noise
Fig. 4.11 Application of mean and order-statistics filters for the removal of the additive and impulse noise jointly present in the board test image. The first row shows the results of removing only the uniform random noise (with zero mean and variance 0.3) present in the board image, filtered with 3 × 3 windows. The second row shows the results of the geometric mean, harmonic mean, median, and Alpha-Trimmed filters applied to the image with Salt and Pepper noise (with probability 0.1) and uniform noise
4.4 Adaptive Filters
So far, filters with constant local actions have been considered, assuming an identical degradation over the whole image. In this category of (nonlinear) filters fall those based on the local statistical characteristics of the image. An adaptive filter checks, in the window of the image being processed, the intrinsic structures present (for example, edges, homogeneous zones) and then acts accordingly, with a filter action that tends to smooth less the areas of high variance with respect to the more uniform ones. A local evaluation of the characteristics of the image is performed by evaluating the mean μ_W and the variance σ²_W of the window W being processed. Given the local statistical characteristics, one checks how much the variance σ²_W deviates from that of the noise, σ²_η, that has degraded the image f(x, y). The variance σ²_η, as is known, is estimated on the degraded image with Eq. (4.11). The filter actions must meet the following conditions:
1. If σ²_W ≫ σ²_η, then the smoothing action of the filter must be moderate (it must return a value almost identical to g(x, y)) to preserve the details in the window W.
2. If the variances are comparable, σ²_W ≈ σ²_η, then the action of the filter must be accentuated by returning the value of the arithmetic mean (assuming a noisy zone, with the peculiarity that the whole image and the window have the same variance).
3. Finally, if σ²_η = 0, no noise is assumed, the image f(x, y) is not degraded, and the filter action should simply return g(x, y).
The adaptive filter which realizes the above-mentioned functions is given by the following formula:

f̂(x, y) = g(x, y) − (σ²_η / σ²_W) [g(x, y) − μ_W]    (4.32)

The functionality of the filter depends very much on the size of the window, to be chosen based on the characteristics of the image. Equation (4.32) implicitly assumes that σ²_η ≤ σ²_W. On the one hand, this seems reasonable, considering that the noise is additive and does not vary spatially, and also because of the relationship between the dimensions of the whole image and of the window; in reality, however, this assumption can be violated. It is possible to manage this situation, when it occurs, by admitting negative intensity values and then normalizing them, although with a loss of dynamics. Figure 4.12 shows the results of the adaptive filter applied to an image contaminated by Gaussian noise with zero mean and variance 0.01. The adaptive filter used has size 7 × 7 and, for an immediate comparison, the result of the mean filter with the same size is also reported. It highlights the better quality of the adaptive filter in removing the noise without introducing significant blurring in the image. Basically, in the high-contrast areas, the details remain clearly visible. Obviously, the filter performance is strictly linked to the estimation of the statistical parameters (mean and variance) with which the noise is modeled. In particular, the filter tends to increase the smoothing if the noise variance is overestimated (σ²_η > σ²_W).
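A compact sketch of (4.32), assuming NumPy/SciPy (the function name is ours). Instead of allowing negative values and renormalizing, a common practical variant, used here as an assumption, is to clip the ratio σ²_η/σ²_W to 1 wherever the local variance falls below the estimated noise variance.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_local_filter(g, noise_var, size=7):
    """Adaptive local noise-reduction filter (Eq. 4.32):
    f_hat = g - (sigma_eta^2 / sigma_W^2) * (g - mu_W),
    with the ratio clipped to 1 where the local variance is below noise_var."""
    g = g.astype(np.float64)
    mu_w = uniform_filter(g, size=size)                    # local mean
    var_w = uniform_filter(g * g, size=size) - mu_w ** 2   # local variance
    ratio = np.minimum(noise_var / np.maximum(var_w, 1e-12), 1.0)
    return g - ratio * (g - mu_w)
```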
Fig. 4.12 An example of the application of the Local Adaptive filter for the removal of the Gaussian additive noise with mean zero and variance 0.01. The results of the mean filter and the adaptive filter are shown
4.4.1 Adaptive Median Filter
The filters described so far are applied to the whole image, assuming a stationary or spatially invariant noise. The median filter is very effective for attenuating impulse noise that is not very dense. The adaptive median filter is effective in the presence of dense impulse noise and also produces a smoothing in the presence of non-impulse noise. The peculiarity of this filter is the variability of the window size based on the local characteristics of the image. Now consider an image window W_xy with the following definitions:
(a) z_min = minimum intensity value in W_xy
(b) z_max = maximum intensity value in W_xy
(c) z_med = median value in W_xy
(d) z_xy = intensity level at the position (x, y)
(e) W_max = maximum allowed size of the window W_xy
The filtering procedure operates on two levels, as described by Algorithms 1 and 2. The adaptive median filter has the following peculiarities:
(a) it removes salt and pepper impulse noise;
(b) it attenuates other types of additive noise through smoothing;
(c) it attenuates degradations due to blurring.
To effectively analyze the peculiarities of the adaptive median filter, it has been applied to two types of test images. Figure 4.13 shows the results of the filter applied to the Lena image (with dominant low frequencies), corrupted with high-probability impulse noise, 0.3 (images of the first row) and 0.4 (images of the second row).
Despite the high noise, the filter shows its ability to remove noise even with a limited window size, in this case set to a maximum of 7 × 7. For an objective evaluation of the effectiveness of the filter, the results of the median filter are also reported, with the relative values of the PSNR and RMSE. The comparison of the PSNR and RMSE values between the two filters indicates the better ability of the adaptive median filter to reduce the noise. For reasons of space, the results of the median filter applied with 7 × 7 windows and of the adaptive median filter with a maximum window of 9 × 9 are not reported; for this image, they did not show significant differences in either qualitative or quantitative terms.

Algorithm 1 Adaptive Median Filter: Level A
  A1 = z_med − z_min
  A2 = z_med − z_max
  if A1 > 0 and A2 < 0 then
    go to Level B
  else
    increase the window size W
  end if
  if size of W_xy ≤ W_max then
    repeat Level A
  else
    return z_med
  end if
Algorithm 2 Adaptive Median Filter: Level B
  B1 = z_xy − z_min
  B2 = z_xy − z_max
  if B1 > 0 and B2 < 0 then
    return z_xy
  else
    return z_med
  end if
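A direct (non-optimized) Python sketch of Algorithms 1 and 2, assuming NumPy and a grayscale image (the function name and padding choice are ours):

```python
import numpy as np

def adaptive_median_filter(img, w_max=7):
    """Adaptive median filter following Algorithms 1 and 2 (Levels A and B)."""
    pad = w_max // 2
    g = np.pad(img, pad, mode='reflect')
    out = img.astype(np.float64)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            size = 3
            while True:
                half = size // 2
                w = g[i + pad - half:i + pad + half + 1,
                      j + pad - half:j + pad + half + 1]
                z_min, z_max, z_med = w.min(), w.max(), np.median(w)
                if z_min < z_med < z_max:                    # Level A satisfied
                    z_xy = float(img[i, j])
                    out[i, j] = z_xy if z_min < z_xy < z_max else z_med  # Level B
                    break
                size += 2                                    # enlarge the window
                if size > w_max:
                    out[i, j] = z_med                        # maximum size reached
                    break
    return out
```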
Figure 4.14 shows the results of the filter applied to the board image (characterized by dominant high frequencies), corrupted, in addition to the impulse noise, also by Gaussian additive noise. For this type of image, the effectiveness of the filter is very evident, both in qualitative and quantitative terms. Note how, in the first two rows (the first with impulse noise 0.3 and the second with noise 0.4) and in the last column, the image details (white dots and thin black stripes) have remained intact, despite the high probability of noise, while those in the second column, related to the median filter, have been partially removed.
[Figure 4.13 panel values (Salt & Pepper probabilities 0.3 and 0.4): Median 5×5 — PSNR 66.92–68.73, RMSE 0.09–0.11; Adaptive Median 7×7 — PSNR 74.66–74.70, RMSE 0.05]
Fig. 4.13 Application of Adaptive Median filter to remove additive noise Salt and Pepper with probability 0.3 and 0.4. The results of the Median filter of 5 × 5 size are also reported, while the maximum window size for the adaptive filter is 7 × 7
Note also the better sharpness and contrast compared to the median filter. The third row shows the results of the filter applied to the image corrupted by Gaussian noise with zero mean and variance 0.005, with added impulse noise with probability 0.3. The adaptive median filter confirms its peculiarity of eliminating impulse noise and also attenuating other types of additive noise, in this case the Gaussian one. Observe how the details, despite the Gaussian noise, have been preserved even if less contrasted, while, as expected, the results of the median filter confirm a worsening in terms of sharpness and loss of detail. Using 7 × 7 windows for the median filter and maximum sizes up to 9 × 9 for the adaptive one, no better results are obtained: the median filter worsens, increasing the blurring, while the adaptive one remains essentially unchanged, albeit with a higher computational load.
4.5 Periodic Noise Reduction with Filtering in the Frequency Domain
The selective filtering approach is also very effective in the restoration context for isolating the noise and reconstructing the image, by appropriately choosing the bandwidth (around some frequencies) and the type of band-stop filter (ideal, Butterworth, and Gaussian). The latter have been described in Sect. 1.20.
[Figure 4.14 panel values: S&P 0.30 — Median 5×5 (PSNR 69.57, RMSE 0.08), Adaptive Median 7×7 (PSNR 74.91, RMSE 0.05); S&P 0.40 — Median 5×5 (PSNR 68.17, RMSE 0.10), Adaptive Median 7×7 (PSNR 73.16, RMSE 0.06); Gaussian 0.005 + S&P 0.30 — Median 5×5 (PSNR 69.22, RMSE 0.09), Adaptive Median 7×7 (PSNR 70.88, RMSE 0.07)]
Fig. 4.14 Application of the Adaptive Median filter in the presence of combined Gaussian and Salt and Pepper additive impulse noise
Figure 4.15 shows an example of image reconstruction based on the bandwidth of a band-stop filter placed where the spectral energy associated with the frequencies of the periodic interference is concentrated. Observe how periodic noise creates artifacts in the spatial domain over the entire image, while in the frequency domain it modifies only the coefficients corresponding to the frequencies of the waves that generated the noise. In essence, the periodic noise amplifies the magnitude of the corresponding spectral coefficients (the white points in the figure). In this context, the band-pass filters (already described in Sect. 1.21) are also useful: operating in the opposite way to the band-stop ones, they can isolate the components of a frequency band of an image. In fact, it may be useful to isolate the periodic noise present in an image in order to analyze it independently of the image. Figure 4.16 shows the result of the band-pass filter H_PB(u, v) = 1 − H_BB(u, v), obtained from the band-stop filter H_BB(u, v), applied to the spectrum of Fig. 4.15.
Fig. 4.15 Example of reconstruction of contaminated image from periodic noise. a Image with periodic sinusoidal noise; b spectrum of the image (a) with the white dots representative of the frequencies related to the noise. c Noise filtering mask; d filtering result
Basically, the band of the noise frequencies (represented by the luminous points) is isolated from the total spectrum and, by inverse transforming, the image of the periodic noise alone (with the scene content removed or strongly attenuated) is obtained in the spatial domain.
4.5.1 Notch Filters
This type of filter stops or passes frequencies associated with a narrow band around a central frequency; the notches appear in symmetric pairs about the origin by virtue of the symmetry property of the Fourier transform. The transfer function of an ideal notch-stop filter of radius l0, with center in (u0, v0) and its symmetric counterpart at (−u0, −v0), is given by

H_BN(u, v) = 0   if l1(u, v) ≤ l0 or l2(u, v) ≤ l0
H_BN(u, v) = 1   otherwise    (4.33)

where
l1(u, v) = [(u − M/2 − u0)² + (v − N/2 − v0)²]^(1/2)    (4.34)

and

l2(u, v) = [(u − M/2 + u0)² + (v − N/2 + v0)²]^(1/2)    (4.35)

Fig. 4.16 Image of the periodic noise isolated from the image of Fig. 4.15
where M × N indicates the frequency domain size, and l1(u, v) and l2(u, v) are the distances calculated with respect to the centers of the two notch areas. The transfer function of the Butterworth Notch-Stop (BNS) filter is given by

H_BNS(u, v) = 1 / [1 + (l0² / (l1(u, v)·l2(u, v)))^n]    (4.36)

where n indicates the order of the filter. The expression of the Gaussian Notch-Stop (GNS) filter is given by

H_GNS(u, v) = 1 − e^(−(1/2)·l1(u, v)·l2(u, v)/l0²)    (4.37)
Figure 4.17 shows the 3D graphs of the three notch-stop filters (ideal, Butterworth, and Gaussian) used for the removal of the periodic noise of the image of Fig. 4.15a. Given the symmetry of the Fourier transform, the notch filters are presented, as shown in the figure, in pairs that are symmetric with respect to the origin of the frequency domain. The shape of these filters is not limited to the circular one, and multiple pairs can be considered. The Notch-Pass filters H_NP can be derived from the previous Notch-Stop filters H_NS with the usual relation H_NP(u, v) = 1 − H_NS(u, v), where H_NP and H_NS are the transfer functions of the Notch-Pass and Notch-Stop filters, respectively.
Fig. 4.17 3D representation of Notch-stop filters (Ideal, Butterworth, and Gaussian) applied for the reconstruction of the image with periodic noise of Fig. 4.15
The Notch-Pass filters have the complementary behavior, i.e., they pass only the frequencies inside the notch areas. Figure 4.18 reports the results of the application of the Butterworth Notch-Stop filter of order 2 (see Fig. 4.17) for the removal of the periodic noise of the apollo image of Fig. 4.15a. The notch areas are all of size l0 = 10, appropriately chosen after several attempts, considering the order n and the best value of the PSNR. It is noted that the notch filter is more selective (see Fig. 4.18c) in the removal of the frequencies associated with the noise, with respect to the band-stop filter applied previously to the same image (Fig. 4.15c). Notch filters offer better flexibility and selectivity, through notch areas variable in shape and size, to manage noise removal in real applications, where the periodic noise is distributed in more complex forms even in the frequency domain. The Notch high-pass and Notch low-pass filters are obtained, respectively, from the Notch-stop and Notch-pass filters by imposing (u0, v0) = (0, 0).
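A small sketch of (4.34), (4.35), and (4.37), assuming NumPy and a centred spectrum (the function names are ours); transfer functions for several notch pairs can be combined by multiplying the individual H arrays.

```python
import numpy as np

def gaussian_notch_stop(M, N, u0, v0, l0):
    """Gaussian notch-stop transfer function (Eq. 4.37) for one pair of notches
    placed symmetrically at (u0, v0) and (-u0, -v0) about the spectrum centre."""
    u = np.arange(M)[:, None]
    v = np.arange(N)[None, :]
    l1 = np.sqrt((u - M / 2 - u0) ** 2 + (v - N / 2 - v0) ** 2)   # Eq. (4.34)
    l2 = np.sqrt((u - M / 2 + u0) ** 2 + (v - N / 2 + v0) ** 2)   # Eq. (4.35)
    return 1.0 - np.exp(-0.5 * l1 * l2 / l0 ** 2)

def apply_frequency_filter(img, H):
    """Filter an image with a centred frequency-domain transfer function H."""
    G = np.fft.fftshift(np.fft.fft2(img))
    return np.real(np.fft.ifft2(np.fft.ifftshift(G * H)))
```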
4.5.2 Optimum Notch Filtering
A further improvement of the notch filter is obtained with the so-called optimum notch filtering, a heuristic approach based on the minimization of the local variance of the reconstructed image. This approach overcomes the limitations of the previous filters, especially in the presence of images with several interference components. It initially isolates the most important structures responsible for the interference and then subtracts these structures from the degraded image, weighted with the local statistics of the reconstructed image. The procedure includes the following main steps:
1. Extraction of the main frequency components associated with the interference structures. For this purpose, a notch-pass filter H_NP(u, v) is used, centered on the interference structure. The Fourier transform of the interference structure, due to the noise, is given by the following:

ℵ(u, v) = H_NP(u, v) · G(u, v)    (4.38)

where G(u, v) indicates the Fourier transform of the degraded image.
Fig. 4.18 Application of the Butterworth Notch-stop filter of order 2 for the reconstruction of the image with periodic noise of Fig. 4.15 with all Notch areas of size l0 = 10
The experimental definition of the filter H_NP(u, v) requires a trial-and-error approach, in the sense that the spectrum G(u, v) must be carefully evaluated, observing it on a monitor, to correctly associate the filter with a real interference structure in the spatial domain.
2. Analysis in the spatial domain of the interference structure selected in the previous step. This is accomplished with the following inverse transform:

η(m, n) = F⁻¹{H_NP(u, v) · G(u, v)}    (4.39)

3. Calculation of the estimate f̂(m, n) of the original image f(m, n). If the interference η(m, n) were known exactly, given the additive nature of the degradation, the original image could be obtained by simply subtracting the noise from the degraded image: f(m, n) = g(m, n) − η(m, n). In practice, the filtering operation provides only an approximation of the real interference structures. One way to compensate for this approximation, i.e., to take into account structures not present in η(m, n), is to modulate the contribution of η(m, n) with a weighting function w(m, n), obtaining an approximation of the original image given as follows:

f̂(m, n) = g(m, n) − w(m, n) · η(m, n)    (4.40)

One way to obtain the weight function w(m, n) is to consider the local statistics at each point (m, n) over a window W(m, n) of size (2a + 1) × (2b + 1) and
minimizing on this window the variance of the estimate f̂(m, n). With the window W centered at a point (m, n), the local variance of f̂(m, n) is calculated as follows:

σ²(m, n) = \frac{1}{(2a+1)(2b+1)} \sum_{s=−a}^{a} \sum_{t=−b}^{b} [ f̂(m+s, n+t) − \overline{f̂}(m, n) ]²    (4.41)

where \overline{f̂}(m, n) indicates the local average of f̂(m, n) over the window W, given by

\overline{f̂}(m, n) = \frac{1}{(2a+1)(2b+1)} \sum_{s=−a}^{a} \sum_{t=−b}^{b} f̂(m+s, n+t)    (4.42)
Substituting (4.40) in the previous equation, we get the following:

σ²(m, n) = \frac{1}{(2a+1)(2b+1)} \sum_{s=−a}^{a} \sum_{t=−b}^{b} \{ [g(m+s, n+t) − w(m+s, n+t)η(m+s, n+t)] − \overline{[g(m, n) − w(m, n)η(m, n)]} \}²    (4.43)
To simplify the functional to be minimized, it is useful to assume that the modulation function is constant in the window W considered, and therefore w(m + s, n + t) = w(m, n) for every (s, t) ∈ W. From this, it follows:

\overline{w(m, n)η(m, n)} = w(m, n)\overline{η}(m, n)    (4.44)
At this point, we can replace (4.44) in (4.43) which, with the simplifications considered, becomes

σ²(m, n) = \frac{1}{(2a+1)(2b+1)} \sum_{s=−a}^{a} \sum_{t=−b}^{b} \{ [g(m+s, n+t) − w(m, n)η(m+s, n+t)] − [\overline{g}(m, n) − w(m, n)\overline{η}(m, n)] \}²    (4.45)
To minimize the variance σ², the following is required:

∂σ²(m, n) / ∂w(m, n) = 0    (4.46)
and solving with respect to w(m, n), we get the following result:

w(m, n) = [ \overline{gη}(m, n) − \overline{g}(m, n)\overline{η}(m, n) ] / [ \overline{η²}(m, n) − \overline{η}²(m, n) ]    (4.47)
Recall that \overline{g}(m, n) and \overline{η}(m, n) are, respectively, the local averages of the degraded image and of the additive noise structure, \overline{η}²(m, n) is the square of the mean noise output from the notch filter, \overline{η²}(m, n) is the mean squared noise output from the notch filter, and \overline{gη}(m, n) is the local mean of the product between the degraded image and the noise. Therefore, the optimum notch filtering is applied by first estimating the noise η with Eq. (4.39); then, once the modulation function w(m, n) has been calculated with (4.47), the reconstructed image is obtained with (4.40).
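A compact sketch of steps (4.40) and (4.47), assuming NumPy/SciPy, that the interference pattern η has already been extracted with a notch-pass filter, and an assumed window size of 15 (the function name is ours):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def optimum_notch_restore(g, eta, size=15):
    """Optimum notch filtering: compute the local modulation weight w(m,n)
    (Eq. 4.47) and subtract w*eta from the degraded image g (Eq. 4.40)."""
    g = g.astype(np.float64)
    eta = eta.astype(np.float64)
    mean = lambda x: uniform_filter(x, size=size)       # local averages over W
    num = mean(g * eta) - mean(g) * mean(eta)           # covariance of g and eta
    den = mean(eta * eta) - mean(eta) ** 2              # local variance of eta
    w = num / np.maximum(den, 1e-12)
    return g - w * eta                                  # Eq. (4.40)
```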
4.6 Estimating the Degradation Function
Having examined various types of noise, we now return to the blurring problem of an image introduced in the first paragraph. The main degradation sources arise during the acquisition of the images, due to optical and motion blurring, digitization and quantization, environmental conditions (for example, atmospheric turbulence blurring), and additive noise on the pixel intensity caused by the acquisition sensor and by the image transmission. We rewrite the general degradation equation (4.2) of an image under the hypothesis of a spatially invariant linear system (to model blurring that is stationary across the image) in the presence of additive noise:

g_ND(m, n) = f_I(m, n) ∗ h_D(m, n) + η(m, n)
(4.48)
where we recall that g_ND is the degraded image, f_I is the image before degradation, h_D indicates, in this case, the impulse response of the degradation (point spread function, PSF), and η is the additive noise. The previous degradation equation in the Fourier domain results in

G_ND(u, v) = F_I(u, v) · H_D(u, v) + ℵ(u, v)    (4.49)

With reference to (4.49), the restoration problem of the degraded image reduces to the estimation of the degradation function H_D(u, v). If this estimate is made without any prior knowledge of it, we speak of blind deconvolution; with the deconvolution, we then obtain an approximation F̂(u, v) of the original image F_I(u, v). There are three methods for determining an estimate of H_D:
1. Observation of the degraded image.
2. Experimental determination.
3. Determination based on a physical–mathematical model.
4.6.1 Derivation of H_D by Observation of the Degraded Image
Starting from the degraded image g_D (with only blurring present, whose degradation function h_D is unknown), some of its significant parts are analyzed (for example, zones of high contrast), which we indicate with g_W. In correspondence with g_W, an approximation f̂_W of the original undegraded image is constructed and, consequently, an estimate h_W of h_D is obtained. If we consider the additive noise η to be negligible, operating in the Fourier domain, from Eq. (4.49) we obtain

H_D(u, v) ≈ H_W(u, v) = G_W(u, v) / F̂_W(u, v)    (4.50)
Having evaluated H_W(u, v) for a small window of the image, we can use the shape of this degradation function to obtain an estimate of H_D(u, v) for the whole image.
4.6.2 Derivation of H_D by Experimentation
Using an acquisition device similar to the one that generated the degraded image, one can derive the impulse response h_D(m, n) of that system. Basically, with this same device the image g_D(m, n) of a point source (a small bright spot) simulating the δ function of amplitude A is acquired. Considering that the Fourier transform of an impulse is a constant, with A describing the strength of the impulse, in the frequency domain we obtain

H_D(u, v) = G_D(u, v) / A    (4.51)

which corresponds to an estimate of the transfer function, i.e., of the system degradation function (see Fig. 4.19), where G_D is the Fourier transform of the observed spot image. It is assumed that the degradation system is linear and spatially invariant. For an accurate determination of the degradation function, several acquisitions may be required.
4.6.3 Derivation of H_D by Physical–Mathematical Modeling: Motion Blurring
Following a phenomenological analysis of the causes of degradation, specific to the application context, it is possible to define a physical–mathematical model that describes the degradation process of the image. For example, physical degradation models caused by the relative linear motion (motion blurring) between object and observer (acquisition system) have been proposed in the literature. In these linear motion conditions, we impose some assumptions:
1. The error of the optical component and the delay introduced by the shutter are ignored;
2. The image f(m, n) is considered relative to objects of the scene lying in the same plane;
3. The exposure at each pixel is obtained by integrating the instantaneous exposure over the time interval T during which the shutter is open.
Fig. 4.19 Experimental determination of the transfer function H D (u, v) through the blurring of a point source
With such assumptions, the acquired degraded image g_D(m, n) is given by

g_D(m, n) = ∫₀ᵀ f[m − d_x0(t), n − d_y0(t)] dt    (4.52)

where d_x0(t) and d_y0(t) denote, respectively, the horizontal and vertical translation at time t, whose expressions are

d_x0(t) = at/T    d_y0(t) = bt/T    (4.53)

with a and b indicating the width of the blurring in the horizontal and vertical directions of motion, respectively. When t coincides with the exposure time T, we have the maximum width of blurring, which corresponds precisely to the values of a and b. The Fourier transform of the degraded image (4.52) is calculated as follows:
G_D(u, v) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_D(m, n) e^{−j2π(um+vn)} dm dn
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} { ∫₀ᵀ f[m − d_x0(t), n − d_y0(t)] dt } e^{−j2π(um+vn)} dm dn
          = ∫₀ᵀ { ∫_{−∞}^{∞} ∫_{−∞}^{∞} f[m − d_x0(t), n − d_y0(t)] e^{−j2π(um+vn)} dm dn } dt    (4.54)
          = ∫₀ᵀ F(u, v) e^{−j2π[u d_x0(t) + v d_y0(t)]} dt
          = F(u, v) ∫₀ᵀ e^{−j2π[u d_x0(t) + v d_y0(t)]} dt
The expression in braces in the second line of the previous equation is motivated by the inversion of the order of integration: this expression corresponds to the Fourier transform of f[m − d_x0(t), n − d_y0(t)]. The penultimate line of the equation is motivated by the translation property of the Fourier transform, and the last one by the independence of F(u, v) from t. If we define H_D(u, v) as follows:

H_D(u, v) = ∫₀ᵀ e^{−j2π[u d_x0(t) + v d_y0(t)]} dt    (4.55)
i.e., corresponding to the integral in the last expression of (4.54), we finally obtain the known relation

G_D(u, v) = F(u, v) H_D(u, v)    (4.56)

The transfer function H_D(u, v) given by (4.55) is defined if the linear motion functions d_x0(t) and d_y0(t) expressed by (4.53) are known. If we simplify the uniform linear motion by considering it only horizontal, the motion functions become

d_x0(t) = at/T    d_y0(t) = 0    (4.57)
and for t = T we have, at the exposure time, the maximum value of the blurring width, given by d_x0(T) = a; consequently, (4.55) reduces as follows:

H_D(u, v) = ∫₀ᵀ e^{−j2πu d_x0(t)} dt
          = ∫₀ᵀ e^{−j2πuat/T} dt
          = (T/(πua)) sin(πua) e^{−jπua}    (4.58)
          = T sinc(πua) e^{−jπua}
Due to the presence of the sinc function, H_D(u, v) has zero crossings at the points where u = n/a, for integer values of n. The transfer function H_D(u, v) for motion in both the horizontal and vertical directions, with the equations of motion given by (4.53), results in

H_D(u, v) = T sinc[π(ua + vb)] e^{−jπ(ua+vb)}    (4.59)
The blurring process of the image due to uniform linear motion, modeled as a spatially invariant linear process, is characterized, in the spatial domain, by an impulse response given by a rectangular function (translation of the pixels in the horizontal and vertical directions), while in the spectral domain it is characterized by the transfer function H_D, which models the degradation with the sinc functional component. Figure 4.20b shows the result of the degradation induced by horizontal linear motion, modeled by (4.59), with the motion parameters a = 0.125, b = 0 and the exposure parameter T = 1.
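A brief sketch of (4.59) and (4.56), assuming NumPy, centred integer frequency coordinates, and our own function names; note that np.sinc(x) is sin(πx)/(πx), so the π of (4.59) is already included.

```python
import numpy as np

def motion_blur_tf(M, N, a=0.125, b=0.0, T=1.0):
    """Uniform linear-motion blur transfer function (Eq. 4.59):
    H(u, v) = T * sinc(pi*(ua + vb)) * exp(-j*pi*(ua + vb))."""
    u = np.arange(M)[:, None] - M / 2
    v = np.arange(N)[None, :] - N / 2
    s = u * a + v * b
    return T * np.sinc(s) * np.exp(-1j * np.pi * s)

def apply_blur(img, H):
    """Degrade an image with a centred transfer function H (Eq. 4.56)."""
    G = np.fft.fftshift(np.fft.fft2(img)) * H
    return np.real(np.fft.ifft2(np.fft.ifftshift(G)))
```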
4.6.4 Derivation of H_D by Physical–Mathematical Modeling: Blurring by Atmospheric Turbulence
A blurring model [2] reported in the literature, induced by the physical phenomenon of atmospheric turbulence, is given by the following expression:

H_D(u, v) = e^{−k(u² + v²)^(5/6)}    (4.60)
where the constant k characterizes the nature of the turbulence. A value of k = 2.5·10⁻³ indicates strong turbulence, while k = 2.5·10⁻⁴ indicates slight turbulence. Basically, this model of uniform blurring reduces to a Gaussian low-pass filter if the exponent 5/6 is set equal to 1. In Fig. 4.20d, the degraded image is shown, obtained by applying to the image of Fig. 4.20c the atmospheric turbulence model described by (4.60), with k = 0.0025, which corresponds to strong turbulence.
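A minimal sketch of (4.60), assuming NumPy and centred frequency coordinates (the function name is ours); the resulting array can be applied with the same frequency-domain multiplication used for the motion-blur model above.

```python
import numpy as np

def turbulence_tf(M, N, k=0.0025):
    """Atmospheric-turbulence degradation model (Eq. 4.60):
    H(u, v) = exp(-k * (u^2 + v^2)^(5/6)).
    k = 0.0025 corresponds to strong turbulence, k = 0.00025 to slight."""
    u = np.arange(M)[:, None] - M / 2
    v = np.arange(N)[None, :] - N / 2
    return np.exp(-k * (u ** 2 + v ** 2) ** (5.0 / 6.0))
```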
Fig. 4.20 Image degradation through physical–mathematical model: a original image; b result of the linear motion induced by the (4.59) with a = 0.125 b = 0 T = 1; c original image; d result of the degradation induced by the atmospheric turbulence modeled by the (4.60) with parameter k = 0.0025 which implies a strong turbulence
4.7 Inverse Filtering—Deconvolution
In the previous paragraph, three ways of estimating the degradation function H_D(u, v) in the frequency domain have been described. Applying the inverse process of deconvolution, it is now possible to derive an approximation F̂_I(u, v) of the original image. The inverse filter is based on the following assumption: given the degradation transfer function, the image reconstruction is obtained by filtering the degraded image with a filter whose transfer function is the inverse of the degradation. The inverse filter can be better understood if we consider that the degraded image G_D can be thought of as the result of a filtering process in the frequency domain. With these assumptions, using the previous notations and the degradation process, by the convolution theorem, (4.1) rewritten in the frequency domain becomes

G_D(u, v) = F_I(u, v) · H_D(u, v)    (4.61)
(4.62)
The reconstruction filter H R (u, v) introduced in the first paragraph is just the inverse function of H D , that is, 1 H R (u, v) = (4.63) H D (u, v)
Fig. 4.21 Inverse filtering for image restoration
Figure 4.21 shows the functional scheme of the inverse filtering procedure in the frequency domain for the restoration of the image degraded by the H_D(u, v) filter and rebuilt with the H_R(u, v) filter, applying (4.62). A more realistic inverse filtering situation also takes into account the additive noise ℵ(u, v) (Fig. 4.1). The Fourier transform of the reconstructed image G_R can be calculated by replacing (4.61) and (4.63) in (4.5), obtaining the following expression, which also includes the additive noise:

G_R(u, v) = F̂_I(u, v) = G_ND(u, v) · 1/H_D(u, v)
          = [F_I(u, v) · H_D(u, v) + ℵ(u, v)] · 1/H_D(u, v)    (4.64)
          = F_I(u, v) + ℵ(u, v)/H_D(u, v)
remembering that G_ND indicates the Fourier transform of the image degraded by blurring and additive noise, to distinguish it from the degradation G_D due only to blurring. Subsequently, the image reconstructed in the spatial domain, g_R(m, n), is obtained with the inverse Fourier transform:

g_R(m, n) = f_I(m, n) + (1/4π²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} [ℵ(u, v)/H_D(u, v)] e^{j(mu+nv)} du dv    (4.65)
Equation (4.64) shows that, despite the knowledge of H_D, the original image could be calculated only with the knowledge of the random additive noise spectrum ℵ(u, v), which is not easily determinable. In the absence of additive noise, and knowing the degradation model H_D, the application of the inverse filter leads to the exact reconstruction of the original image, as shown in Fig. 4.21. Under this hypothesis, the additive noise η in (4.2) can be neglected, with the consequent cancelation of the second term of (4.64); the image obtained is perfectly reconstructed, completely coinciding with the original image f_I. In reality, it is known that this is not always true, due to practical implementation problems. In fact, the degradation of the original signal modeled with the H_D(u, v) filter can fluctuate around zero (in particular at high frequencies, where noise is most apparent), with the consequent non-definability of the inverse filter H_R, as can be seen from (4.63). Let us now analyze (4.64) in detail and, in particular, the H_D degradation filter which, fluctuating with values around zero toward the high frequencies (amplified by the additive noise), leads to an incorrect evaluation of the estimate of F(u, v).
Fig. 4.22 Functional diagram of the inverse filtering process to limit the amplification of the additive noise caused by the inverse reconstruction filter, which behaves as a typical high-pass filter
Figure 4.22 illustrates this situation in detail: by analyzing the various steps of inverse filtering, one can see how the various spectra are modified by the H_D degradation filter and by the inverse reconstruction filter H_R = 1/H_D. In particular, the spectrum of F̂ obtained with the complete inverse filter is strongly modified in the high-frequency area, amplified by the additive noise included in G_ND. The problem can be attenuated by properly cutting the high frequencies with a threshold, thus producing a modified inverse filter, shown in the figure as the truncated H_R filter. In summary, under the hypothesis of negligible additive noise and knowing the degradation model H_D, it is possible, by applying inverse filtering, to reconstruct the original image exactly. If, on the other hand, the additive noise is not negligible, as a first approach the inverse filter is modified accordingly. The resulting modified filter H_R is called the pseudo-inverse filter, which redefines (4.63) as follows:

H_R(u, v) = 1/H_D(u, v)   if |H_D(u, v)| ≥ ε
H_R(u, v) = 0             if |H_D(u, v)| < ε    (4.66)

where ε is a threshold value to be determined experimentally. Another problem arises when considering the large dynamic range of the data in the frequency domain. Normally, the H_D and F_I spectra tend to zero toward the high frequencies. Since the inverse filter divides by the Fourier transform of the degradation filter (see Eq. (4.64)), for very small values of the spectrum amplitude it is possible to introduce oscillations with uncontrollable consequences on the results. These problems of numerical approximation of the filter (due to the ratio ℵ(u, v)/H_D(u, v)) lead to the reconstruction of images with artifacts even in the presence of a slight additive noise. The problem of image reconstruction has been tested for three different types of degradation.
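A minimal sketch of the pseudo-inverse filter (4.66), assuming NumPy, a centred spectrum of the degraded image, and our own function name:

```python
import numpy as np

def pseudo_inverse_filter(G_nd, H_d, eps=0.1):
    """Pseudo-inverse filter (Eq. 4.66): invert the degradation only where
    |H_D| >= eps, and set the reconstruction filter to zero elsewhere,
    to avoid amplifying noise where H_D is close to zero."""
    H_r = np.zeros_like(H_d, dtype=complex)
    mask = np.abs(H_d) >= eps
    H_r[mask] = 1.0 / H_d[mask]
    return G_nd * H_r          # estimate of F(u, v), to be inverse-transformed

# Usage sketch:
# G_nd = np.fft.fftshift(np.fft.fft2(g_nd))
# f_hat = np.real(np.fft.ifft2(np.fft.ifftshift(pseudo_inverse_filter(G_nd, H_d))))
```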
4.7.1 Application of the Inverse Filter: Example 1
Figure 4.23 shows an image degradation obtained through a 5×5 Gaussian smoothing filter with variance 1 (first row) and, subsequently, a degradation obtained by adding random noise (σ = 10) with normal distribution (second row). The results of the complete inverse filter, applied to the blurred image, are shown: the image is completely rebuilt with (4.62), given the degradation function H_D(u, v). The second row reports the blurred image with random noise, its reconstruction obtained by applying the complete inverse filter (4.64), and its reconstruction obtained by applying the pseudo-inverse filter with the threshold ε = 0.6 according to (4.66). It is observed that, with the complete inverse filter, the presence of random noise makes the ℵ(u, v)/H_D(u, v) component prevail in G_ND(u, v), making the rebuilt image totally useless, as artifacts are dominant. The application of the pseudo-inverse filter reconstructs the degraded image in an acceptable way. Apart from the truncation errors, knowing the degradation filter exactly, the Fourier transform of the reconstructed image is identical to the original one. The additive noise, even if slight, instead strongly alters the high-frequency components, making the application of the inverse filter critical, as shown in Fig. 4.22 with the spectral analysis of the entire inverse filtering process.
Fig. 4.23 Reconstruction of the degraded image. The first row shows the original image; the image degraded with a 5 × 5 Gaussian filter with variance 1; the image rebuilt with the complete inverse filter. The second row shows the original image degraded by blurring with the addition of random noise; its reconstruction by applying the complete inverse filter (4.64); and its reconstruction by applying the pseudo-inverse filter with the threshold ε = 0.6 according to (4.66)
4.7.2 Application of the Inverse Filter: Example 2
In this example, the inverse filter is applied, analogously to the previous example, to reconstruct the image degraded by the blurring induced by the horizontal linear motion of the object with respect to the image acquisition system. The first row of Fig. 4.24 shows the result of the image reconstruction obtained by applying the complete inverse filter. The degradation function due to linear motion is modeled by (4.59) with the parameters a = 0.125, b = 0, T = 1. Also in this case, with the complete inverse filter, the image is correctly rebuilt. The second row shows the results of the pseudo-inverse filter. Although the blurring of the image induced by the linear motion is correctly modeled, an acquisition system always introduces an additive noise, which in this case was simulated as random noise with normal distribution and σ = 10.
Fig. 4.24 Reconstruction of the image degraded by linear motion. The first row shows the original image; the image degraded with the H_D degradation function modeled by (4.59) with the parameters a = 0.125, b = 0, T = 1; the image reconstructed with the complete inverse filter. The second row shows the original image blurred and with random noise; its reconstruction by applying the complete inverse filter; and its reconstruction by applying the inverse filter with the threshold ε₁ = 60 according to (4.67)
When the additive noise η is no longer negligible, or when the degradation function is modeled very roughly, the spectrum components of the inverse reconstruction filter can be modified more selectively to better filter the high frequencies that cause dominant artifacts in the reconstructed image. For example, a threshold can be used to exclude the high-frequency components by modifying the inverse filter H_R as follows:

H_R(u, v) = 1/H_D(u, v)   if |u² + v²| ≤ ε₁
H_R(u, v) = 0             if |u² + v²| > ε₁    (4.67)

where ε₁ is the cut-off frequency to be determined experimentally. In this example (see Fig. 4.24), the image reconstructed with the complete inverse filter is totally corrupted by the noise, while the image reconstructed by removing the high frequencies (where the amplitude of the spectrum of H_R fluctuates around zero) with the cut-off threshold ε₁ = 60 has the noise sufficiently removed, despite the strong initial blurring induced by the linear motion.
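A short sketch of (4.67), assuming NumPy, a centred spectrum, and our own interpretation of ε₁ as the radius of the retained low-frequency disc (the function name and the small regularization term are ours):

```python
import numpy as np

def truncated_inverse_filter(G_nd, H_d, cutoff=60.0):
    """Inverse filter with a radial cut-off (Eq. 4.67): apply 1/H_D only inside
    the low-frequency disc, discarding the noise-dominated high frequencies."""
    M, N = H_d.shape
    u = np.arange(M)[:, None] - M / 2
    v = np.arange(N)[None, :] - N / 2
    inside = u ** 2 + v ** 2 <= cutoff ** 2
    H_r = np.zeros_like(H_d, dtype=complex)
    H_r[inside] = 1.0 / (H_d[inside] + 1e-12)   # small term avoids division by zero
    return G_nd * H_r
```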
4.7.3 Application of the Inverse Filter: Example 3
In this last example, image degradation is caused by atmospheric turbulence during acquisition. The degradation function is approximated by (4.60) with k = 0.0025, the control parameter corresponding to strong turbulence. Figure 4.25 shows the results of the inverse filter in the same way as in the previous examples. Also in this example, the complete inverse filter reconstructs a completely artifacted image, while, by adopting a cut-off frequency ε₁ = 65, the reconstruction is sufficient, as shown in Fig. 4.25.
4.8 Optimal Filter
When we do not have good knowledge of the degradation of the image, we can adopt the approach of searching for an optimal filter that, on the basis of an evaluation criterion, is able to reconstruct an image as close as possible, in an optimal sense, to the original one. The problem reduces, therefore, to finding an optimal criterion and demonstrating the performance of the resulting filter. Under these hypotheses, the optimal Wiener filter is presented and analyzed. This filter is optimal in the sense that it recovers an unknown signal corrupted by additive noise. In this context, the scheme of the reconstruction process of Fig. 4.1 is modified as in Fig. 4.26, where, given the observed (degraded) image, neither the degradation function nor the original image is known.
4.8.1 Wiener Filter
The results of a good image reconstruction are strongly tied to the possibility of modeling the noise (through assumptions), even when not many details of it are available.
Fig. 4.25 Reconstruction of the image degraded by atmospheric turbulence. The first row shows the original image; the image degraded with the H_D degradation function modeled by (4.60) with turbulence parameter k = 0.0025, i.e., strong turbulence; the image reconstructed with the complete inverse filter. The second row shows the original image blurred and with random noise; its reconstruction by applying the complete inverse filter; and its reconstruction by applying the inverse filter with the threshold ε₁ = 65 according to (4.67)
The Wiener filter aims to reconstruct an image that has been degraded by a spatially invariant linear process and by an additive noise source, which is not known in advance. Before analyzing the Wiener filter, it is useful to recall some concepts associated with the stochastic processes described in Sect. 6.9, Vol. I, where the histograms and the co-occurrence matrix were calculated considering the image as associated with a stochastic process. In that context, it was hypothesized that the image was affected by a random noise, having no prior knowledge of how the acquisition system had contaminated the image. In essence, considering the image as derived from a stochastic process, contaminated by an accidental random noise, is only a useful artifice. Although we know the causes that generate such noise, we do not have an analytical description of it. A knowledge, albeit limited, can be acquired by observing the noisy signal for a certain period, even if its behavior cannot be completely observed. A random variable can be thought of as influenced by an infinite ensemble of member functions. When recording a random event, this can be influenced by one or more member functions, and it is not possible to know exactly which one. To model the noise η, we consider it associated with an ergodic random variable. The definition of an ergodic variable derives from the two ways in which it is possible to calculate the average of a random variable.
Fig. 4.26 Block diagram of the optimal reconstruction filter of the degraded image
two ways in which it is possible to calculate the average of a random variable. For simplicity, consider a one-dimensional random signal η(t) that varies over time. It is possible to calculate an average by integrating a particular member function over time, or by combining the values of all the member functions evaluated at some point in time; the latter produces an ensemble average at a given instant of time. A random variable is called ergodic if and only if it meets the following conditions:

1. the time averages of all member functions are equal;
2. the ensemble average is constant over time;
3. averages (1) and (2) are numerically the same.
For an ergodic random variable, the average over time and the average over the ensemble of functions are interchangeable. The expectation operator E{η(t)} denotes the ensemble average (over the various sources of noise, for example, thermal fluctuations, digitization, etc.) of the random variable η calculated over time t. Under the ergodicity assumption, E{η(t)} also denotes the value obtained when any sample of the variable η(t) is averaged over time:
$$E\{\eta(t)\} = \int_{-\infty}^{+\infty} \eta(t)\,dt \qquad (4.68)$$
Even if we do not know η(t), we can calculate the autocorrelation function of η(t), which by definition is given by
$$R_\eta(\tau) = \eta(t) \ast \eta(-t) = \int_{-\infty}^{+\infty} \eta(t)\,\eta(t+\tau)\,dt \qquad (4.69)$$
which we know to be an even function. In the context of an ergodic random variable, the autocorrelation function is the same for all member functions (being a time average) and thus characterizes the whole ensemble. In other words, when the noise η(t) is associated with an ergodic random variable, even though the function itself is not known, we do have knowledge of its autocorrelation function, which is the only partial knowledge of η(t) available. Consequently, we can compute the Fourier transform of the autocorrelation function R_η(τ) and, remembering that η(t) is real, we get
$$P_\eta(u) = \mathcal{F}\{R_\eta(\tau)\} = \mathcal{F}\{\eta(t) \ast \eta(-t)\} = \aleph(u)\cdot\aleph(-u) = \aleph(u)\cdot\aleph^*(u) = |\aleph(u)|^2 \qquad (4.70)$$
which, as shown, corresponds to the power spectrum of the noise function η(t). Since the latter is real, the power spectrum P_η is also real, as is the autocorrelation function R_η(τ).
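The relation between (4.69) and (4.70) can be checked numerically with a minimal sketch of my own (not part of the original text), which computes the circular autocorrelation of a single white-noise realization and verifies that its Fourier transform coincides with the squared magnitude of the noise spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512
eta = rng.normal(0.0, 1.0, N)            # one realization of the noise

# circular autocorrelation, R(tau) = sum_t eta(t) * eta(t + tau)   (cf. 4.69)
R = np.array([np.sum(eta * np.roll(eta, -tau)) for tau in range(N)])

P_from_R   = np.real(np.fft.fft(R))        # Fourier transform of the autocorrelation
P_from_eta = np.abs(np.fft.fft(eta)) ** 2  # |FFT(eta)|^2                 (cf. 4.70)

print(np.allclose(P_from_R, P_from_eta))   # True: same power spectrum
```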
Fig. 4.27 Functional diagram of the Wiener filter based on minimizing the Mean Square Error (MSE)
This implies that it is possible to know the magnitude spectrum of the noise function η(t) without knowing its phase spectrum. In fact, in the context of an ergodic random variable, the ensemble is composed of infinitely many member functions that differ only in their phase spectrum. In real contexts, noise can often be satisfactorily modeled as an ergodic random variable. For example, repeated observations of white noise signals show that the measured power spectrum is essentially constant across frequencies. The Wiener filter is a linear noise-reduction filter. It is assumed that the degraded image g_ND(m, n) is derived from the ideal image f_I(m, n) corrupted by an additive noise η(m, n). The goal is to design a linear filter that minimizes the undesired effect induced by the noise η; in other words, we want to recover the source signal f_I(m, n) as faithfully as possible. The optimal filter (Fig. 4.27) to be designed should produce a reconstructed image g_R with characteristics as close as possible to those of the original image f_I. Without any knowledge of f_I and η, no noise-reduction filter can be obtained. A first assumption we can make is to consider the input image f_I and the additive noise η as associated with ergodic stochastic processes. This constitutes a partial knowledge of the original image and of the noise function, since it implies knowledge of the power spectra P_f and P_η of the image and of the noise, respectively. In other words, even if we do not know f_I and η exactly, we can state that, deriving from ergodic random processes, they come from ensembles of functions all having the same autocorrelation function and, consequently, the same power spectrum. The latter can be estimated, for example, by acquiring N realizations of f_I and η over time; this value can then be taken as representative of the respective ensemble of functions. The impulse response of the reconstruction filter h_R(m, n) is chosen to minimize the mean square reconstruction error
$$e_{RMS} = E\{e^2(m,n)\} = E\{[f_I(m,n) - g_R(m,n)]^2\} \qquad (4.71)$$
so that g_R is as close as possible to the ideal image f_I. E{•} denotes the expectation (average) operator. This approach, which reconstructs only an optimal estimate of the ideal image, follows from the consideration that, in general, a linear filter is not sufficiently robust to remove all the noise present in the image. A further assumption is that the statistics of the images f_I and g_D do not
depend on the spatial position of the pixels. This derives from the hypothesis that the image degradation process is an ergodic random process; stochastic processes with this characteristic are also called homogeneous or stationary. The problem of determining the impulse response of the filter h_R that minimizes the mean square error e(t) reduces to the minimization of a functional, which involves the following steps:

(a) derive a functional for the mean square error in terms of h_R, the impulse response of the filter;
(b) find an expression for the optimal impulse response h_R^0 in terms of the known power spectra P_f and P_η;
(c) develop an expression for the mean square error e(t) that includes the optimal impulse response h_R^0.

Based on the above theory, it can be shown that the optimal Wiener filter, expressed as a transfer function in the frequency domain, is given by
$$H_R^0(u,v) = \frac{P_{f_I g_D}(u,v)}{P_{g_D}(u,v)} = H_W(u,v) \qquad (4.72)$$
estimated as the ratio between the power spectrum of the cross-correlation function R_{f_I g_D}(τ) of the input image f_I and the degraded image g_D, and the power spectrum of the autocorrelation of the degraded image g_D. This means that the essential steps to estimate the Wiener filter are the following:

1. Digitize a sample of the input image f_I;
2. Compute the autocorrelation function R_f of the image f_I to obtain an estimate of the autocorrelation function R_{g_D} of the degraded image g_D;
3. Compute the Fourier transform of the autocorrelation function R_{g_D} to produce the power spectrum P_{g_D} of the degraded input image;
4. Digitize the input image without the noise η;
5. Compute the cross-correlation function R_{f_I g_D} between f_I and g_D to estimate the correlation between the input image and the degraded input image as follows:
$$R_{f_I g_D}(\tau) = f_I(t) \ast g_D(-t) = \int_{-\infty}^{+\infty} f_I(t)\,g_D(t+\tau)\,dt \qquad (4.73)$$
6. Compute the Fourier transform of R_{f_I g_D} to produce the power spectrum P_{f_I g_D};
7. Compute the optimal Wiener filter transfer function H_W(u, v) with the preceding equation;
8. If the filter is to be implemented as a convolution, compute the inverse Fourier transform of H_W to produce the impulse response h_W(m, n) of the Wiener filter.

In real applications, it is not always possible to have samples of the input image without noise f_I (step 4), or an approximation of the degraded image (step 5). In these cases, we can try to derive a functional form of the correlation functions and power spectra. For example, we could assume a constant power spectrum (white
noise) or a shape defined a priori. This is because noisy images or signals generally have a power spectrum that is flat, or that decreases toward the high frequencies more slowly than that of non-noisy images. A minimal sketch of the estimation steps listed above is given below.
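The following sketch is an assumed NumPy implementation of steps 1–8 (it is not code from the text): the cross- and auto-power spectra are accumulated over pairs of original/degraded sample images, and the conjugation convention is chosen so that, in the noise-free case, the estimated filter reduces to the inverse filter 1/H_D.

```python
import numpy as np

def wiener_from_spectra(f_samples, g_samples, eps=1e-12):
    """Estimate the Wiener transfer function H_w = P_{f g} / P_g  (cf. 4.72)
    from pairs of original/degraded images; in practice the spectra are
    averaged over many realizations, as in steps 1-7 above."""
    P_fg = np.zeros(g_samples[0].shape, dtype=complex)
    P_g = np.zeros(g_samples[0].shape)
    for f, g in zip(f_samples, g_samples):
        F = np.fft.fft2(f)
        G = np.fft.fft2(g)
        P_fg += F * np.conj(G)      # cross-power spectrum estimate (F * G^*)
        P_g += np.abs(G) ** 2       # power spectrum of the degraded image
    return P_fg / (P_g + eps)

# Step 8 / application as a deconvolution filter (illustrative usage):
# Hw = wiener_from_spectra(f_samples, g_samples)
# g_R = np.real(np.fft.ifft2(Hw * np.fft.fft2(g_observed)))
```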
4.8.1.1 Wiener Filter with Original Image f_I and Noise η Uncorrelated

If the input image is uncorrelated with the noise η (so that their cross-power spectrum vanishes), the optimal Wiener filter becomes
$$H_W(u,v) = \frac{P_{f_I}(u,v)}{P_{f_I}(u,v) + P_\eta(u,v)} \qquad (4.74)$$
In a real acquisition system, the input image is normally degraded by the transfer function H_D of the optical system (which typically produces a slight blur of the input image f_I) and by the additive noise η(m, n); the overall degraded image is denoted by g_ND (from now on, the subscript ND indicates the combination of the blurring degradation D and of the random noise η). We have already seen with Eq. (4.2) that this degradation g_ND of the image can be considered as obtained by adding the effects of the linear system (convolution between the input image and the impulse response h_D responsible for the blurring) and of the additive noise η, as follows:
$$g_{ND}(m,n) = f_I(m,n) \ast h_D(m,n) + \eta(m,n) \qquad (4.75)$$
Normally, we want to apply the optimal Wiener filter H_W to reproduce a good approximation g_R(m, n) of the input image f_I. In other words, once the transfer function H_W of the Wiener filter has been estimated from the observed spectra, we intend to apply it as an inverse filter, deconvolving the degraded image g_ND with the impulse response h_W of the Wiener filter so as to recover, as far as possible, an approximation g_R of the input image f_I (see Fig. 4.27). To derive the Wiener filter, Eq. (4.74), we rewrite the degradation Eq. (4.75) in the Fourier domain:
$$G_{ND}(u,v) = F_I(u,v)\cdot H_D(u,v) + \aleph(u,v) = \mathcal{K}(u,v) + \aleph(u,v) \qquad (4.76)$$
where K(u, v) = F_I(u, v)·H_D(u, v) represents the degraded image G_D, i.e., the convolution of the original image F_I with the degradation function H_D. Under the condition of uncorrelated image and noise, as a measure of the quality of the reconstructed image g_R according to the criterion of minimum mean square error, we rewrite (4.71) in the Fourier domain as follows:
$$e^2_{RMS} = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} |F_I(u,v) - G_R(u,v)|^2\, du\, dv \qquad (4.77)$$
The goal is to find a functional T(m, n) that, applied to the noisy degraded image g_ND and deconvolved with the degradation function h_D, produces a reconstruction g_R that is as good an estimate as possible of the original image f_I. In the Fourier domain, we would then have
$$G_R(u,v) = \frac{T(u,v)\cdot G_{ND}(u,v)}{H_D(u,v)} \qquad (4.78)$$
which, substituted into (4.77) and developed, gives
$$e^2_{RMS} = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \left| \frac{\mathcal{K}(u,v)}{H_D(u,v)} - \frac{T(u,v)\,[\mathcal{K}(u,v)+\aleph(u,v)]}{H_D(u,v)} \right|^2 du\, dv$$
$$\;= \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} |H_D(u,v)|^{-2}\left\{ |\mathcal{K}(u,v)|^2\,[1-T(u,v)]^2 + |\aleph(u,v)|^2\, T^2(u,v) \right\} du\, dv \qquad (4.79)$$
Assuming ℵ(u, v) and K(u, v) uncorrelated, the integral of their product is zero, so that
$$\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \mathcal{K}(u,v)\,\aleph(u,v)\, du\, dv = 0$$
Minimizing e_RMS is equivalent to minimizing the integrand of Eq. (4.79) with respect to the function T(u, v) for all (u, v); differentiating with respect to T(u, v) and setting the derivative to zero, we obtain
$$-2\,|\mathcal{K}(u,v)|^2\,[1 - T(u,v)] + 2\,|\aleph(u,v)|^2\, T(u,v) = 0 \qquad (4.80)$$
from which the optimal functional T(u, v) is
$$T(u,v) = \frac{|\mathcal{K}(u,v)|^2}{|\mathcal{K}(u,v)|^2 + |\aleph(u,v)|^2} \qquad (4.81)$$
Substituting the optimal functional T(u, v) into the approximate reconstruction Eq. (4.78) of the original image, we have
$$G_R(u,v) = \frac{T(u,v)\cdot G_{ND}(u,v)}{H_D(u,v)} = \frac{|\mathcal{K}(u,v)|^2 \cdot G_{ND}(u,v)}{\left[|\mathcal{K}(u,v)|^2 + |\aleph(u,v)|^2\right] H_D(u,v)} \qquad (4.82)$$
According to Fig. 4.27, the factor multiplying G_ND(u, v) in (4.82) is precisely the reconstruction filter H_R(u, v) in the Fourier domain, corresponding to the Wiener filter H_W(u, v). Recalling from (4.76) (first term) that K(u, v) = F_I(u, v)·H_D(u, v) represents the original image degraded by the function H_D(u, v), and substituting into the previous expression, the approximate reconstruction equation of the original image becomes
$$G_R(u,v) = \frac{|H_D(u,v)|^2}{\left[|H_D(u,v)|^2 + \dfrac{|\aleph(u,v)|^2}{|F_I(u,v)|^2}\right] H_D(u,v)} \cdot G_{ND}(u,v) = H_W(u,v)\, G_{ND}(u,v) \qquad (4.83)$$
where H_W(u, v) is the Wiener deconvolution filter in the frequency domain. Figure 4.28 shows the functional deconvolution scheme for the reconstruction of the spectrum G_R(u, v) based on the Wiener filter, in analogy to the scheme described for the inverse filter (see Fig. 4.22). In this case, the functional scheme is identical up to the degraded image G_ND(u, v). The second part of the diagram shows how the deconvolution action of the Wiener filter H_W(u, v) replaces that of the inverse filter, reconstructing the spectrum G_R(u, v), an approximation of that of the original image F(u, v). In particular, the one-dimensional graphs of two types of Wiener filter spectrum are shown (two variants of (4.74), as we shall see later). The diagram shows the white noise, with the constant trend of the corresponding spectrum, while the blur transfer function is the typical one generated by an optical system. Observe that the Wiener filter transfer function used for deconvolution increases toward the low frequencies (to compensate for the blur effects), while at the high frequencies it tends to decrease to zero, to attenuate the additive noise.
Fig. 4.28 Functional scheme of the process of degraded image reconstruction (blurred + random noise) using the Wiener filter
The dotted line shows the spectrum of the truncated inverse filter. In many real images, adjacent pixels are strongly correlated, while pixels far apart are much less correlated. It follows that the autocorrelation function of such images normally decreases to zero. It has already been pointed out that the power spectrum of an image is the Fourier transform (real and even) of its autocorrelation function; one can therefore state that the power spectrum of an image also generally decreases toward the high frequencies. Noise sources typically have a power spectrum that is uniform, or that decreases with increasing frequency less rapidly than the power spectrum of the image. Consequently, the power spectrum of the image dominates over that of the noise at the low frequencies, while toward the high frequencies the noise prevails. This implies that the Wiener deconvolution (inverse) filter must decrease toward the high frequencies in order to filter out the noise present there. It is useful to express the transfer function of the Wiener filter (4.83) also as a function of the power spectra of the original image P_{f_I}(u, v) and of the additive noise P_η(u, v); several versions exist in the literature. Starting from (4.83), the Wiener filter can be rewritten as follows:
$$H_W(u,v) = \frac{H_D^*(u,v)}{|H_D(u,v)|^2 + P_\eta(u,v)/P_{f_I}(u,v)} = \frac{1}{H_D(u,v)}\cdot\frac{|H_D(u,v)|^2}{|H_D(u,v)|^2 + P_\eta(u,v)/P_{f_I}(u,v)} \qquad (4.84)$$
where H_D(u, v) represents the degradation transfer function;
H_D^*(u, v) is the complex conjugate of H_D, obtained by negating the sign of its imaginary part;
P_{f_I}(u, v) is the power spectrum of the original image, i.e., the Fourier transform of its autocorrelation function;
P_η(u, v) is the power spectrum of the noise;
|H_D(u, v)|^2 = H_D^*(u, v)·H_D(u, v) is the power spectrum (remembering (4.70)) of the autocorrelation of the degradation function H_D.

This expression motivates the final version of the Wiener filter, Eq. (4.84). If the additive random noise component is zero, that is, P_η(u, v) is 0 for all frequencies u and v, the Wiener filter reduces to the classical inverse filter H_W = 1/H_D which, as is known, eliminates the blurring introduced by the optical system. An approximation of the Wiener filter is obtained by considering the ratio P_η/P_{f_I} equal to a constant K. In this case, the simplified version of (4.84) is given by
$$H_W(u,v) = \frac{1}{H_D(u,v)}\cdot\frac{|H_D(u,v)|^2}{|H_D(u,v)|^2 + K} = \frac{1}{H_D(u,v)}\cdot\frac{|H_D(u,v)|^2}{|H_D(u,v)|^2 + [1/SNR]} \qquad (4.85)$$
where the constant K is calculated empirically after several attempts, choosing the value that produces the best result in removing the degradation. K can be interpreted as the inverse of the signal-to-noise ratio, i.e., K = 1/SNR with SNR = P_{f_I}/P_η, and its evaluation is done by trial and error, assigning different values to the ratio, for example, SNR = 1, 5, 10, .... In this way, it is not necessary to calculate the spectra of the degradation and of the original image; in fact, this simplification of the filter is useful when the two spectra are not known or not easily estimated. The approximation is acceptable when the noise varies with the signal: in this case, the power spectrum of the degradation tends to be comparable with that of the entire image, and K remains approximately constant. A further advantage can be obtained by letting K vary as a function of the frequencies (u, v), in particular assuming high values of K toward the high frequencies, where the noise dominates, and vice versa (see the constrained least-squares filter further on). In the case of white noise, the spectral coefficients of the noise are constant, so as the signal varies, the ratio does not remain constant and this approximation is not acceptable. The best conditions for the Wiener filter approximation occur only when the signal is dominant with respect to the degradation. Finally, we recall that this approximation of the Wiener filter is chosen when there is no knowledge of the degradation model. Figure 4.28 shows two types of Wiener filter spectrum obtained from (4.84) and (4.85), indicated in the figure by H_W(u, v) and H_{NSR}(u, v), respectively. We can observe the remarkable difference of the Wiener filters compared to the truncated inverse filter: their spectrum grows at the low frequencies, to counter the blurring effect introduced by H_D(u, v), and decreases toward the high frequencies to attenuate the noise introduced by ℵ(u, v). Finally, notice that, unlike the truncated inverse filter, the attenuation toward the high frequencies does not occur abruptly.
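A minimal sketch of the parametric Wiener deconvolution of (4.85) is given below (an assumed NumPy implementation; the Gaussian transfer-function helper and the specific trial values of K are illustrative and not taken from the text).

```python
import numpy as np

def wiener_deconvolve(g, H, K):
    """Parametric Wiener deconvolution, Eq. (4.85):
    H_W = (1/H_D) * |H_D|^2 / (|H_D|^2 + K), with K = 1/SNR tuned by trial and error.

    g : degraded image; H : degradation transfer function (same shape, FFT layout)."""
    G = np.fft.fft2(g)
    H2 = np.abs(H) ** 2
    Hw = np.conj(H) / (H2 + K)     # algebraically equal to (1/H)*H2/(H2+K), stable where H -> 0
    return np.real(np.fft.ifft2(Hw * G))

def gaussian_otf(shape, sigma):
    """Illustrative Gaussian blur transfer function built directly in the frequency domain."""
    u = np.fft.fftfreq(shape[0])
    v = np.fft.fftfreq(shape[1])
    U, V = np.meshgrid(u, v, indexing="ij")
    return np.exp(-2.0 * (np.pi ** 2) * (sigma ** 2) * (U ** 2 + V ** 2))
```

In practice one would try K = 1/SNR for SNR = 1, 5, 10, ... and keep the visually or numerically (e.g., PSNR) best reconstruction, as described above.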
4.8.2 Analysis of the Wiener Filter

In real images, adjacent pixels are closely correlated, while pixels far apart are much less correlated. Under the hypothesis that the ideal image f_I(m, n) represents a homogeneous stochastic process (the average over the entire image, ⟨f_I⟩, is constant), its autocorrelation function depends only on the distance between pixels:
$$R_{f_I}(i,j) = \sum_{l=0}^{M-1}\sum_{k=0}^{M-1} f_I(l,k)\cdot f_I(i+l,\, j+k) \qquad (4.86)$$
From the above considerations, the autocorrelation function of a real homogeneous image always has a peak at the origin and decreases in a symmetrical way. By the properties of the autocorrelation function, when the Fourier transform is applied, the spectrum of the ideal image f_I corresponds to the Fourier transform of its autocorrelation function; it follows that the Fourier spectrum of the ideal image f_I also decreases symmetrically toward the high frequencies. It can be shown that any image has a single power spectrum, but the opposite is not true. This greatly limits the processing operations (classification, segmentation, pattern recognition, etc.) on images in the frequency domain, considering that, for a given spectrum, there can be images with different information content. The noise power spectrum is normally flat. As can also be seen from the graphs of Fig. 4.28, the signal spectrum normally dominates at low frequencies, while that of the noise dominates at high frequencies. Consequently, the signal-to-noise ratio SNR(u, v) = P_f(u, v)/P_η(u, v) has high values at the low frequencies and decreases toward the high frequencies, where the signal and noise trends are reversed. This explains the characteristic Wiener filter curve, which behaves like a band-pass filter: it performs a dual function, acting as an inverse filter at low frequencies (increasing trend) and as a smoothing filter at high frequencies (decreasing trend, to attenuate the frequencies where the noise dominates). The Wiener filter, based on the optimality criterion in the sense of the mean square error, is not very effective at improving the visual quality of a degraded image, because the degradation is attenuated with uniform weight over all the pixels of the image. The human visual system, on the other hand, tends to be adaptive, in the sense that it tolerates the extent of degradation differently in the various light and dark areas of the observed scene. Compared with the human visual system, the Wiener filter tends to smooth the image even in areas where it is not necessary. This is also found when trying to restore images degraded by the motion between the acquisition system and moving objects, in particular to correct the effect of rotation, which degrades the image unevenly; similarly, when correcting the astigmatism and the field curvature introduced by the lens. To reduce these limitations, some variants of this filter and several other filters have been proposed, such as the filter based on the equalization of the power spectrum, the parametric Wiener filter, the geometric mean filter, and the constrained least-squares filter.
4.8.3 Application of the Wiener Filter: One-Dimensional Case

Under the hypothesis of uncorrelated signal and noise, the Wiener filter transfer function, rewriting (4.74), is
$$H_W(u) = \frac{P_{f_I}(u)}{P_\eta(u) + P_{f_I}(u)} \qquad (4.87)$$
and the mean square error is given by
$$e^2_W = \int_{-\infty}^{+\infty} \frac{P_{f_I}(u)\,P_\eta(u)}{P_{f_I}(u) + P_\eta(u)}\, du = \int_{-\infty}^{+\infty} P_\eta(u)\cdot H_W(u)\, du \qquad (4.88)$$
Figure 4.29 shows the entire deconvolution process with the Wiener filter for the reconstruction of a one-dimensional input signal f_I(m) degraded by a smoothing filter H_D(u) and contaminated by random white noise. In this example, the signal-to-noise ratio (SNR) is of the order of unity. The first row shows the original signal, the signal g_D(m) after the blurring degradation, and the random white noise η(m). In the second row we observe the signal g_D(m) + η(m) after the degradation and the addition of the random noise; the reconstruction of the signal f_I(m) starting from the degraded signal g_D(m) using the inverse filter; and the reconstruction of the original signal starting from g_D(m) + η(m) using the complete inverse filter. It should be noted that the original signal is completely reconstructed by the inverse filter when starting from g_D(m), while the reconstruction is very noisy and unusable when starting from the signal g_ND(m) that also includes the random noise. The third row shows the signal reconstructed with the truncated inverse filter, which reproduces an approximation g_R(m) of the original signal with acceptable noise. Finally, the result of the reconstruction obtained by applying the one-dimensional Wiener filter, expressed by (4.84) and (4.85) respectively, is highlighted. The results of the Wiener filter are clearly better than those obtained by applying the inverse filter. The Wiener filter behaves like a low-pass filter, which tends to recover an approximation of the original signal on the one hand and to attenuate the random noise on the other.
Fig. 4.29 Evaluation of the Wiener filter with respect to the inverse filter, applied to a one-dimensional signal degraded by blurring and random noise (description in the text)
Nevertheless, the reconstructed signal still exhibits a residual error, due to the presence of some low-frequency noise components that the filter is unable to remove.
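The one-dimensional case of (4.87) can be reproduced with a few lines (an illustrative sketch: the test signal is arbitrary, and the power spectrum of the original signal is computed directly from it only to exercise the formula; in a real case it would have to be estimated).

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
f = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)   # "original" signal
sigma = 0.5
g = f + rng.normal(0.0, sigma, t.size)          # degraded signal: additive white noise only

Pf = np.abs(np.fft.fft(f)) ** 2                 # power spectrum of f (assumed known here)
Peta = np.full(t.size, t.size * sigma ** 2)     # flat spectrum of white noise, E|N(u)|^2 = N*sigma^2
Hw = Pf / (Pf + Peta)                           # Eq. (4.87)
g_R = np.real(np.fft.ifft(Hw * np.fft.fft(g)))  # Wiener-filtered reconstruction
```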
4.8.4 Application of the Wiener Filter: Two-Dimensional Case

In this section, we present the results obtained with the Wiener filter applied to the same degraded images used for the inverse filter. Figure 4.30 shows the image degraded with a 5 × 5 Gaussian smoothing filter with variance 1 and with the addition of random noise (σ = 10) with normal distribution. The figure also reports the results obtained with the pseudo-inverse filter with a threshold of 0.6 (according to Eq. (4.66)) and those obtained with the Wiener filter expressed by (4.84). The better performance of the Wiener filter is evident, in terms of image sharpness (the blurring is strongly attenuated) and of random noise attenuation. Figure 4.31 shows the image degraded by linear motion and random noise, as described in the caption; the transfer function responsible for this degradation (motion blur), caused by the relative motion between object and camera with open shutter, can be modeled according to (4.59). Also in this case, under the same degradation conditions, the Wiener filter yields better results than the pseudo-inverse filter. Figure 4.32 shows the results of the Wiener filter applied to remove the image degradation caused by atmospheric turbulence and random noise. In the considered examples, the application of Eq. (4.84) is motivated by the hypothesis of uncorrelated signal and noise, and by assuming that the power spectra P_{f_I} of the image and P_η of the noise are known. If these power spectra are not known, one can proceed by trial, hypothesizing an estimate of the ratio K = P_η/P_{f_I}, thus obtaining an approximation G_R(u, v) of the spectrum of the original image through (4.85) (also called the parametric Wiener filter).
Fig. 4.30 Reconstruction of the degraded image by deconvolution with the Wiener filter. From left to right are reported the degraded image with a 5 × 5 Gaussian smoothing filter with variance 1 and with the addition of random noise; its reconstruction by applying the pseudo-inverse filter with a threshold of 0.6 according to (4.66); and the reconstructed image by applying the Wiener filter expressed by (4.84)
Fig. 4.31 Reconstruction by deconvolution with the Wiener filter of an image degraded by linear motion. From left to right are reported the degraded image with the degradation function H_D (modeled by (4.59) with the parameters a = 0.125, b = 0, V = 1) with added random noise; its reconstruction by applying the inverse filter with a threshold of 60 according to (4.67); and the reconstructed image by applying the Wiener filter expressed by (4.84)
Fig. 4.32 Reconstruction of the image degraded by atmospheric turbulence. From left to right are shown the degraded image with the degradation function H_D modeled by (4.60) with turbulence level parameter k = 0.0025 (i.e., strong turbulence) and with added random noise; its reconstruction by applying the inverse filter with a threshold of 65 according to (4.67); and the reconstructed image obtained by applying the Wiener filter expressed by (4.84)
In real applications, the results of the Wiener filter are conditioned by the input signal or image, which is not always produced under the conditions of an ergodic random process as assumed by the Wiener filter. In fact, in Fig. 4.28, it is possible to observe the strong oscillations of the transfer function H_W caused by the intrinsic noise present in the image, which is not correctly modeled. This, in practice, violates the starting conditions required by the Wiener filter, which assumes a uniform power spectrum for the noise. To manage the oscillations of the filter H_W(u, v), it may be useful to apply a smoothing. Another solution is the use of the parametric Wiener filter which, for appropriate values of the parameter K, has a transfer function profile without oscillations, as shown in Fig. 4.28 and indicated with H_{WNSR}(u, v).
Even when the signal or image is contaminated in a way that violates the ergodicity assumptions, the Wiener filter H_W(u, v), operating in this quasi-optimal context, still produces good results. From the analysis of the above results, it can be observed that the inverse filter performs reasonably well as long as the noise is contained within certain limits, while the best results are obtained with the Wiener filter even when the noise grows considerably.
4.9 Power Spectrum Equalization—PSE Filter

Like the Wiener filter, this is a filter that uses only the amplitude spectrum of the degraded image to derive the reconstruction filter H_PSE(u, v) in the Fourier domain, by imposing the condition that the power spectrum of the non-degraded image P_{f_I}(u, v) be equal to the power spectrum of the estimated (desired) image:
$$P_{f_I}(u,v) = P_{\hat f_I}(u,v) \iff |F(u,v)|^2 = |\hat F(u,v)|^2$$
Basically, the spectrum of the degraded image G(u, v) is used to derive the filter that recovers the image:
$$\hat F_I(u,v) = H_{PSE}(u,v)\cdot G(u,v)$$
From the constraint of equality of the power spectra and, as imposed for the Wiener filter, assuming that the noise η and the signal/image f are independent (uncorrelated), substituting the linear degradation Eq. (4.49) into the previous equation we have
$$\hat F_I(u,v) = H_{PSE}(u,v)\cdot G(u,v) = H_{PSE}(u,v)\,[H_D(u,v)\,F_I(u,v) + \aleph(u,v)]$$
Since η and f are uncorrelated, by the correlation theorem we have
$$\aleph^*(u,v)\,F_I(u,v) = 0 \qquad \text{and} \qquad F_I^*(u,v)\,\aleph(u,v) = 0$$
so that
$$|\hat F_I(u,v)|^2 = \hat F_I(u,v)\,\hat F_I^*(u,v) = |H_{PSE}(u,v)|^2\left[|H_D(u,v)|^2\,|F_I(u,v)|^2 + |\aleph(u,v)|^2\right]$$
By virtue of the criterion of equality of the power spectra, P_{f_I}(u, v) = P_{\hat f_I}(u, v), the previous equation can be rewritten as
$$|F_I(u,v)|^2 = |H_{PSE}(u,v)|^2\left[|H_D(u,v)|^2\,|F_I(u,v)|^2 + |\aleph(u,v)|^2\right]$$
Solving with respect to |H_PSE(u, v)|, after appropriate simplifications, we obtain the PSE filter:
$$H_{PSE}(u,v) = \left[\frac{P_{f_I}(u,v)}{|H_D(u,v)|^2\cdot P_{f_I}(u,v) + P_\eta(u,v)}\right]^{\tfrac{1}{2}} = \left[\frac{1}{|H_D(u,v)|^2 + \dfrac{P_\eta(u,v)}{P_{f_I}(u,v)}}\right]^{\tfrac{1}{2}} \qquad (4.89)$$
with all the symbols already defined previously. Note that this filter is calculated only with the amplitude of H_D(u, v) and is not ill-conditioned in the areas where H_D tends to zero, thanks to the presence of the term P_η/P_{f_I}.
Fig. 4.33 Power Spectrum Equalization filter (PSE). The first row shows the transfer function curves of the Wiener filter and the equalized PSE filter, respectively. In the second row, from left to right, are reported the degraded image with a 5×5 Gaussian filter and variance 1 and with the addition of random noise; its reconstruction applying the Wiener filter expressed by the (4.84); and restored image with the PSE filter
Figure 4.33 shows this situation by comparing the profiles of the Wiener and PSE filters. The estimated image is given by f̂(m, n) = h_PSE(m, n) ∗ g(m, n), where h_PSE(m, n) is obtained by the inverse Fourier transform of H_PSE(u, v). An alternative criterion for calculating this filter is to impose that the following functional be minimized:
$$\sum_u \sum_v \left[|F_I(u,v)|^2 - |\hat F_I(u,v)|^2\right] = 0$$
or, by using Parseval's theorem, to minimize the functional
$$\sum_m \sum_n \left[|f_I(m,n)|^2 - |\hat f(m,n)|^2\right] = 0.$$
This filter in the literature is also known as a homomorphic filter. It is often used as an alternative to the Wiener deconvolution filter and for particular degraded images can give good results.
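A compact sketch of the PSE filter of (4.89) follows (an illustrative NumPy implementation; the function name, the small stabilizing constant, and the assumption that the power spectra are available are mine).

```python
import numpy as np

def pse_filter(H_D, P_f, P_eta, eps=1e-12):
    """Power-spectrum-equalization (homomorphic) filter of Eq. (4.89):
    H_PSE = [ P_f / (|H_D|^2 * P_f + P_eta) ]^(1/2)."""
    return np.sqrt(P_f / (np.abs(H_D) ** 2 * P_f + P_eta + eps))

# Illustrative restoration of a degraded image g:
# f_hat = np.real(np.fft.ifft2(pse_filter(H_D, P_f, P_eta) * np.fft.fft2(g)))
```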
4.10 Constrained Least Squares Filtering

Almost all the filters operating in the frequency domain introduce artifacts in the spatial domain on the reconstructed image. Moreover, the Wiener filter has the constraint, not always satisfiable, that the power spectra of the non-degraded image and of the noise must be known. To mitigate such artifacts and constraints, Constrained Least Squares (CLS) filters introduce a smoothness criterion that minimizes the effect of strong oscillations in the reconstructed image. CLS filtering only requires knowledge of the mean and variance of the noise; as done previously, these parameters can usually be estimated from the degraded image. The observed degraded image g(m, n) can be expressed, according to the degradation Eq. (4.48) and the definition of discrete spatial convolution (between the original image f(m, n) and the impulse response h(m, n)), in vector-matrix notation as follows:
$$\mathbf{g} = \mathbf{H}\,\mathbf{f} + \boldsymbol{\eta} \qquad (4.90)$$
where f, g, and η are column vectors of size MN × 1, while the matrix H, whose elements are given by the definition of convolution, has the considerable dimensions MN × MN for an M × N image. The meaning of the symbols remains that of the previous paragraphs. The key to the problem now becomes the calculation of the matrix H which, besides being large, is very sensitive to noise, considering that Eq. (4.90) is ill-conditioned, producing uncertain solutions driven by strong noise oscillations. One solution used to reduce the oscillations is to formulate a criterion of optimality. Equation (4.90) can be rewritten in the following form:
$$\mathbf{g} - \mathbf{H}\,\mathbf{f} = \boldsymbol{\eta} \qquad (4.91)$$
One way to take the noise into account is to introduce a constraint on the norm of both members of (4.91), that is,
$$\|\mathbf{g} - \mathbf{H}\mathbf{f}\|^2 = \|g - h \ast \hat f\|^2 = \|\boldsymbol{\eta}\|^2 \qquad (4.92)$$
where f̂ is an estimate of the original (undegraded) image f. Jain [3] proposed an approach that minimizes the functional C = ‖p(m, n) ∗ f̂(m, n)‖², subject to (4.92). Normally, the Laplace operator (see Sect. 1.12) is chosen to measure the smoothness of the estimated image. For example, a standard smoothness function is the impulse response p(m, n) of the Laplacian filter, whose mask is given by
$$p(m,n) = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{bmatrix} \qquad (4.93)$$
which is known to be a good approximation of the Laplacian operator ∇² (a high-pass filter). In other words, we want to obtain a reconstructed image that is as smooth as possible but not very different from the original one. Passing to the frequency domain, the functional to be minimized becomes
$$\mathcal{F}\{C\}(u,v) = \|P(u,v)\,\hat F(u,v)\|^2 \qquad (4.94)$$
subject to
$$\|G(u,v) - H(u,v)\cdot\hat F(u,v)\|^2 = \|\aleph(u,v)\|^2 \qquad (4.95)$$
The Fourier transform of the Laplacian (4.93) is given by
$$P(u,v) = -4\pi^2(u^2 + v^2) \qquad (4.96)$$
The transfer function of the CLS filter is obtained through the Lagrange multiplier method. A possible solution is given by
$$H_{CLS}(u,v) = \frac{H^*(u,v)}{|H(u,v)|^2 + \gamma\,|P(u,v)|^2} = \frac{1}{H(u,v)}\cdot\frac{|H(u,v)|^2}{|H(u,v)|^2 + \gamma\,|P(u,v)|^2} \qquad (4.97)$$
where γ is an adjustment factor evaluated experimentally in relation to the type of degraded image, and P(u, v) is the Fourier transform of the function representing the smoothness criterion, given by (4.96). The reconstructed image is obtained by taking the inverse Fourier transform of the product between the transfer function H_CLS of the CLS filter and the Fourier transform G of the degraded image (signal + noise):
$$g_R(m,n) = \mathcal{F}^{-1}\{\hat F(u,v)\} = \mathcal{F}^{-1}\{G(u,v)\cdot H_{CLS}(u,v)\} \qquad (4.98)$$
We recall that the value of the parameter γ controls how much weight the smoothness condition has, while the trend of P(u, v) determines how strongly the different frequencies are influenced by the smoothness condition. The parameter γ can be adjusted iteratively with the following criterion: first the residual
$$T(\gamma) = \|G(u,v) - H(u,v)\cdot\hat F(u,v)\|^2$$
is evaluated; then γ is incremented if T(γ) < ‖ℵ(u, v)‖² (and decremented otherwise). Starting from an initial value of γ, T(γ) is evaluated through an iterative process; if it is sufficiently close to ‖ℵ(u, v)‖², the process stops and the estimated image is inspected. If the result is not acceptable, a new value of γ is chosen and a new iteration is carried out. Recall that the spectral power ‖ℵ(u, v)‖² corresponds to the variance of the noise if the noise has zero mean. If γ = 0, the CLS filter becomes the inverse filter, while it becomes equal to the Wiener filter if γ = P_η(u, v) and P_f(u, v) = 1/|P(u, v)|². Figure 4.34 shows the results of the CLS filter applied to the test images Lena and Cameraman blurred, respectively, with a Gaussian filter and with linear motion, plus random noise (the same degradations as before). The images of the third column correspond to the value of γ that best satisfies Eq. (4.92).
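A sketch of the CLS filter (4.97) is given below (an assumed implementation: the Laplacian criterion P(u, v) of (4.96) is sampled on the normalized discrete frequency grid, and γ is passed in rather than adjusted automatically; the iterative adjustment of γ described above would simply wrap this function in a loop comparing ‖G − H·F̂‖² with ‖ℵ‖²).

```python
import numpy as np

def cls_filter(H, shape, gamma):
    """Constrained least-squares filter, Eq. (4.97), with the Laplacian
    criterion P(u,v) of Eq. (4.96) sampled on the discrete frequency grid."""
    u = np.fft.fftfreq(shape[0])
    v = np.fft.fftfreq(shape[1])
    U, V = np.meshgrid(u, v, indexing="ij")
    P = -4.0 * np.pi ** 2 * (U ** 2 + V ** 2)
    return np.conj(H) / (np.abs(H) ** 2 + gamma * np.abs(P) ** 2)

def cls_restore(g, H, gamma):
    """Apply (4.98): inverse FFT of G(u,v) * H_CLS(u,v)."""
    G = np.fft.fft2(g)
    return np.real(np.fft.ifft2(cls_filter(H, g.shape, gamma) * G))
```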
4.11 Geometric Mean Filtering

A very general equation of the inverse filter and of the Wiener filter, from which it is possible to derive different filter transfer functions in the frequency domain, is the following:
Fig. 4.34 Application of the CLS filter (panel labels: γ = 0.0684, iter. 5; γ = 0.00348, iter. 8; γ = 0.0904, iter. 11; γ = 0.0259, iter. 6; γ = 0.00923, iter. 10; γ = 0.0972, iter. 13). The images of the third column are those with the optimal γ value which conditions the functional (4.92) in the sense of least squares. The images of the second and fourth columns relate, respectively, to the previous and following iteration obtained with the wrong noise parameter
$$H_{GM}(u,v) = \left[\frac{H_D^*(u,v)}{|H_D(u,v)|^2}\right]^{\alpha} \left[\frac{H_D^*(u,v)}{|H_D(u,v)|^2 + \beta\,\dfrac{P_\eta(u,v)}{P_{f_I}(u,v)}}\right]^{1-\alpha} \qquad (4.99)$$
The symbols keep the same meaning, while the parameters α and β can assume real positive constant values. With α = 1/2 and β = 1, we obtain the power spectrum equalization filter H_PSE already considered previously. With α = 1, the previous equation reduces to the generic deconvolution (inverse) filter H(u, v) = 1/H_D(u, v). With α = 1/2, the previous equation becomes the mean between the generic deconvolution (inverse) filter and the Wiener deconvolution filter; from this derives the name geometric mean filter, even though the name is associated with the general equation. The reconstructed image is obtained by taking the inverse Fourier transform of the product between the transfer function H_GM and the Fourier transform G_ND of the degraded image (signal + noise):
$$g_R(m,n) = \mathcal{F}^{-1}\{G_R(u,v)\} = \mathcal{F}^{-1}\{G_{ND}(u,v)\cdot H_{GM}(u,v)\} \qquad (4.100)$$
With α = 0, we obtain the so-called parametric Wiener filter:
$$H_{PW}(u,v) = \frac{H_D^*(u,v)}{|H_D(u,v)|^2 + \beta\,\dfrac{P_\eta(u,v)}{P_{f_I}(u,v)}} \qquad (4.101)$$
Fig. 4.35 Application of the Geometric Mean filter. For the two degraded images, Lena with blurring and random noise, cameraman with blurring from linear motion and random noise, the best results are obtained, respectively, for β = 20 and β = 6
and
$$g_R(m,n) = \mathcal{F}^{-1}\{G_R(u,v)\} = \mathcal{F}^{-1}\{G_{ND}(u,v)\cdot H_{GM}(u,v)\}. \qquad (4.102)$$
Basically, we have the Wiener filter equation with the parameter β used as an adjustment parameter. In fact, with β = 1 this filter coincides with the standard Wiener filter, while with β = 0 it becomes the inverse filter; varying β between 0 and 1, the behavior of the filter lies between that of these two filters. The parametric Wiener filter with values of β lower than one, and the geometric mean filter, seem to give good results compared to many of the other filters considered. Remember that the filters obtained from the previous general equation are applicable in the context of image blurring attributable to space-invariant processes and of noise uncorrelated with the signal; furthermore, the blurring of the image and the additive noise must not be very marked. Figure 4.35 shows the results of the GM filter applied to the usual test images degraded by blurring and random noise of the same magnitude as in the preceding figures. The best results are obtained for β = 20 with PSNR = 63 dB for the Lena image and β = 6 with PSNR = 63 dB for the cameraman image.
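A sketch of the general filter (4.99) follows (illustrative: NumPy's principal branch is used for the fractional powers of the complex transfer functions, and the ratio P_η/P_{f_I} is passed in as a precomputed array or scalar; the stabilizing constant is an assumption).

```python
import numpy as np

def geometric_mean_filter(H, P_eta_over_Pf, alpha=0.5, beta=1.0, eps=1e-12):
    """General filter of Eq. (4.99): alpha = 1 gives the inverse filter,
    alpha = 0 the parametric Wiener filter (4.101), alpha = 1/2 the geometric mean."""
    Hc = np.conj(H)
    H2 = np.abs(H) ** 2
    inverse_part = Hc / (H2 + eps)                            # H* / |H|^2  (i.e., 1/H)
    wiener_part = Hc / (H2 + beta * P_eta_over_Pf + eps)      # parametric Wiener term
    return (inverse_part ** alpha) * (wiener_part ** (1.0 - alpha))

# Restoration, as in (4.100)/(4.102):
# g_R = np.real(np.fft.ifft2(np.fft.fft2(g_ND) * geometric_mean_filter(H_D, K, alpha, beta)))
```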
4.12 Nonlinear Iterative Deconvolution Filter

The reconstruction methods described so far are linear. In recent years, various nonlinear, iterative reconstruction algorithms have been developed, offering better results than the linear ones, which normally require only one step. A nonlinear iterative algorithm is that of Lucy–Richardson (LR), based on maximum likelihood, where the image is modeled as a Poisson stochastic process.
Fig. 4.36 Application of the nonlinear iterative deconvolution filter by Lucy–Richardson. For the two degraded images, Lena with blurring and random noise, cameraman with blurring from linear motion and random noise, the best results are obtained, respectively, by setting 20 iterations
The maximum likelihood functional is given by the following equation in the spatial domain:
$$\hat f_{k+1}(m,n) = \hat f_k(m,n)\left[h(-m,-n) \ast \frac{g(m,n)}{h(m,n) \ast \hat f_k(m,n)}\right] \qquad (4.103)$$
which is satisfied when the iteration converges to a maximum value. The symbol "∗" indicates the convolution operator, and the nonlinearity is due to the ratio present in the functional. Lucy–Richardson deconvolution is an iterative procedure for recovering an estimate f̂(m, n) of an original image f(m, n) that has been blurred by a known point spread function h(m, n). Starting from the degraded image g(m, n) and from the knowledge of h(m, n), assuming a Poisson statistic for the degraded image (appropriate for the noise generated by image sensors) and an arbitrary initial image f̂_0 (which can be a constant image), the iterative process proceeds for a defined number of iterations. Like all iterative approaches, it is difficult to predict the number of iterations required to ensure the convergence of (4.103) to a local maximum. The optimal solution depends on the complexity of the PSF impulse response and on its size: for small sizes, convergence requires few iterations. A low number of iterations can produce unusable images, because they remain very blurry; on the contrary, an excessive number of iterations, in addition to the increase in computation required, determines a considerable increase in noise.
The latter generates annoying artifacts in the image, known as the ringing effect, which in some cases can be attenuated with ad hoc algorithms. The optimal number of iterations and the size of the PSF window are evaluated experimentally in relation to the type of image. Figure 4.36 shows the results of the LR filter applied to the usual test images, degraded by blurring and random noise of the same magnitude as in the previous figures. The best results are obtained by setting up to 20 iterations for both images, obtaining PSNR = 64 dB for the Lena image and PSNR = 60 dB for the cameraman image.
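The Lucy–Richardson iteration (4.103) can be sketched as follows (an illustrative implementation with circular, FFT-based convolutions; H is the transfer function of the known PSF, and the constant initial estimate and the small eps are assumptions).

```python
import numpy as np

def lucy_richardson(g, H, iters=20, eps=1e-12):
    """Lucy-Richardson deconvolution, Eq. (4.103), with convolutions carried out
    in the frequency domain; g must be nonnegative (Poisson-type data)."""
    f = np.full(g.shape, float(g.mean()))   # arbitrary constant initial estimate
    Hc = np.conj(H)                         # transfer function of the mirrored PSF h(-m,-n)
    for _ in range(iters):
        blurred = np.real(np.fft.ifft2(H * np.fft.fft2(f)))      # h * f_k
        ratio = g / (blurred + eps)                               # g / (h * f_k)
        correction = np.real(np.fft.ifft2(Hc * np.fft.fft2(ratio)))  # h(-m,-n) * ratio
        f = f * correction
    return f
```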
4.13 Blind Deconvolution

In the previous paragraphs, image reconstruction methods have been described assuming knowledge of the transfer function H that caused the blurring and, in several cases, knowledge of the power spectra P_f and P_η of the image and of the noise, respectively. In this section, we deal with the reconstruction problem when we have no knowledge of H and only the degraded image is available. This situation occurs, for example, for images acquired in astronomical applications, where the medium between the scene and the observer (telescope) is not known well enough to reliably evaluate the causes of degradation. In this case, a blind deconvolution approach is used to solve the general degradation Eq. (4.75). Different algorithms have been developed that start from an inaccurate hypothesis of the degradation model and, through an iterative process, tend to produce a good estimate of the original image from the observed degraded image. The most used algorithm is based on maximum likelihood estimation (MLE), which estimates the parameters of the degradation model by maximizing the likelihood function between the observed image and the one iteratively generated by the reconstruction model of the transfer function H. Figure 4.37 shows the functional scheme of the blind deconvolution algorithm, based on the parametric method of maximum likelihood estimation, which operates as follows: given the degraded image G and an initial hypothesis of H, we want to find, iteratively, the parameters θ that best model H, that is,
$$\theta_{ML} = \arg\max_{\theta}\; p(y\,|\,\theta) \qquad (4.104)$$
where y indicates an observation of H. At the first iteration, arbitrary parameters are assigned and a first version of the image is reconstructed with the Wiener filter, given the degraded image g and the current estimate of H. Subsequently, on the basis of the reconstructed image, new values of the parameters θ are calculated, and the process is repeated until a stopping criterion is satisfied, evaluating the quality of the reconstructed image with an energy-based measure such as MSE or PSNR. Figure 4.37 also shows the results of blind deconvolution for the two test images, defocused
Fig. 4.37 Functional scheme of blind deconvolution and results based on the algorithm for estimating the maximum likelihood by iterating the process up to 18 iterations
and corrupted by random noise (with variance 0.001). The PSF impulse response related to the degradation function estimated with the iterative process based on maximum likelihood estimation is also reported.
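As an illustration of the alternating structure of blind deconvolution, the following is a generic alternating Richardson–Lucy-type sketch, not the exact parametric ML algorithm of Fig. 4.37; the initial PSF guess, the inner iteration counts, and the stabilizing constant are assumptions.

```python
import numpy as np

def blind_deconvolve(g, psf_init, iters=18, inner=5, eps=1e-12):
    """Alternating blind deconvolution: the image and the PSF are updated in turn
    with Richardson-Lucy-type steps (psf_init is a full-size, nonnegative array)."""
    f = np.full(g.shape, float(g.mean()))
    h = psf_init / (psf_init.sum() + eps)

    def conv(a, B):                               # circular convolution via FFT
        return np.real(np.fft.ifft2(np.fft.fft2(a) * B))

    for _ in range(iters):
        Hf = np.fft.fft2(h)
        for _ in range(inner):                    # update the image, PSF fixed
            ratio = g / (conv(f, Hf) + eps)
            f = f * conv(ratio, np.conj(Hf))
        Ff = np.fft.fft2(f)
        for _ in range(inner):                    # update the PSF, image fixed
            ratio = g / (conv(h, Ff) + eps)
            h = h * conv(ratio, np.conj(Ff))
            h = h / (h.sum() + eps)               # keep the PSF normalized
    return f, h
```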
4.14 Nonlinear Diffusion Filter

The technique presented here attempts to solve the problem of important structures being removed during the filtering process. Consider the image f(x, y) to be filtered. A classic linear diffusion process is
$$\partial_t u = \nabla^2 u = \langle\nabla, \nabla u\rangle = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \qquad (4.105)$$
with ∂_t u ≡ ∂u/∂t and t the time. At time t = 0 we start from the image, i.e., u(x, y, 0) = f(x, y), while the successive approximations are given by
$$u(x, y, t) = (G_\sigma \ast f)(x, y), \qquad t > 0 \qquad (4.106)$$
with σ = √(2t). The main problem with this formulation is that the Gaussian filter is invariant to rotation and therefore smooths both the low-frequency content and the high-frequency information, where important structures such as edges or discontinuities need to be preserved. In other words, this formulation spreads the data in all directions, and we therefore need to move to nonlinear filters. The filter derives from the diffusion of a physical process, which balances differences in concentration without creating or destroying mass. This is formulated with the flow ψ given by
$$\psi = -D \cdot \nabla u \qquad (4.107)$$
where the concentration gradient ∇u causes a flow ψ that compensates for this gradient. The relation linking ψ and ∇u is described by the tensor D.² Mass can only be transported according to
$$\partial_t u = -\mathrm{div}(\psi) = \mathrm{div}(D \cdot \nabla u) \qquad (4.108)$$
which represents the diffusion equation. If D is a scalar, the filtering reduces to a simple smoothing and becomes linear. If D depends on the evolution of the image u, the resulting equation describes a nonlinear diffusion filter. If ψ and ∇u are not parallel, the diffusion is called anisotropic; if they are parallel, the diffusion is called isotropic and the diffusion tensor D can be replaced by a scalar. In the first version of the nonlinear filter proposed by Perona and Malik [4], it was suggested that
$$D = g(|\nabla u(x, y, t)|) \qquad (4.109)$$
with g(•) indicating a nonnegative, monotonically decreasing function with g(0) = 1 and g(∞) = 0, for example,
$$g(s) = \frac{1}{1 + (s/\lambda)^{1+\alpha}}, \qquad \alpha > 0 \qquad (4.110)$$
The flow modulus³ |ψ| = s·g(s) is not monotone: it increases for s < λ and decreases for s > λ. A good choice of the parameters of g allows smoothing (blurring) the gray levels in low-gradient areas while enhancing (sharpening) the important edges at the same time. The parameter λ serves as a threshold for the gradient: low gradients are diffused, while high gradients are treated as edges. Perona and Malik [5] have shown that their technique can be used both for restoration and for segmentation. The publications that followed the paper by Perona and Malik have been numerous. Catté [6] showed that the continuous Perona–Malik model is ill-posed, since the diffusivity g leads to a decreasing flow s·g(s) for some values of s, so that the scheme can act locally as the inverse of the heat equation, which is known to be ill-posed, developing singularities of any order in a short time. Although the filter has this severe drawback, a regularization factor and the use of implicit diffusion were introduced in [6]. In particular, the modulus s = |∇u| is replaced with an estimate of it, |∇(G_σ ∗ u)|, with G_σ a smoothing mask.⁴

² This diffusion tensor D is a square N × N matrix, different for each point u. Its role is to rotate and scale the gradient ∇u.
³ If we consider s = |∇u|, the formulation is consistent with the isotropic case of Eq. (4.107).
⁴ Note that the relationship (∇G_σ) ∗ u = ∇(G_σ ∗ u) holds; this expression will be indicated with ∇u_σ.
Fig. 4.38 Application of the bilateral filter: a and c are the images reconstructed from the filter applied to the images b and d with σr = 0.1 and σd = 1.5
With this modification of the model, we obtain existence and uniqueness of the solution for any σ > 0, given by
$$\partial_t u = \mathrm{div}\big(g(|\nabla u_\sigma|)\cdot\nabla u\big) \qquad (4.111)$$
which represents a corrected version of the nonlinear spatial filter. In this way, the problem of diffusion in the vicinity of an edge is solved; however, the noise close to the edge is left intact. This is because the filter treated so far is isotropic, so attention must turn to the anisotropic one, in which the flow is generally not parallel to the image gradient.
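A minimal explicit scheme for the nonlinear diffusion (4.111) with the diffusivity (4.110) is sketched below (an assumed four-neighborhood discretization; the time step, λ, α, and the number of iterations are illustrative and depend on the intensity range of the image).

```python
import numpy as np

def perona_malik(u0, n_iter=50, dt=0.2, lam=0.1, alpha=1.0):
    """Explicit Perona-Malik-type scheme for Eq. (4.111) with the diffusivity
    g(s) = 1 / (1 + (s/lambda)^(1+alpha)) of Eq. (4.110); assumes u0 in [0, 1]."""
    def g(s):
        return 1.0 / (1.0 + (np.abs(s) / lam) ** (1.0 + alpha))

    u = u0.astype(float)
    for _ in range(n_iter):
        # finite differences toward the four neighbours (periodic borders via roll)
        dN = np.roll(u, -1, axis=0) - u
        dS = np.roll(u, 1, axis=0) - u
        dE = np.roll(u, -1, axis=1) - u
        dW = np.roll(u, 1, axis=1) - u
        # discrete divergence of g(|grad u|) * grad u
        u = u + dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return u
```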
4.15 Bilateral Filter

The bilateral filter is nonlinear and considers both spatial and radiometric information (i.e., information on the pixel-intensity dynamics) [7,8]. The filter uses a kernel composed of a spatial component, which establishes a relationship based on the distance between the pixels being processed (spatial amplitude), and a component in terms of pixel-intensity dynamics, which relates pixels with different gray-level configurations (photometric amplitude). The bilateral filter leaves the contours unchanged while suppressing the noise. Given an input image f(x, y) to be filtered, the resulting image g(x, y) is given by
$$g(x,y) = \frac{\sum_{k,l} f(k,l)\, w(x,y;k,l)}{\sum_{k,l} w(x,y;k,l)} \qquad (4.112)$$
where the function w(x, y; k, l) is the composition of a spatial filter that smooths according to the coordinate differences (normally a Gaussian function), given by
$$d(x,y;k,l) = \exp\left(-\frac{(x-k)^2 + (y-l)^2}{2\sigma_d^2}\right) \qquad (4.113)$$
Fig. 4.39 Results obtained with the dehazing filter: a and c the input images; b and d the reconstructed images
and of a filter applied to smooth the differences in intensity (normally a Gaussian function as well), given by
$$r(x,y;k,l) = \exp\left(-\frac{\|f(x,y) - f(k,l)\|^2}{2\sigma_r^2}\right) \qquad (4.114)$$
which gives more weight to pixels with similar intensity levels and less weight to those with larger intensity differences; this serves to preserve the edges. The composite filter is therefore given by
$$w(x,y;k,l) = \exp\left(-\frac{(x-k)^2 + (y-l)^2}{2\sigma_d^2} - \frac{\|f(x,y) - f(k,l)\|^2}{2\sigma_r^2}\right) \qquad (4.115)$$
where σ_d and σ_r are the smoothing parameters, and f(x, y) and f(k, l) are, respectively, the intensity of the pixel being processed, with coordinates (x, y), and that of the neighboring pixel with coordinates (k, l) included in the filter window. Figure 4.38 shows an application of the bilateral filter on two different types of images.
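A brute-force sketch of the bilateral filter (4.112)–(4.115) follows (assuming a float image normalized to [0, 1], so that σ_r = 0.1 is meaningful; the window radius and the edge padding are assumptions).

```python
import numpy as np

def bilateral_filter(f, sigma_d=1.5, sigma_r=0.1, radius=3):
    """Brute-force bilateral filter, Eqs. (4.112)-(4.115), on a 2-D float image."""
    H, W = f.shape
    out = np.zeros_like(f)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    d = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_d ** 2))     # spatial kernel, Eq. (4.113)
    fp = np.pad(f, radius, mode="edge")
    for y in range(H):
        for x in range(W):
            patch = fp[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            r = np.exp(-((patch - f[y, x]) ** 2) / (2.0 * sigma_r ** 2))  # range kernel, Eq. (4.114)
            w = d * r                                            # composite weights, Eq. (4.115)
            out[y, x] = np.sum(w * patch) / np.sum(w)            # normalized average, Eq. (4.112)
    return out
```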
4.16 Dehazing

The term dehazing denotes the image processing techniques that address the problem of image sharpness when the environmental conditions are adverse (fog, snow, rain, etc.). In the light–matter interaction, reflection on adverse days generates the phenomenon of scattering, i.e., the disordered reflection of light caused by water vapor particles suspended in the air, which produces a degraded image. Sharpness, contrast, and chromatic variation all modify the quality of the image, and the degradation increases with increasing distance between the image plane and the object. The basic idea is to exploit the differences between multiple images of the same scene acquired under different atmospheric conditions (see Fig. 4.39). In [9], a polarizing filter was used with two images of the same scene taken with different degrees of polarization. In [10], several images of the same scene under different weather conditions were used. More recently, the focus has been on developing dehazing techniques that use a single image [11–14].
References

1. H.C. Andrews, B.R. Hunt, Digital Image Restoration (Prentice Hall, Upper Saddle River, 1977)
2. R. Hufnagel, N.L. Stanley, Modulation transfer function associated with image transmission through turbulent media. Opt. Soc. Am. 54(1), 52–61 (1964)
3. A.K. Jain, Fundamentals of Digital Image Processing, 1st edn. (Prentice Hall, Upper Saddle River, 1989). ISBN 0133361659
4. P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, in Proceedings, IEEE Computer Society Workshop on Computer Vision (1987), pp. 16–27
5. P. Perona, J. Malik, Scale-space and edge-detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12(7), 629–639 (1990)
6. F. Catté, P.-L. Lions, J.-M. Morel, T. Coll, Image selective smoothing and edge detection by nonlinear diffusion. SIAM J. Numer. Anal. 29(1), 182–193 (1990)
7. F. Durand, J. Dorsey, Fast bilateral filtering for the display of high-dynamic range images. ACM Trans. Graph. (2002)
8. C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in Proceedings of the International Conference on Computer Vision (1998), pp. 836–846
9. Y.Y. Schechner, S.G. Narasimhan, S.K. Nayar, Instant dehazing of images using polarization, in IEEE Proceedings of Computer Vision and Pattern Recognition Conference (2001), pp. 325–332
10. S.G. Narasimhan, S.K. Nayar, Contrast restoration of weather degraded images. IEEE Trans. Pattern Anal. Mach. Intell. 25(6), 713–724 (2003)
11. R. Fattal, Single image dehazing, in SIGGRAPH (2008), pp. 1–9
12. J. Sun, K. He, X. Tang, Single image haze removal using dark channel prior, in IEEE Proceedings of Computer Vision and Pattern Recognition Conference (2009), pp. 1956–1963
13. L. Kratz, K. Nishino, Factorizing scene albedo and depth from a single foggy image, in Preprint in ICCV (2009)
14. R.T. Tan, Visibility in bad weather from a single image, in IEEE Proceedings of Computer Vision and Pattern Recognition Conference (2008), pp. 1–8
5 Image Segmentation
5.1 Introduction

Generally, an image contains several observed objects of the scene. For example, in a robotized assembly cell, the vision system must automatically acquire and recognize the various objects to be assembled. In this context, a first goal of the vision system is to divide the image into different regions, each of which includes only one of the observed objects to be assembled. The interest, in this application, is to have an algorithm available that isolates the various objects in the image. The process of dividing the image into homogeneous regions, where all the pixels that correspond to an object are grouped together, is called segmentation. The grouping of pixels into regions is based on a criterion of homogeneity that distinguishes them from one another. The criteria can, for example, be values of similarity of attributes (color, texture, etc.) or values of spatial proximity (Euclidean distance, etc.). The principles of similarity and proximity are motivated by the simple consideration that homogeneous regions derive from the projection of points of an object into the image, represented by pixels that are spatially close together and have similar gray-level values. These assumptions are not always valid, and in different applications it is necessary to integrate other information related to prior knowledge of the application context (application domain); the grouping of the pixels in this latter case is based on comparing the hypothesized regions with regions modeled a priori. Two important clarifications must be made:

1. The segmentation of the image does not imply the classification of the regions. The segmentation algorithm only partitions the image into homogeneous regions, and no information is provided to recognize the associated objects, much less their relationships.
2. The segmentation of the image is not based on a particular physical-mathematical theory. Consequently, there are different algorithms available, many of which are
not exhaustive and based on a heuristic approach, using ad hoc methods for the various applications.

The segmentation algorithms can be grouped as follows:

(a) Knowledge based. Algorithms based on the global or partial knowledge of the image. Knowledge is represented, for example, by the histograms of some characteristics of the image.
(b) Edge-based. Algorithms based on object contour detection.
(c) Region-based. Algorithms based on the detection of the regions associated with objects.

The edge-based and region-based algorithms are characterized in turn by the characteristics, which we will call features (luminance, texture, movement, color, etc.), used for the extraction of edges or regions. For an ideal image, edge-based and region-based algorithms operate in a dual way and produce the same results. In theory, closed contours can be obtained from regions using contour tracking algorithms (also called border following). Similarly, the regions can be obtained from the contours using region filling algorithms. In real images it is difficult to derive unambiguously the exact contours of the regions and vice versa. Because of noise, neither edge-based nor region-based algorithms alone provide a correct segmentation of the image. Acceptable results can be derived by combining the partial results of different algorithms and integrating the prior knowledge of the appropriately modeled scene.
5.2 Regions and Contours

The set of connected pixels representing a particular feature or property of objects is called a region of an image. An image can contain several regions that represent a particular property of a single complex object, or properties associated with various objects in the scene. Due to errors in the image, the correspondence between the regions and what they represent (the objects or parts of them) is not always correct. A contour is an ordered list of edges. The set of pixels that delimit a region constitutes a closed contour. The boundaries of a region, however, are not always closed.
5.3 The Segmentation Process

The segmentation, or the process of grouping pixels into regions, can be defined as a method that partitions an image f(i, j) into regions R1, R2, ..., Rn that meet the following conditions:

1. ∪_{k=1}^{n} Rk = f(i, j), i.e., the whole image is partitioned into n regions (exhaustive partition).
2. Each region Ri satisfies a defined predicate (or criterion) of homogeneity P(Ri), that is: P(Ri) = True ∀ i = 1, ..., n.
3. Each region Ri is spatially connected.
4. Pixels belonging to adjacent regions Ri and Rj, when considered together, do not satisfy the predicate: P(Ri ∪ Rj) = False for every pair of adjacent regions among the n regions.
5. {Ri} is an exclusive partition, namely: Ri ∩ Rj = ∅ with i ≠ j.

The homogeneity predicate P(·) indicates the conformity of all the pixels of a region Ri to a particular model of the region itself.
5.3.1 Segmentation by Global Threshold

The process of converting the gray levels of an image f(i, j) into a binary image g(i, j) is the simplest method of segmentation:

g(i, j) = 1 if f(i, j) < S, 0 if f(i, j) ≥ S    for dark objects    (5.1)

g(i, j) = 1 if f(i, j) > S, 0 if f(i, j) ≤ S    for light objects    (5.2)

where S is the gray-level threshold. In this way we have g(i, j) = 1 for the pixels belonging to the objects and g(i, j) = 0 for the pixels belonging to the background. The segmentation realized with (5.1) and (5.2) is called segmentation with global threshold, since the same threshold is applied to the whole image and depends exclusively on the gray level f(i, j). Segmentation through a gray-level threshold is motivated by the observation that the objects of the scene, projected into homogeneous areas (regions) of the image, are characterized by an almost constant reflectivity (or light absorption). Generally, one range of gray levels is associated with the objects, while another is associated with the background. A separating gray level (also called threshold) between the two intervals is calculated in order to assign the value 1 to the pixels with values in the first interval (i.e., the objects, or foreground) and the value 0 to the pixels with values in the second interval (i.e., the background). In this way we obtain the binary image g(i, j), which corresponds to the segmentation with threshold. Figure 5.1c shows a simple example of segmentation that binarizes the image, isolating the dark object from the light background with the global threshold S = 100.
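As an illustration, the following is a minimal NumPy sketch of the binarization rules (5.1) and (5.2); the image array and the threshold value used in the usage comment are assumptions of the example, not prescribed by the text.

```python
import numpy as np

def global_threshold(f, S, dark_objects=True):
    """Binarize a gray-level image f with a global threshold S.

    dark_objects=True  implements (5.1): pixels darker than S become 1 (object).
    dark_objects=False implements (5.2): pixels lighter than S become 1 (object).
    """
    f = np.asarray(f)
    if dark_objects:
        return (f < S).astype(np.uint8)   # 0 where f >= S, 1 where f < S
    return (f > S).astype(np.uint8)       # 0 where f <= S, 1 where f > S

# Hypothetical usage, mirroring the example of Fig. 5.1c:
# g = global_threshold(image, S=100, dark_objects=True)
```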
Fig. 5.1 Segmentation with global threshold: a Original image; b Histogram; c Image (a) segmented with threshold S = 100 according to (5.1); d Correct segmentation with the known thresholds S1 = 60 and S2 = 125 according to (5.3)
If the interval of gray levels [S1, S2] associated with the objects is known, then the binary image is obtained as follows (see Fig. 5.1d):

g(i, j) = 1 if S1 ≤ f(i, j) ≤ S2, 0 otherwise    (5.3)

If several (disjoint) intervals of gray levels [Sk, Sk+1], associated with the corresponding objects Ok, are known, a more general segmentation is obtained:

g(i, j) = k if f(i, j) ∈ [Sk, Sk+1], 0 otherwise    (5.4)

for k = 1, ..., n objects. For example, microscope images for blood analysis are normally segmented with the previous equation, where a given gray-level interval is associated with the cytoplasm, the background has higher gray levels, and the cell nuclei are darker (see Fig. 5.2). Segmentation by threshold can also be used to extract the edges corresponding to the objects in the scene. Under the hypothesis that these objects are dark with respect to a lighter background, it can be assumed that a range of gray levels includes only levels lying between the background and the boundaries of each object. If we indicate with S the interval that includes only the gray levels of the contours of the objects, we obtain the following segmented image:

g(i, j) = 1 if f(i, j) ∈ S, 0 otherwise    (5.5)

A partition of the image into n levels can be obtained if a given value is associated with each gray-level interval k:

g(i, j) = k if f(i, j) ∈ [Sk, Sk+1], 0 otherwise    (5.6)

for k = 1, ..., n objects. The thresholds considered above are interactively chosen by the user. Alternatively, the thresholds can be calculated automatically based on prior knowledge, for example, the number of objects in the image, their size and the distribution of the gray levels of the objects with respect to the background.
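A hedged sketch of the interval-based rules (5.3) and (5.4) follows; the interval values in the usage comment are taken from the caption of Fig. 5.2 and are otherwise illustrative assumptions.

```python
import numpy as np

def threshold_interval(f, S1, S2):
    """Eq. (5.3): 1 where S1 <= f <= S2, 0 elsewhere."""
    f = np.asarray(f)
    return ((f >= S1) & (f <= S2)).astype(np.uint8)

def threshold_multi(f, intervals):
    """Eq. (5.4): label k where f falls in the k-th interval, 0 elsewhere.

    `intervals` is a list of (low, high) pairs, one per object; they are
    assumed to be disjoint, as required by the text.
    """
    f = np.asarray(f)
    g = np.zeros(f.shape, dtype=np.int32)
    for k, (lo, hi) in enumerate(intervals, start=1):
        g[(f >= lo) & (f <= hi)] = k
    return g

# e.g. isolate nuclei and cell bases with the assumed intervals of Fig. 5.2:
# g = threshold_multi(image, [(10, 75), (150, 230)])
```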
Fig. 5.2 Segmentation with different intervals of separation of gray levels: a Original image; b Histogram; c Isolated central nuclei of the cells with the gray-level range (10–75); d Isolated base of the cells with the interval (150–230)
5.4 Segmentation Methods by Local Threshold

Different segmentation methods have been developed for the automatic calculation of the threshold, by analyzing the distribution of the gray levels of the image and the a priori knowledge of the objects of interest. In the segmentation with global threshold, only the frequency information of the gray levels of f(i, j) is used, and the selected threshold is influenced by the size of the objects with respect to the background. The method is very simple, with low computational cost. If the threshold S can be determined also using local information p(i, j), the segmentation improves; this is furthermore necessary in conditions of non-uniform illumination and in the presence of noise. The type of local information p(i, j) characterizes the various methods of segmentation with local threshold that will be described in the following sections. A dynamic threshold should be considered not only in conditions of non-uniform lighting but above all when the environmental conditions change, dynamically altering the lighting.
5.4.1 Method Based on the Objects/Background Ratio

This method is based on a priori knowledge of the percentage P of the image area occupied by the object to be highlighted with respect to the background. This situation occurs often in different applications, and the distribution of gray levels appears as in Fig. 5.3 in the context of binarizing the image. A typical application, in which
Fig. 5.3 Typical histogram when two objects are well separated
Fig. 5.4 Segmentation with the calculation of the threshold based on the knowledge of the relationship between the area occupied by the objects and the background: a Image with the text at gray levels; b Histogram; c Binarized image estimating the percentage P = 14% of the character area corresponding to a threshold S = 200
there is a prior estimate of the area of the objects of interest, is the digitization of printed pages of text. In this case the objects, that is, the characters, are black compared to the white background. The expected histogram is of the bimodal type, with a dominance of the lighter pixels belonging to the background. The choice of the percentage P of the area corresponding to the darker gray levels (i.e., the characters) influences the results of the segmentation. A low value produces a segmentation with partially or completely erased characters, while a high value produces a segmentation with characters that are artifacted and distorted with respect to their original shape (see Fig. 5.4).
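One possible implementation of this idea is to read the threshold off the cumulative histogram, so that the requested percentage P of pixels falls on the object side. The function below is a sketch under that interpretation, assuming an 8-bit image; P and the image are assumptions of the example.

```python
import numpy as np

def p_tile_threshold(f, P, dark_objects=True):
    """Choose the threshold so that about P percent of the pixels are assigned
    to the objects (e.g. the dark characters of the printed-text example)."""
    hist, _ = np.histogram(f, bins=256, range=(0, 256))   # 8-bit image assumed
    cum = np.cumsum(hist) / hist.sum()                    # cumulative fraction
    if dark_objects:
        # smallest gray level whose cumulative fraction reaches P/100
        return int(np.searchsorted(cum, P / 100.0))
    return int(np.searchsorted(cum, 1.0 - P / 100.0))

# Hypothetical usage; for the image of Fig. 5.4 the text reports that
# P = 14% corresponds to a threshold around S = 200:
# S = p_tile_threshold(image, P=14)
# g = (image < S).astype(np.uint8)
```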
5.4.2 Method Based on Histogram Analysis

In some applications it is effective to analyze the gray-level histogram of an image in order to calculate appropriate thresholds, identifying with good approximation the gray-level ranges of the background (dark or light) and of the objects in the scene. The histogram shown in Fig. 5.5a presents two peaks, corresponding respectively to an object of the scene, with a distribution of gray levels approximated by a Gaussian (μ1, σ1), and to the background, also with a Gaussian distribution (μS, σS). The variance of the distributions in the two figures depends substantially on the noise, on the brightness
Fig. 5.5 Segmentation by histogram analysis
of the object with respect to the background, and on the lighting conditions, which can generate more or less accentuated shadows that confuse the object with the background. If the two peaks are well accentuated and separated, and only a few pixels fall in the valley of the histogram, the segmentation of the image is not much influenced by the value of the selected threshold (see Fig. 5.5a). In this case the automatic calculation of the threshold is easily carried out by analyzing the histogram H(l), from which the local maxima H(l0) and H(ls) and the minimum H(S) are calculated. S is the value of the optimal threshold, estimated between the gray levels l0 and ls corresponding respectively to the peak of the object and to that of the background. The distance between the peaks is not considered. This method can be generalized for n objects with Gaussian distributions of gray levels (μ1, σ1), ..., (μn, σn) and with the background (μS, σS). In this case we need to look for n thresholds S1, ..., Sn by analyzing the histogram H(l) of the image, which will present n + 1 peaks (including the background peak) and n valleys (see Fig. 5.5b).
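One way to automate the valley search is sketched below: the histogram is smoothed, the two most populated local maxima are taken as the object and background peaks l0 and ls, and the threshold S is the minimum between them. The smoothing sigma is an assumption used only to suppress spurious local maxima, and the sketch presumes an 8-bit image with a genuinely bimodal histogram.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def valley_threshold(f, smooth_sigma=3.0):
    """Pick the threshold at the histogram minimum between the two main peaks."""
    hist, _ = np.histogram(f, bins=256, range=(0, 256))
    h = gaussian_filter1d(hist.astype(float), smooth_sigma)   # smoothed histogram
    # local maxima of the smoothed histogram
    peaks = [l for l in range(1, 255) if h[l] >= h[l - 1] and h[l] >= h[l + 1]]
    # keep the two most populated peaks (object and background), in gray-level order
    l0, ls = sorted(sorted(peaks, key=lambda l: h[l])[-2:])
    # the threshold S is at the minimum of the valley between the two peaks
    return l0 + int(np.argmin(h[l0:ls + 1]))
```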
5.4.3 Method Based on the Gradient and Laplacian

In applications where the distributions of the gray levels of the object and of the background overlap, the segmentation of the image is strongly influenced by the choice of the threshold. The histogram does not present a perfectly bimodal pattern, and consequently any choice of the threshold S identifies pixels that may belong either to the object or to the background. This is determined by the intrinsic nature of the objects (they exhibit a non-uniform reflectance), which with the information of f(i, j) alone are not easily separable from the background, especially in the presence of noise. In analogy to the problems encountered for restoration, also for segmentation the noise, together with the non-uniform reflectance, influences the shape of the histograms. To minimize this inconvenience, one can use the gradient and Laplacian information of the image, based on the analysis of the pixels that lie on the border between object and background. In this way we use the information of the gradient (described in Sect. 1.12) to check whether the pixel is a contour element (high value of the gradient)
or not, while the information of the Laplacian (see Sect. 1.12) indicates locally the transition of the gray levels, i.e., whether, with respect to the contour, we are on the object side or on the background side. With this strategy, segmentation no longer depends on the size of the objects relative to the background. The segmentation procedure, with this approach, simultaneously analyzes the information of the gradient ∇f(i, j) and of the Laplacian ∇²f(i, j), obtaining 3 information levels of the image s(i, j) as follows:

s(i, j) = 0  if ∇f(i, j) < S
s(i, j) = +  if ∇f(i, j) ≥ S and ∇²f(i, j) ≥ 0
s(i, j) = −  if ∇f(i, j) ≥ S and ∇²f(i, j) < 0    (5.7)

where S is the global gradient threshold and 0, +, − are three distinct gray levels, representing the local information derived from the Laplacian. For dark objects on a light background, in s(i, j) we have the following:

(a) All the non-contour pixels are assigned a value of zero, that is, they satisfy the first condition in (5.7).
(b) All the pixels of the contour that satisfy the gradient condition, i.e., the first part of the other two conditions of (5.7), take the plus sign if they are on the dark side and the minus sign if they are on the light side of the contour itself.

For light objects on a dark background, in s(i, j) we have the following result:

(a) All the non-contour elements are still assigned a value of zero, that is, they satisfy the first condition in (5.7).
(b) All contour pixels that satisfy the gradient condition, that is, the first part of the other two conditions of (5.7), will have the reversed sign, that is, the minus sign if they are on the dark side and the plus sign if they are on the light side of the same contour.

Under the hypothesis of a dark object on a light background, (5.7) produces a map s(i, j) in which each line reproduces a sequence of symbols with the following characteristic:

(···)(−, +)(0 or +)(+, −)(···)
where the sequence of symbols "−+" indicates the transition from light to dark, while the sequence "+−" indicates the transition from dark to light (see Fig. 5.6).
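A sketch of (5.7) is given below, using Sobel operators for the gradient module and ndimage.laplace for the Laplacian; these particular operators are an assumption, since any estimate of ∇f and ∇²f would do. The three symbols 0, +, − are encoded as 0, 1, −1.

```python
import numpy as np
from scipy import ndimage

def gradient_laplacian_map(f, S):
    """Three-level map s(i,j) of Eq. (5.7).

    Returns 0 for non-contour pixels, +1 where the gradient exceeds S and the
    Laplacian is non-negative, -1 where the gradient exceeds S and the
    Laplacian is negative.
    """
    f = np.asarray(f, dtype=float)
    gx = ndimage.sobel(f, axis=1)
    gy = ndimage.sobel(f, axis=0)
    grad = np.hypot(gx, gy)                  # gradient module |grad f|
    lap = ndimage.laplace(f)                 # Laplacian of f
    s = np.zeros(f.shape, dtype=np.int8)     # 0: non-contour pixels
    s[(grad >= S) & (lap >= 0)] = 1          # '+' in the notation of (5.7)
    s[(grad >= S) & (lap < 0)] = -1          # '-' in the notation of (5.7)
    return s
```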
5.4.4 Method Based on Iterative Threshold Selection

For those applications where the image is not known, the segmentation threshold S can be calculated automatically by evaluating the statistical parameters of the gray-level distributions of the object and background pixels. The method is based on
Fig. 5.6 Extraction of texts through segmentation with global threshold based on gradient and local information based on Laplacian: a and c Original images with text; b and d result of the segmentation applying the (5.7) and the thresholds 0.1 and 0.15 respectively on the gradient image normalized between 0 and 1
the approximation of the histogram of the image by the weighted average of two probability densities with normal distribution. From the histogram of the image H(l) we derive the normalized histogram given by:

p(l) = H(l) / Σ_{i=0}^{255} H(i)    for l = 0, ..., 255    (5.8)

where p(l) represents the probability of occurrence of each gray level l. For each value of the threshold S, two pixel groups are identified, respectively of the object and of the background, whose gray-level histograms are approximated with two normal distributions N(l, μ0, σ0) and N(l, μS, σS). The threshold S is chosen as the gray level corresponding to the minimum probability p(S) between the peaks of the two normal distributions; consequently the segmentation has the minimum error, i.e., the minimum number of pixels not belonging to the object or to the background. A simple algorithm for calculating the automatic threshold in an iterative way is the following:

1. Select an initial value of the threshold S corresponding to the average gray level of the entire image.
2. At the t-th step, and in correspondence with the threshold S evaluated previously, calculate the averages μ_S^t and μ_O^t of the background and of the object:

   μ_S^t = Σ_{(i,j)∈Background} f(i, j) / (number of background pixels)        μ_O^t = Σ_{(i,j)∈Object} f(i, j) / (number of object pixels)    (5.9)

3. Calculate the new threshold S:

   S = (μ_S^t + μ_O^t) / 2    (5.10)

   which defines a new background/object partition.
4. Terminate the algorithm if μ_S^t and μ_O^t are unchanged from iteration (t − 1); otherwise go back to step 2.
Fig. 5.7 Iterative method for the calculation of the automatic threshold
This iterative method assumes that the image is to be segmented into two regions. Figure 5.7 shows the result of the segmentation used to extract the characters of a typescript. The algorithm converges after 5 iterations, calculating the threshold S = 154.
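A compact sketch of the iterative procedure above follows, assuming dark objects on a light background; the convergence tolerance eps is an assumption, standing in for the "unchanged averages" test of step 4.

```python
import numpy as np

def iterative_threshold(f, eps=0.5):
    """Iterative threshold selection (sketch of Sect. 5.4.4): start from the
    global mean and repeatedly average the object and background means
    until the threshold stops changing."""
    f = np.asarray(f, dtype=float)
    S = f.mean()                             # step 1: initial threshold
    while True:
        mu_obj = f[f < S].mean()             # mean of the (dark) object pixels
        mu_bg = f[f >= S].mean()             # mean of the background pixels
        S_new = 0.5 * (mu_obj + mu_bg)       # step 3, Eq. (5.10)
        if abs(S_new - S) < eps:             # step 4: stop when stable
            return S_new
        S = S_new
```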
5.4.5 Method Based on Inter-class Maximum Variance - Otsu

The difficulty of the previous method lies in estimating the parameters of the normal distributions and in the assumption that the gray levels have a probability density approximated by the normal distribution. These difficulties are partly mitigated if the optimal threshold is chosen as the one that maximizes the variance between the gray levels of the object and of the background (the two classes of interest of the image). In other words, a functional based on the analysis of variance is defined to evaluate the separation of the classes in statistical terms. Figure 5.8b shows the typical histogram of an image with the distribution of the gray levels H(l) calculated with (5.8), where the two classes relative to the object and to the background are highlighted with the overlapping Gaussian curves of the probability distribution of occurrence of the pixels p(l), l = 0, ..., L − 1, where L indicates the number of gray levels present in the image. The objective is to determine a threshold S separating the gray levels of the two classes which maximizes the inter-class variance. The statistical measure that best characterizes the homogeneity of the classes is the variance. A high homogeneity of the pixels of a class implies a low value of the variance and, vice versa, a high value of the variance implies a low homogeneity of a class. Now suppose we choose a threshold S, thus dividing the gray levels of the image into two classes: C1, the set of pixels with gray levels between 0 and S, and C2, the remaining set of pixels with levels between S + 1 and L − 1. In correspondence of a threshold S, the probability that a pixel belongs to the class C1 (for example the object) or to the class C2 (for example the background) is given by:

P1(S) = Σ_{l=0}^{S} p(l)        P2(S) = Σ_{l=S+1}^{L−1} p(l) = 1 − P1(S)    (5.11)
Fig. 5.8 Segmentation with optimal threshold based on the maximum inter-class variance (Otsu method): a Original image; b Histogram of the image in (a); c Segmentation into two classes with global threshold S = 129 and separability criterion σ B2 (S)/σG2 = 0.922; d Segmentation in three classes finding an additional threshold (S1 = 128 and S2 = 221 and separability criterion 0.954) in the class represented by the right Gaussian in the histogram in (b)
where P1 and P2 are the a priori probabilities that the pixels belong to the object (class C1) or to the background (class C2), respectively. With μ1(S) and μ2(S) we indicate the averages of the first and second classes generated by the threshold S, respectively defined below:

μ1(S) = Σ_{l=0}^{S} l · P(l|C1) = Σ_{l=0}^{S} l · P(C1|l)P(l) / P(C1) = (1 / P1(S)) Σ_{l=0}^{S} l · p(l)    (5.12)

μ2(S) = Σ_{l=S+1}^{L−1} l · P(l|C2) = Σ_{l=S+1}^{L−1} l · P(C2|l)P(l) / P(C2) = (1 / P2(S)) Σ_{l=S+1}^{L−1} l · p(l)    (5.13)

where P(C1|l) = 1, P(l) = p(l) and P(C1) = P1, and similarly for the class C2. The global average μG of the whole image is given by:

μG = Σ_{l=0}^{L−1} l · p(l)    (5.14)

The global variance σG² of the image is given by the following:

σG² = Σ_{l=0}^{L−1} (l − μG)² p(l)    (5.15)
The pixel variances related to the two classes C1 and C2 are calculated with the following equations:

σ1²(S) = (1 / P1(S)) Σ_{l=0}^{S} [l − μ1(S)]² p(l)        σ2²(S) = (1 / P2(S)) Σ_{l=S+1}^{L−1} [l − μ2(S)]² p(l)    (5.16)

which depend only on the threshold S. The method proposed by Otsu for the automatic and optimal calculation of the threshold S consists in minimizing the intra-class variance (within class) σW² of the two sets of pixels generated by the threshold S, which is proved to correspond to maximizing the inter-class variance (between class) σB². In other words, the optimal choice of the threshold consists in having a minimum variance within the classes, which corresponds to maximizing the difference between the averages of the two classes. The intra-class variance σW²(S) is defined as the weighted sum of the variances of the classes, with the relative probabilities of belonging of a pixel to the classes C1 and C2, and is obtained with the following:

σW²(S) = P1(S)σ1²(S) + P2(S)σ2²(S)    (5.17)

Using (5.17) we could analyze the entire dynamics of the L gray levels and find the value of the threshold S which minimizes the intra-class variance σW²(S). Instead, it is convenient to proceed with an approach that optimizes the calculation of S. This is possible by finding a functional that relates the global variance σG², which is independent of S, with the two variances σW² and σB². This will then allow us to find the optimal threshold S through a faster iterative procedure. Rewriting (5.15) with some artifice we get the following:

σG² = Σ_{l=0}^{S} [l − μ1(S) + μ1(S) − μG]² p(l) + Σ_{l=S+1}^{L−1} [l − μ2(S) + μ2(S) − μG]² p(l)
    = Σ_{l=0}^{S} {[l − μ1(S)]² + 2[l − μ1(S)][μ1(S) − μG] + [μ1(S) − μG]²} p(l)
    + Σ_{l=S+1}^{L−1} {[l − μ2(S)]² + 2[l − μ2(S)][μ2(S) − μG] + [μ2(S) − μG]²} p(l)    (5.18)

Considering that in (5.18) the following expressions are null:

Σ_{l=0}^{S} [l − μ1(S)][μ1(S) − μG] p(l) = 0        Σ_{l=S+1}^{L−1} [l − μ2(S)][μ2(S) − μG] p(l) = 0

Equation (5.18) reduces to the following:

σG² = Σ_{l=0}^{S} [l − μ1(S)]² p(l) + [μ1(S) − μG]² P1(S) + Σ_{l=S+1}^{L−1} [l − μ2(S)]² p(l) + [μ2(S) − μG]² P2(S)
    = [P1(S)σ1²(S) + P2(S)σ2²(S)] + {P1(S)[μ1(S) − μG]² + P2(S)[μ2(S) − μG]²}    (5.19)

It is observed that in the last expression of (5.19) the first term [•] represents the intra-class variance previously defined by (5.17), while the second term {•} defines
the inter-class variance, indicated with σB²(S). The latter represents the weighted sum of the squares of the distances between the average of each class and the total average. The total average can be written as follows:

μG = P1(S)μ1(S) + P2(S)μ2(S)    (5.20)

Equation (5.19) can be simplified by eliminating μG; remembering that P2(S) = 1 − P1(S) and substituting, we get:

σG² = σW²(S) + P1(S)[1 − P1(S)][μ1(S) − μ2(S)]²    (5.21)

where the second term is the inter-class variance σB²(S). We have thus found the functional that relates the global variance σG² (which is independent of S) with the variances σW²(S) and σB²(S). In particular, (5.21) states that for any value of S the global variance is the sum of the intra-class variance and the inter-class variance, where the latter is the weighted sum (by the probabilities of the classes) of the squared distances between the class averages and the global average (Eq. 5.19). Analyzing Eq. (5.21), we have that the global variance σG² remains constant for each chosen value of S, and the variation of S in the interval of the L gray levels has the effect of modifying the contributions of the two variances σW²(S) and σB²(S), in the sense that while one increases the other decreases and vice versa. It follows that minimizing the intra-class variance corresponds to maximizing the inter-class variance. Choosing instead to maximize the inter-class variance, we have the advantage of being able to calculate the quantities of σB²(S) through an iterative procedure examining the entire dynamics of the gray levels. From (5.21) we can extract the expression of the inter-class variance to be maximized, which we can report in the following form:

σB²(S) = P1(S)[1 − P1(S)][μ1(S) − μ2(S)]² = [μG P1(S) − P1(S)μ1(S)]² / (P1(S)[1 − P1(S)])    (5.22)

The optimal threshold S* is determined by analyzing the values of σB²(S), calculated with (5.22) for the L gray levels, so as to satisfy the following:

σB²(S*) = max_{0 ≤ S ≤ L−1} σB²(S)    (5.23)

In summary, the Otsu algorithm calculates the optimal threshold based exclusively on the histogram of the image H(l), from which, for S = 0, 1, ..., L − 1, the probabilities P1 and P2 are determined with equations (5.11), then the averages μ1, μ2 and μG, the inter-class variance σB² with (5.22), and the threshold S with (5.23). If the maximum is not unique, the average value of the maxima found is chosen as the optimal threshold. In some applications, this method was used for
segmentation with multiple classes. In the hypothesis of 3 classes, the inter-class variance in the form indicated in (5.19) becomes:

σB²(S1, S2) = P1(S1, S2)[μ1(S1, S2) − μG]² + P2(S1, S2)[μ2(S1, S2) − μG]² + P3(S1, S2)[μ3(S1, S2) − μG]²    (5.24)

where P3 and μ3 respectively indicate the probability and the average of the third class, associated with the two thresholds S1 and S2. The corresponding optimal thresholds are then calculated, over the L gray levels, so as to satisfy the following:

σB²(S1*, S2*) = max_{0 ≤ S1 ≤ S2 ≤ L−1} σB²(S1, S2)    (5.25)
Figure 5.8c shows the results of this segmentation method applied to an image that presents two dominant classes, the cytoplasm objects (the left Gaussian in the figure) and the background (the right Gaussian). The algorithm was then applied a second time to the class of pixels with the larger population, thus obtaining a third class (see Fig. 5.8d). The Otsu method has the disadvantage of assuming that the histogram is bimodal. Furthermore, if the two classes of pixels are very different in size, there can be several maxima of the variance σB², and the averaged threshold chosen is not always the global one. Finally, it is not effective in non-uniform lighting conditions.
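A sketch of the Otsu computation described above is given below, working directly on the normalized histogram of an 8-bit image; it evaluates (5.22) for all candidate thresholds at once, using the cumulative first moment m(S) = P1(S)·μ1(S), and averages the maximizing levels when the maximum is not unique, as indicated in the text.

```python
import numpy as np

def otsu_threshold(f):
    """Otsu's optimal threshold (sketch): maximize the inter-class variance of
    Eq. (5.22) using only the normalized histogram p(l) of Eq. (5.8)."""
    hist, _ = np.histogram(f, bins=256, range=(0, 256))   # 8-bit image assumed
    p = hist / hist.sum()                                  # Eq. (5.8)
    l = np.arange(256)
    P1 = np.cumsum(p)                                      # P1(S), Eq. (5.11)
    m = np.cumsum(l * p)                                   # cumulative mean P1(S)*mu1(S)
    mG = m[-1]                                             # global mean, Eq. (5.14)
    # inter-class variance, Eq. (5.22); undefined where P1 is 0 or 1
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_B2 = (mG * P1 - m) ** 2 / (P1 * (1.0 - P1))
    sigma_B2 = np.nan_to_num(sigma_B2)
    # Eq. (5.23): if the maximum is not unique, average the maximizing levels
    return np.flatnonzero(sigma_B2 == sigma_B2.max()).mean()
```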
5.4.6 Method Based on Adaptive Threshold

In applications where the image is acquired in non-uniform lighting conditions, the thresholds selected with the previous methods do not produce a good segmentation. In this context, a single threshold value S for the entire image does not produce good segmentation results. The segmentation process is strongly influenced by the appearance of the objects, i.e., their reflectance properties (not easily modeled from the physical-mathematical point of view; see Chap. 2 Vol. I, Radiometric Model) and the non-uniform lighting conditions. In addition, the shadows generated by the objects themselves further modify their appearance in the captured image. In these cases, even the simple binarization of the image with the above-mentioned methods is not appropriate.
5.4.6.1 Method Based on Decomposition Into Sub-images

In these cases, a heuristic method based on the search for adaptive thresholds may be useful. It consists of splitting the image into M × M square sub-images that are independently binarized by calculating the respective thresholds with the previous global thresholding methods. Figure 5.9 shows the results of this method, where for each (i, j)-th sub-image (with 1 ≤ i, j ≤ M, not overlapped) the threshold Si,j is searched for its binarization.
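A sketch of the sub-image decomposition: each of the M × M non-overlapping blocks is binarized with its own global threshold. The per-block threshold function is left as a parameter (for example, the Otsu sketch given earlier); the default used here, the block mean, is only a placeholder assumption, and light objects on a dark background are assumed.

```python
import numpy as np

def blockwise_threshold(f, M=4, threshold_fn=None):
    """Adaptive threshold by decomposition into M x M sub-images (sketch):
    each non-overlapping block is binarized with its own global threshold."""
    if threshold_fn is None:
        threshold_fn = lambda block: block.mean()   # placeholder; any global method works
    f = np.asarray(f)
    g = np.zeros(f.shape, dtype=np.uint8)
    h, w = f.shape
    bh, bw = int(np.ceil(h / M)), int(np.ceil(w / M))
    for i in range(M):
        for j in range(M):
            block = f[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            if block.size:
                S = threshold_fn(block)
                g[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] = (block > S)
    return g

# e.g. with the Otsu sketch above as the per-block threshold:
# g = blockwise_threshold(image, M=4, threshold_fn=otsu_threshold)
```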
Fig. 5.9 Segmentation with adaptive threshold (based on decomposition into sub-images) in the case of an image that is not tractable with a single global threshold
5.4.6.2 Method Based on Background Normalization

Another proposed method is based on background normalization. Before starting the acquisition of the objects, the image of the background is determined, which models the uneven lighting conditions. The determination of the background can be done analytically, if the function that models the brightness of the acquisition surface is known, or experimentally, by directly acquiring the image of the background of a clear uniform surface (normally white) illuminated in the same conditions used for the acquisition of the objects. In this way the image of the background captures the conditions of non-uniform illumination, assuming constant the level of reflectivity of the acquisition surface. A good approximation of the non-uniform background image g(m, n) is obtained, up to a constant factor k, as follows:

g(m, n) = k · i(m, n)    (5.26)

where i(m, n) represents the light source and the constant k depends on the type of surface.
Fig. 5.10 Segmentation with background normalization in the context of non-uniform illumination: a Image acquired with uniform illumination (objects reflectance and acquisition plane); b Histogram of (a); c Uneven background acquired experimentally; d Object in (a) but acquired with non-uniform illumination; e Histogram of (d); f Automatic binarization with threshold S = 0.45; g Image (d) normalized against background (c); h Histogram of (g); i Binarized image obtained from (g) with global threshold S = 0.49
Given the background image g(m, n), which captures the non-uniform illumination i(m, n), according to the image formation model (see Sect. 5.7 Vol. I) the acquired image f(m, n) containing the objects will be influenced both by the non-uniform illumination and by the reflectivity r(m, n) of the objects themselves, and will be given by:

f(m, n) = r(m, n) · i(m, n)    (5.27)

Figure 5.10 shows a real acquisition situation (for example, the acquisition plane of a robotic cell) where (a) represents the image in uniform lighting conditions and constitutes the reflectance information, with (b) the bimodal histogram representing the well-separated objects and surfaces; (c) indicates the non-uniform background, and (d) represents the image f(m, n) obtained with (5.27), which approximates the image captured in the non-uniform lighting conditions i(m, n). The histogram (e) of the image (d) is no longer bimodal, with the consequent non-separability of the object from the background, as reported in (f). Given an approximation of the background obtained with Eq. (5.26), instead of segmenting the image f(m, n) we can derive its normalized version h(m, n) with respect to the background, as follows:

h(m, n) = f(m, n) / g(m, n) = r(m, n) / k    (5.28)
It follows that, if the image r(m, n) was segmentable with the threshold S, the normalized image h(m, n) can be binarized with the threshold S/k. The example of Fig. 5.10 shows, in (g), the normalized image h(m, n) calculated with (5.28), in (h) the relative bimodal histogram, and in (i) the result of the segmentation of the image in (g). Similar results are obtained by directly subtracting the image of the background g(m, n), acquired previously and averaged over N images of the same f(m, n) image. This would have the advantage of attenuating any noise generated by the non-uniform operation of the photosensors (instability of the individual photosensors).
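A sketch of the normalization (5.28) followed by a single global threshold is shown below; the background image, the threshold value and the small eps guard against division by zero are assumptions of the example.

```python
import numpy as np

def normalize_background(f, g_background, S, eps=1e-6):
    """Background normalization (Eqs. 5.26-5.28), sketched: divide the acquired
    image by the background image g(m,n) ~ k*i(m,n) and binarize h = f/g with a
    single global threshold (S/k in the notation of the text)."""
    f = np.asarray(f, dtype=float)
    g_background = np.asarray(g_background, dtype=float)
    h = f / (g_background + eps)        # Eq. (5.28); eps avoids division by zero
    return (h > S).astype(np.uint8)

# Hypothetical usage, with a background acquired from a white surface:
# binary = normalize_background(image, background, S=0.49)
```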
5.4.6.3 Method Based on Local Statistical Properties

A further adaptive threshold method is based on the local properties of the image. In essence, a threshold is calculated for each pixel of the image based on statistical information (average, variance, ...) computed in a neighborhood (for example, a 3 × 3 window), possibly related to the global mean and variance of the entire image. Local thresholds S(m, n) can be calculated as follows:

S(m, n) = a σL(m, n) + b μL(m, n)        S(m, n) = a σL(m, n) + b μG    (5.29)

where the constants a and b weigh the balance between the local standard deviation σL(m, n) and the local average μL(m, n) or the global one μG. The binary image g(m, n) is obtained by calculating the local threshold for each pixel (m, n) with the following criterion:

g(m, n) = 1 if f(m, n) > S(m, n), 0 otherwise    (5.30)

where the local thresholds S(m, n) are calculated with (5.29). Another way is given by the following relationship:

g(m, n) = 1 if f(m, n) > a σL(m, n) AND f(m, n) > b μG, 0 otherwise    (5.31)

This adaptive local threshold method is particularly effective for attenuating the instability of the photosensors, thus obtaining the binarization of the image with fewer artifacts. Moreover, it better highlights the small structures present in the images, not visible with the segmentation based on global thresholds.
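A sketch of the local-statistics threshold (5.29)–(5.30) follows, with the local mean and variance computed by box filtering; the window size and the constants a and b are assumptions to be tuned per application.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_stats_threshold(f, a=1.0, b=0.5, size=15, use_global_mean=False):
    """Adaptive threshold from local mean and standard deviation (Eqs. 5.29-5.30)."""
    f = np.asarray(f, dtype=float)
    mu_L = uniform_filter(f, size)                       # local mean
    var_L = uniform_filter(f * f, size) - mu_L ** 2      # local variance
    sigma_L = np.sqrt(np.clip(var_L, 0, None))           # local standard deviation
    mu = f.mean() if use_global_mean else mu_L           # muG or muL(m,n)
    S = a * sigma_L + b * mu                             # Eq. (5.29)
    return (f > S).astype(np.uint8)                      # Eq. (5.30)
```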
5.4.7 Method Based on Multi-band Threshold for Color and Multi-spectral Images

For some applications, where multi-spectral and color images are used, segmentation can be effective by recursively operating on the intensity histogram of each component (multi-spectral or color), searching for thresholds at different levels. This method consists in initially partitioning a component image into a dark region and a lighter one, by locating a local minimum between two peaks of the intensity histogram. Subsequently, separate histograms are calculated for each region. The process continues recursively until only regions whose histograms contain a single peak remain. A segmentation algorithm based on the recursive search of thresholds is the following:

1. Consider the component image f(i, j) as a single region.
2. Compute the intensity histogram H(l), with l = 0, ..., 255. Look for the most significant peak and determine two thresholds as the local minima on the two sides of the peak.
3. Partition the region into sub-regions in relation to the thresholds calculated in step (2).
4. Repeat steps (2) and (3) for each image region until the associated histograms contain only one significant peak.

The previous algorithm can be improved by considering in step (2) also the smoothed histogram, calculated as follows:

H̄(l) = (1 / (2L + 1)) Σ_{i=−L}^{L} H(l + i)    (5.32)
where L represents the number of intensity levels involved locally in the histogram smoothing. To extend this method to color or multispectral images, the previous algorithm is modified by repeating steps (2) and (3) on all the color or spectral components of the image and adding the following step:

3.1 All segmented (color or spectral) components are superimposed into a multi-component image. The process of segmentation continues on the regions existing in the multi-component image.

For this type of images (color and multispectral), the pixel p(m, n) is considered as an N-dimensional vector variable whose components pi, i = 1, ..., N, represent the color channels or the multispectral components. In this case the thresholding methods can be based on similarity criteria, grouping the pixels (in clusters) in the N-dimensional space. Applying a thresholding method based on the distance of each pixel p(m, n) from a given vector v, with a threshold d0 (in this case the threshold represents a Euclidean distance), we would have:

g(m, n) = 1 if d[p(m, n), v] < d0, 0 otherwise    (5.33)

where the distance d is calculated as d(p, v) = [(p − v)^T (p − v)]^{1/2}. In the case of color images, the components are represented by the Red, Green and Blue (RGB) components, or by derived color spaces such as the HSI (Hue Saturation and Intensity) space, the HSV (Hue Saturation and Value) space and the other color spaces described in Chap. 3 Vol. I on Color. In the case of multispectral images, different spectral components of the scene are available (in the visible 450 ÷ 690 nm and in the infrared 760 ÷ 2350 nm), chosen to provide information on the different structures of the territory (crops, rocks, rivers, woods, etc.). In the case of image segmentation with multiple components, it is necessary to avoid false segmentations of homogeneous structures, which introduce a strong fragmentation of the regions. The segmented multi-component image must be verified by a human observer to see if the segmentation process with multi-level thresholds leads to acceptable results. A good segmentation can be obtained by reducing the dimensionality of the color or spectral components by transforming them to the principal components (Karhunen–Loeve [1,2]).
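A sketch of the distance rule (5.33) for an N-component image is shown below; the reference vector v and the radius d0 are application-dependent assumptions.

```python
import numpy as np

def color_distance_threshold(img, v, d0):
    """Eq. (5.33): assign 1 to the pixels whose N-dimensional value is closer
    than d0 (Euclidean distance) to a reference vector v, 0 elsewhere.

    `img` is an (M, N, C) array and `v` a length-C vector (e.g. a prototype
    RGB color)."""
    img = np.asarray(img, dtype=float)
    v = np.asarray(v, dtype=float)
    d = np.sqrt(((img - v) ** 2).sum(axis=-1))   # d(p, v) = [(p-v)^T (p-v)]^(1/2)
    return (d < d0).astype(np.uint8)

# Hypothetical usage: select reddish pixels of an RGB image
# g = color_distance_threshold(rgb, v=(180, 40, 40), d0=60)
```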
Fig. 5.11 Segmentation with multilevel threshold for color images: a Original image; b Histogram relative to the first principal component; c Result of the segmentation with three classes of the first principal component of the image; d Result of the segmentation with three classes applied to the Hue component of (a) converted into the HSI space
While the interactive analysis with the one-dimensional histogram allows a direct view of the pixel distribution, with multi-component images the pixel distribution in the clusters can be observed only with 2D or 3D representations of 2 or 3 components respectively (for example, using the principal components). The segmentation of multispectral images is more effective using the decision methods of statistical theory for object identification (classification), which will be described in the chapter on Clustering in Vol. III. Figure 5.11 shows an example of segmentation with multiple threshold levels applied to the peppers color image. From the 3 RGB components the first principal component was calculated (its histogram is shown in (b)), on which the segmentation was applied using the method described above. The result is shown in Fig. 5.11c, while (d) shows the result of the segmentation applied to the H (Hue) component of the HSI space. As can be seen in Fig. 5.12, where the histograms of different color models of the peppers image are illustrated, the Hue component is more selective for the segmentation of color images. A further advantage of operating with the HSI model appears in applications where the lighting conditions become critical. In these cases, before applying the segmentation algorithm, the average level of the saturation component is checked first, to evaluate whether the lighting conditions are suitable for applying the segmentation on the H component.
5.5 Segmentation Based on Contour Extraction

The segmentation of an image can be realized on the basis of the contours that delimit the objects of the scene. The contours of an image can be extracted using edging algorithms (see Chap. 1) which, as is known, highlight all the intensity discontinuities (color, gray level, texture, etc.) present in the image. The results of the edging algorithms cannot be used directly for segmentation, since the boundaries that delimit the homogeneous regions show various interruptions due to the non-uniform illumination of the scene, to the effect of noise,
Fig. 5.12 Histograms of the peppers image in the RGB, HSI and HSV color spaces
occlusions, shadows, etc. To eliminate the problem of edge interruption, ad hoc algorithms are required that completely reconstruct the contours of each homogeneous region. The contour extraction algorithms use various methods, such as:

(a) Contour tracking (also called border following).
(b) Pixel connectivity.
(c) Connection of edge fragments (edge linking).
(d) Edge approximation (curve fitting).
(e) Hough transform.
(f) ...
Contour-based segmentation methods present some difficulties in the context of objects that overlap or touch, and when the same contours that delimit them are frequently interrupted due to noise and uneven lighting. In such circumstances ad hoc segmentation algorithms are developed.
5.5.1 Edge Following

Segmentation of the image based on contour extraction involves the following steps:

1. Edge detection.
2. Edge following, starting from the results of step 1 (the map of the edges and that of their directions) and using the information in the vicinity of each edge.
3. Edge linking, the connection of interrupted contours.
4. Region filling, the filling of the bounded regions.

The algorithms associated with the first step have already been described in the Edging chapter. The algorithms of the second step, which concern contour tracking, have been previously analyzed for binary images in the context of contour coding (see Sect. 7.3 Vol. I). In that case the regions are already defined (for example with threshold segmentation methods) and the boundary tracking algorithm approximates the boundary of each region with paths based on 4-connectivity or 8-connectivity. The same contour tracking algorithm can be modified to handle gray-level images where the regions are not yet defined. The contour is represented by a simple path (a path made of non-repeated pixels, each having no more than two adjacent pixels) of pixels with a high value of the gradient in the gradient map. In practice, the algorithm examines the gradient image line by line, analyzing each pixel with a maximum gradient value (or a value that exceeds a certain threshold). Once this pixel is found, the algorithm tries to extract the contour of the homogeneous area, following a path of pixels with maximum gradient value until the starting pixel is encountered again. In this case the contour obtained is described by a closed path. The search for the pixels of the contour is based on the grouping of pixels with maximum value of the gradient that form paths with 4-connectivity or 8-connectivity, and on the most probable direction of continuation of the contour. In essence, the gradient and direction information extracted with the edging algorithms are used. A contour tracking algorithm for gray-level images can be summarized as follows:

1. Look for a pixel of the contour. Analyze the gradient image line by line until a pixel Pi(x, y) is encountered with maximum gradient value (or with a gradient value that exceeds a predefined threshold).
2. Search for the next pixel. Let Pj(x, y) be the pixel adjacent to Pi(x, y) with maximum gradient value and with the same direction θ(Pi). If such a pixel exists, Pj is an element of the contour and step 2 is repeated. Otherwise proceed to the next step.
3. Examine the pixels adjacent to Pj. The average of the pixels adjacent to Pj, corresponding to the 3 × 3 window centered on Pj, is calculated. The average is compared with a predefined gray-level value in order to evaluate whether Pj is inside or outside the region under examination.
4. Continue tracking the contour. The pixel Pk is selected as an element adjacent to Pi in the direction θ(Pi) ± π/4. The choice of the sign depends on the result of the previous step. If the pixel Pk is a new element of the contour, the tracking continues, proceeding with step 2. If Pk is not an element of the contour, a new contour pixel Pi is searched for, proceeding with step 1.
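The following is a deliberately simplified, greedy sketch of the contour-tracking idea above: starting from the strongest edge pixel, it repeatedly moves to the unvisited 8-neighbor with the largest gradient above an assumed threshold, and stops when it returns next to the start or runs out of candidates. The direction-compatibility test of step 2 and the neighborhood-mean test of step 3 are omitted here.

```python
import numpy as np
from scipy import ndimage

def follow_contour(f, grad_threshold, max_steps=10000):
    """Greedy contour following on the gradient image (illustrative sketch)."""
    f = np.asarray(f, dtype=float)
    grad = np.hypot(ndimage.sobel(f, axis=1), ndimage.sobel(f, axis=0))
    start = np.unravel_index(np.argmax(grad), grad.shape)   # strongest edge pixel
    path, visited = [start], {start}
    y, x = start
    for _ in range(max_steps):
        candidates = []
        for yy in range(max(y - 1, 0), min(y + 2, f.shape[0])):
            for xx in range(max(x - 1, 0), min(x + 2, f.shape[1])):
                if (yy, xx) == (y, x) or (yy, xx) in visited:
                    continue
                if grad[yy, xx] >= grad_threshold:
                    candidates.append((grad[yy, xx], (yy, xx)))
        if not candidates:
            break                                   # open contour: no more strong edges
        _, (y, x) = max(candidates)                 # neighbor with maximum gradient
        path.append((y, x))
        visited.add((y, x))
        if len(path) > 3 and abs(y - start[0]) <= 1 and abs(x - start[1]) <= 1:
            path.append(start)                      # back next to the start: closed contour
            break
    return path
```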
Fig. 5.13 Contour extraction and connection of edge fragments: a Original image; b Contour extraction; c Connection of contour breaks
Fig. 5.14 Representation of the edge graph of an image and choice of the optimal path through the graph, evaluating the direction and module of the gradient
The contour extracted with the previous algorithm can be described geometrically by means of straight segments and by curve sections (see Fig. 5.13). The formal description of the contour representation will be analyzed below.
5.5.2 Connection of Broken Contour Sections

In several applications, contour extraction algorithms are forced to operate with images whose contours show various interruptions due to noise. In this context, algorithms are required for connecting the interrupted contours (edge linking) in order to produce closed contours. Linear and nonlinear approximation methods can be used if it is possible to describe the contour with straight and curvilinear segments. This is simplified if we have a priori knowledge of the shape of the boundary to be searched. In the example of Fig. 5.14, the contour breaks in the sections P2P3 and P4P5 can be approximated by sections of a parabolic curve, while the sections P6P7 and P8P1 are approximated with straight sections. Several edge-linking algorithms have been developed for ad hoc applications on heuristic bases. An alternative way to solve the problem is to consider a contour as an optimal path P1, P2, ..., Pn of a graph whose nodes represent the pixels of the image and whose links represent the connectivity of two adjacent pixels (Pi, Pj). In other
words, the contour tracking algorithm is reformulated as the search for the best path linking the start and end pixels of the contour. This approach, although computationally expensive, may be necessary in the context of noisy images. Let us recall some definitions on graphs. A graph G = (P, A) consists of a non-empty set of nodes {Pk, k = 1, ..., n} and of a set A of unordered pairs of distinct nodes (Pi, Pj). Each pair is called a link (or edge). In the pair (Pi, Pj), if the link is directed from node Pi to node Pj, then Pj is also called the expansion of node Pi. A graph is characterized by different levels of nodes. Level 0 of the graph coincides with the single starting node of the graph; the last level of the graph contains the set of final nodes (goals). Between the extreme levels there are the intermediate ones. A cost may be associated with each link (Pi, Pj). A sequence of nodes P1, P2, ..., Pm, where the generic node Pi is the successor of Pi−1, is called a path from P1 to Pm, whose cost is defined by:

Cm = Σ_{i=1}^{m−1} g(Pi, Pi+1)    (5.34)
where g(Pi, Pi+1) is the cost associated with the link (Pi, Pi+1). In the context of contour tracking, we can consider the image as represented by a graph, and a path P1, P2, ..., Pm constitutes a contour with initial pixel P1 and final pixel Pm. Each node contains the associated information of the pixel, i.e., the module of the gradient ∇Pi and the direction of the contour θ(Pi), which is orthogonal to the direction of maximum gradient. A graph can be constructed if we consider links (Pi, Pj) that connect the corresponding adjacent pixels Pi and Pj with the criterion of 8-connectivity, and if the local directions θ(Pi) and θ(Pj) are compatible with the direction of the contour within certain limits. For example, we can consider the link (Pi, Pj) valid if the direction of Pj as successor of Pi is in the interval [θ(Pi) − π/4, θ(Pi) + π/4] and if the values of the gradient ∇(Pi) and ∇(Pj) are above a default threshold value S, which indicates a good probability that the pixel is an element of the contour. To search for the contour between the pixels PA and PB shown in Fig. 5.14, there are different possible paths. The choice of the optimal path can be made with the following heuristic search procedure:

1. Consider the starting pixel PA; initialize the list of the contour, Path, and the cost function gA(k) of the path from the node PA to the node Pk. Each element of the Path list maintains the index of the nodes of the path being processed and a pointer to the previous node of the optimal path.
2. Expand the initial node PA by generating all its successors and insert these successor nodes into the Path list. Successors are evaluated according to the direction of the adjacent pixels as indicated above. Calculate the cost function of the path from the node PA to each successor node found. The sum of the modules of the gradient along the various paths from PA to the successor nodes found can be used as cost function.
3. If the Path list is empty, this indicates an error condition. Otherwise, determine in the Path list the pixel Pk with which the maximum value of the cost function gA(k) is associated.
Fig. 5.15 Labeling of the connected components, each represented in a different color; the numbers indicate the temporary labels assigned in the first phase. The numbers with the circle indicate the pixels in which the equivalence relation is detected and annotated
If Pk = PB, the optimal path is complete and the procedure ends. This path consists of all the nodes of the Path list that have a non-null pointer, which allows the optimal path to be retraced backwards.

4. Expand the node Pk identified in the previous step and insert all its successors in the Path list, generating pointers to the relative node Pk. Calculate the cost function of the alternative paths associated with each successor of Pk found. Return to step 3.

This procedure can be modified to manage closed contours; in this case the starting node PA, with maximum value of the gradient, coincides with the end node PB. In general, the cost function gA(B) associated with the path between the pixels PA and PB can be conditioned by a generic intermediate pixel Pk and is defined by the relation:

gA(B) = gA(k) + gk(B) = Σ_{i=2}^{k} g(Pi−1, Pi) + Σ_{j=k}^{m−1} g(Pj, Pj+1)    (5.35)
where the component gA(k) indicates the cost function relative to the path from the initial pixel PA to the intermediate pixel Pk, while the component gk(B) indicates the cost function relative to the path between the intermediate pixel Pk and the extreme pixel PB of the contour. Figure 5.14 shows the values used by the procedure applied to the 5 × 8 image of the previous example, i.e., the value of the gradient ∇Pk for each pixel of the image and the direction of the contour θ(Pk), which is orthogonal to the direction of maximum gradient. Each node Pi is expanded by creating successors Pj that satisfy the following conditions:

(a) |θ(Pi) − θ(Pj)| < 90° for every successor Pj of Pi;
(b) Pi and Pj are adjacent, considering only three of the 8-connectivity directions.

We want to extract the contour from the pixel PA ≡ P1 to the final pixel PB ≡ P8. The successors of P1 satisfying the conditions indicated above are the pixels P2 and
P21, with gA(2) = g(PA, P2) = ∇PA + ∇P2 = 9 and gA(21) = ∇PA + ∇P21 = 8. Therefore the pixel P2 is selected (having the higher cumulative gradient value) while P21 is discarded. The successors of P2 are the pixels P3, P18 and P19, and the corresponding cumulative gradient functions are respectively:

gA(3) = ∇P1 + ∇P2 + ∇P3 = 15
gA(18) = ∇P1 + ∇P2 + ∇P18 = 10
gA(19) = ∇P1 + ∇P2 + ∇P19 = 14

Therefore P3 is selected as the path pixel with maximum cumulative gradient value, while P18 and P19 are discarded. The process is repeated until the final pixel PB is reached, obtaining at the end the path P1, P2, ..., P8, which corresponds to the maximum value of the cumulative gradient function gA(B). In general, the heuristic methods for searching a path in a graph do not always guarantee optimal results when applied to the automatic search of contours. When appropriately adapted to specific real problems, and choosing appropriate path evaluation functions, acceptable results can be obtained with adequate calculation times.
5.5.3 Connected Components Labeling A labeling algorithm aims to automatically label all connected components that are in a binary image. The image is normally analyzed sequentially by row and all the pixels are grouped into connected components based on the analysis of their connectivity. In essence, all the pixels belonging to a connected component share the same property, for example, they have similar intensity values and are connected to each other. Once the pixels are grouped, each of them is attributed with a label representing a gray-level or color according to the connected component to which it is assigned. Let’s see now how a labeling algorithm works. Starting from a binary image, this is sequentially analyzed row-by-row (starting from the first row) and each pixel is examined (from left to right) to identify pixels that belong to the same region or adjacent pixels that have the same intensity. In the case of binary images it assumes intensity V = 1, whereas for images at grey levels the similarity value V is associated with a range of gray levels (the background of the image is normally assumed to be zero). The essential steps of a labeling algorithm are: 1. Scan of the first row y = 1, I (x, 1), x = 1, · · · , N . The pixels are examined until a pixel I (x, 1) with value V is encountered and the analysis of 4-connectivity begins. Since the first row and the first pixel to be labeled, the first label L = 1 is assigned to the pixel I (x, 1). For the sake of clarity, an output image is created E(x, 1) = L, the image of the connected components labeled. The scanning of the first row continues (x = x + 1) and as soon as I (x, 1) = 0 the adjacent left pixel I (x − 1, 1) is examined (see Fig. 5.15). I f I (x, 1) = I (x − 1, 1) the current pixel xth inherits the label of the previous
296
5
Image Segmentation
one E(x, 1) = E(x − 1, 1). I f instead I (x, 1) = I (x − 1, 1) a new label L = L + 1 is generated and the pixel xth is assumed to belong temporarily to a different connected component E(x, 1) = L. 2. Scan the remaining rows y = 2, · · · , M. The first pixel of each row should be examined as follows: i f I (1, y) = V and I (1, y) = I (1, y − 1) then E(1, y) = E(1, y − 1) that is, it inherits the same label as the upper adjacent; otherwise if I (1, y) = I (1, y − 1) then it creates a new label L = L + 1 and considers the current pixel (1, y) temporarily belonging to a new connected component E(1, y) = L. Continue to scan the rest of the pixels of the yth line by examining the pixels with 4-adjacency i.e. the adjacent left pixel (x − 1)th and the adjacent upper one (y − 1)th are examined (see Fig. 5.15). I f I (x, y) = I (x −1, y) and I (x, y) = I (x, y −1) then E(x, y) = E(x −1, y). I f I (x, y) = I (x −1, y) and I (x, y) = I (x, y −1) then E(x, y) = E(x, y −1). I f I (x, y) = I (x, y − 1) and I (x, y) = I (x − 1, y) then E(x, y) = L + 1. I f I (x, y) = I (x, y − 1), I (x, y) = I (x − 1, y) and E(x − 1, y) = E(x, y − 1) then E(x, y) = E(x, y − 1). The following situation should also be considered: I f I (x, y) = I (x, y −1), I (x, y) = I (x −1, y) and E(x −1, y) = E(x, y −1). This implies that the diagonal pixels in the left and upper position (respectively (x − 1, y) and (x, y − 1)), with respect to the current pixel (x, y) have the same value but different label. This tells us, in fact, that diagonal pixels really belong to the same component and, consequently, they should have the same label, but with sequential scanning, this situation could not be predictable. This explains the statement, indicated above, of temporary assignment of the new label. This situation is managed taking note that the two diagonal pixels must have the same label and the smallest value of the diagonal pixel is assigned to the pixel under examination (x, y). In other words, we can say that the two diagonal pixels are connected by the pixel under examination. Therefore we can assign to the pixel under examination the label of the left pixel, E(x, y) = E(x −1, y) ed, write down an equivalence between the labels of the diagonal pixels that is E(x − 1, y) ≡ E(x, y − 1). Figure 5.15 highlights the pixels that connect the diagonal pixels thus generating the equivalences between the labels. 3. Label reassignment. At the end of the previous step, all the pixels with a V value of the input image I (x, y) have been assigned a label, obtaining an output image E(x, y) whose labels must be updated based on annotated equivalences. Given the equivalence relations L i ≡ L j between the labels, the set of equivalence classes must be determined, to assign a unique label for each connected component. This is accomplished by organizing the set of equivalent label classes into a tree structure by forming a binary forest and using the Fisher and Galler algorithm [3] such a forest is visited in the advanced order order to extract the classes equivalent. The latter are stored in a vector L K T (k), k = 1, · · · , N _label, to be used as an associative table (look-up table) to find the correspondence of the
equivalent labels belonging to the same class, that is, the labels that identify a region with a unique label (see Fig. 5.15). The temporary output image E(x, y) is scanned again, row by row, and the values of the current labels are used as pointers into the LKT to assign a unique label to each connected component. The example in Fig. 5.15 shows the result of the first labeling phase (steps 1 and 2) applied to label 3 connected components (represented with different colors). The figure highlights the values of the provisional labels assigned in the first phase, together with the pixels (numbered with circles) at which the algorithm detects and annotates the equivalence relations among the temporarily assigned labels. In the second phase (step 3), the algorithm, starting from the detected equivalences (1, 2), (1, 6), (3, 4) and (5, 7), finds the equivalence classes {1, 2, 6}, {3, 4} and {5, 7} and assigns the definitive labels to the components. This is done by using the labels of the first phase as pointers into the LKT vector, which contains the equivalence classes of the labels, i.e., all the labels of the first phase are definitively updated as E(x, y) = LKT(E(x, y)). In many image analysis applications (area calculation, detection and counting of isolated objects, ...) it is important to automatically detect and label the connected components present in a binary image.
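To make the two-pass procedure concrete, the following is a minimal Python/NumPy sketch (a simplified example, not the authors' implementation) of the labeling algorithm with 4-connectivity. The annotated equivalences are resolved with a simple union–find structure, which plays the role of the Fisher–Galler forest, and the resulting class representatives act as the LKT look-up table.

```python
import numpy as np

def label_connected_components(binary):
    """Two-pass 4-connectivity labeling of a binary image (background = 0)."""
    rows, cols = binary.shape
    labels = np.zeros((rows, cols), dtype=np.int32)
    parent = [0]                      # union-find forest; parent[k] == k for roots

    def find(k):                      # root (class representative) of label k
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    def union(a, b):                  # annotate the equivalence a == b
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    next_label = 0
    # First pass: provisional labels and equivalences
    for y in range(rows):
        for x in range(cols):
            if binary[y, x] == 0:
                continue
            left = labels[y, x - 1] if x > 0 and binary[y, x - 1] else 0
            up = labels[y - 1, x] if y > 0 and binary[y - 1, x] else 0
            if left == 0 and up == 0:          # new provisional label
                next_label += 1
                parent.append(next_label)
                labels[y, x] = next_label
            elif left and up:                  # both neighbors labeled
                labels[y, x] = min(left, up)
                union(left, up)                # note the equivalence
            else:                              # inherit from the labeled neighbor
                labels[y, x] = max(left, up)

    # Second pass: replace provisional labels with class representatives (the LKT)
    lkt = {k: find(k) for k in range(1, next_label + 1)}
    for y in range(rows):
        for x in range(cols):
            if labels[y, x]:
                labels[y, x] = lkt[int(labels[y, x])]
    return labels
```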
5.5.4 Filling Algorithm for Complex Regions In several applications, such as computer graphics, image analysis, object recognition and, in particular, computerized cartography, it is necessary to produce thematic maps composed of regions with complex contours. The interior of each region must be characterized by a unique value, normally associated with a color or gray level that represents a generic theme. Several filling algorithms [4,5] are based on the knowledge of the contours and on internal pixels used recursively as seeds. A quick filling method is based on the raster¹ scan, which labels the regions in two steps. The input image contains only the complex contours of the regions, without knowledge of the coordinates of the contour points. With the raster scan of the input image, only the value of the pixels belonging to the boundary is known, and each internal pixel of each region is automatically assigned a unique value in a manner analogous to the labeling algorithm described above for the connected components. In [6] a raster-filling algorithm is described which directly labels regions with closed and complex contours in two raster scans of the input image. The first step uses a labeling algorithm similar to the one described in the previous paragraph, appropriately adapted to analyze the boundary pixels instead of the pixels of the connected components. The second step is identical: find the equivalence classes of the labels
¹ The word raster derives from television technology, where it indicates the horizontal scan of a monitor's video signal. In computer graphics it is used to indicate the grid (matrix) of pixels constituting a raster image or bitmap image.
Fig. 5.16 Results of the region filling algorithm. a The input map with the contours of the regions; b The output map after the first algorithm step; c Final result after the two steps of the algorithm
and update the output image with the LKT vector computed by applying the Fisher and Galler algorithm [3] to the set of equivalence classes found in the first step. Figure 5.16 shows the results of this filling algorithm applied to a geological map characterized by regions with complex contours, some of which are also nested. In some contexts, the final result can be displayed on a computer monitor without updating the output image produced in step 2, by directly loading the LKT vector into the look-up table of the video memory, thus obtaining the final result (see Fig. 5.16c) immediately on the monitor.
5.5.5 Contour Extraction Using the Hough Transform As an alternative to the previous methods, based on the local analysis (adjacency, orientation, gradient, etc.) of the contour pixels for the extraction of the contours and the connection of any interruptions, we now consider a method based on a global relation among the pixels. This global relationship is appropriate for the segmentation of images that present objects whose shape and size are known. For example, in the case of remotely sensed images for the study of the territory, it is essential to identify particular geometric structures (lines, polygons, etc.) that represent roads, bridges, cultivated land, and so on. In the inspection of printed circuits it is necessary to identify linear and circular structures; likewise, in industrial automation the vision system of a robotic cell locates, in the work plane, objects with a predefined circular shape. In relation to these contexts, the contour extraction algorithm can be specialized to automatically search for particular contours associated with predefined geometric figures such as lines, circles, polygons, etc. The Hough transform [7,8] can be used to solve the segmentation of the image under the conditions indicated above and, more generally, also when objects touch or overlap, generating more complex contours.
5.5.5.1 Hough Transform for Line Extraction The Hough transform aims to evaluate structural and non-structural relationships between the pixels of the image. For example, given an image, we want to verify whether there are significant linear geometric structures, in any direction and at any position in the image. Duda [9] was the first to use the Hough transform to extract lines in an image. The problem reduces to verifying whether there are sets of pixels aligned along a straight line. One way to solve this problem is to first consider two pixels, identify the line passing through them, and then check which of the significant pixels are well aligned with, or not far from, that line. If there are n pixels to be analyzed, this procedure should be repeated n(n − 1)/2 ≅ n² times (once per pair of pixels), and then the alignment should be checked by comparing each of the n points with all the lines, requiring about n × n(n − 1)/2 ≅ n³ comparisons. The required computation time would be prohibitive. A clever solution to this problem is given by the Hough transform. In the image plane, the equation of the bundle of lines passing through a pixel (xi, yi), in explicit form, is given by: yi = p xi + q
(5.36)
where p and q are, respectively, the slope and the intercept of each line of the bundle. The previous equation can be rewritten in the form: q = yi − p xi
(5.37)
defining in this way a new pq plane, called the parametric plane. In essence, in the parametric plane x and y become parameters while p and q are the variables (an inversion of roles). It is observed that to the pixel (xi, yi) in the xy image plane there corresponds a single straight line, given by (5.37), in the parametric pq plane. If we consider (see Fig. 5.17a) a second pixel (xj, yj) in the image plane, there corresponds in the parametric plane (see Fig. 5.17b) again a new line q = yj − p xj
Fig. 5.17 Point–line correspondence between the image plane (x, y) and the Hough parametric plane (p, q). Different lines that intersect at the same point (p′, q′) in the parametric plane correspond to different points (x, y) belonging to the line y = p′x + q′ in the image plane
which intersects the previous line, associated with the pixel (xi, yi), at the point (p′, q′). We observe that the point (p′, q′) is the intersection point (i.e., the common point) in the parametric space of the lines respectively associated with the pixels (xi, yi) and (xj, yj), which belong to the same line y = p′x + q′ in the image plane. From this it follows that, if we consider a third pixel (xk, yk) aligned along the same line y = p′x + q′, a third line is generated in the parametric plane, with the characteristic of intersecting the other lines at the same point (p′, q′). This result of the Hough transform, of representing a line of the image plane with a single point (p′, q′) in the parametric space, can be used to solve the problem initially posed, i.e., to verify whether n points in the image plane are aligned. More generally, we can address the problem of automatically extracting the linear structures of a contour without having prior knowledge. The starting information is the value of the gradient and the direction of each pixel, obtained by applying the known edge extraction algorithms. By means of a predefined threshold (or other methods) the most significant contour elements (maximum value of the gradient) can be extracted. In relation to a particular application, it may be sufficient to identify lines with a limited number of directions, thus simplifying the Hough transform. In fact, the reduction of the possible directions of the lines to be searched in the image leads to a coarse discretization of the slope parameter p and consequently also of the parameter q. This leads to subdividing the parametric pq-plane with a not very dense grid, where each cell represents an appropriate interval of the parameters p and q. The parametric plane thus defined is completely represented by the accumulation matrix A(p, q), where in this case the indices p and q identify the cells of the discretization grid (see Fig. 5.18). At the beginning of the procedure the matrix A is
Fig. 5.18 Discretization in cells of the Hough parametric plane (p, q). Each accumulation cell (p, q) records the number of straight lines passing through it. The cell (p′, q′) represents the parameters p′ and q′ which, in the image plane, identify the line of equation y = p′x + q′ satisfied by the set of points (x, y) lying on that line. With reference to Fig. 5.17, the cell (p′, q′) has been incremented 3 times, while the other cells crossed by the three lines are incremented only once
initialized to zero. For each pixel P(xk, yk) belonging to a potential contour, all the elements of the accumulation matrix A(p, q) identified by the straight line q = yk − p xk are incremented by one unit, that is, varying p from cell to cell we calculate the corresponding values of q. If the image contains a rectilinear contour, for each pixel belonging to it the accumulation matrix A is incremented, A(p, q) = A(p, q) + 1, for all the values (p, q) generated by that pixel P(xk, yk). Repeating the process for all the contour pixels, at the end of the procedure one element (p′, q′) of the accumulation matrix will have a maximum value Q, which theoretically coincides with the number of pixels of the rectilinear contour analyzed (y = p′x + q′), while all the other elements will have at most the value 1 (they can remain at zero depending on the level of discretization of the parameters p and q). In other words, the presence in the accumulation matrix of a single peak at a single element (p′, q′) implies that the transformed pixels were collinear in the image plane and belong to the line of equation y = p′x + q′. In real images, due to noise, the accumulation matrix has several peaks, which probably indicate different linear structures in the image. In other words, the search for linear structures in the image, which can in general be complex, reduces with the Hough transform to the search for local maxima in the accumulation matrix. Note that, with this approach, even if the contour presents several interruptions, the linear structures are identified anyway in the parametric space. In this case, the lack of pixels in the interrupted sections of the boundary has only the effect of producing lower peaks in the accumulation matrix, down to small values of Q corresponding to isolated points. As shown in Fig. 5.18, in real images, both because of noise and because of the non-perfect collinearity of the pixels, a linear structure does not accumulate in the parametric space at a single point but involves a small area, whose center of mass is calculated to identify the parameters (p′, q′) representing the linear structure. This also depends on the resolution of the cells of the parametric plane, i.e., on the level of discretization of p and q. If p is discretized in K cells, for each generic pixel P(xi, yi) we will have K values of q. It follows that a single linear structure with n pixels requires a linear computational cost of nK. When vertical linear structures are considered, the equation of the straight line in explicit form, y = px + q, is no longer adequate because the parameter p → ∞ and is no longer easily discretizable (it would become too large). To eliminate this problem, another parametric representation of the lines is used, expressed in polar coordinates: ρ = x cos θ + y sin θ (5.38) where ρ is the shortest distance of the line from the origin and θ is the angle between the perpendicular to the line and the x axis. In this case too, the property of the Hough transform of identifying a line of the image plane with a single point (ρ, θ) of the new parametric space ρθ remains unchanged. In fact, as shown in Fig. 5.19b, a line in the spatial domain is projected to a point in the parametric domain. The only difference is that in correspondence of the pixel
Fig. 5.19 a Representation of the straight lines in polar coordinates in the image plane and b their representation as sinusoidal curves in the corresponding parametric plane (ρ, θ). (ρ′, θ′) is the intersection point corresponding to the line passing through the points (xi, yi) and (xj, yj) in the image plane. c The discretized parametric plane (ρ, θ) is also shown
(xi, yi) of the image plane, in the parametric domain there is no longer a straight line but a sinusoidal curve. The collinearity of the pixels in the spatial domain (ρ₀ = x cos θ₀ + y sin θ₀) is verified, as before, by finding in the accumulation matrix the point (ρ₀, θ₀) that corresponds to the common intersection point A(ρ₀, θ₀) of the M curves in the parametric domain. In particular, a horizontal line of the image plane is represented in the parametric plane by θ = 90° with ρ ≥ 0 equal to its intercept with the y-axis, while a vertical line is represented by θ = 0° with ρ equal to its intercept with the x-axis. Also with this parametric representation it is important to adequately quantize the parametric space (ρ, θ), again represented by the accumulation matrix A(ρ, θ), with ρ ∈ [ρmin, ρmax] and θ ∈ [θmin, θmax]. For an image of size N × N, the range of ρ is ±N√2. Also in this new parametric space, if the line in the image plane consists of M pixels, we will have M sinusoids that intersect at the common accumulation point characterizing this line. Figure 5.20 shows the results of the application of the Hough transform for the automatic detection of linear structures in 3 types of images. So far, only linear structures of the image plane have been modeled in the parametric plane. Let us now see how the Hough transform can be generalized for the automatic search for contours with more complex geometric shapes. In general, a contour can be represented by the more general equation: f(x, y, a1, a2, ..., an) = f(x, y, a) = 0
(5.39)
where a represents the vector of the curve's parameters. The search for a curvilinear contour using the Hough transform requires the following procedure: 1. The parametric domain (a1, a2, ..., an) must be appropriately quantized, defining an accumulation matrix A(a1, a2, ..., an) which is initially set to zero. 2. For each pixel (x, y) of the image with maximum value of the gradient, or with value of the gradient higher than a predefined threshold, increase by one unit all the cells of the accumulation matrix A(a1, a2, ..., an) that satisfy the
Fig. 5.20 Detection of lines using the Hough transform. a Original image with 6 lines; b Hough plane with the 6 peaks relative to the six lines; c Original image of the contours; d Hough plane with the 30 most significant peaks; e Detected lines (indicated in yellow and red at the ends) superimposed on the original image; f Remote sensing image; g Hough plane for the image (f); h Significant linear structures detected in the image (f)
parametric Eq. (5.39) according to the parameter definition limits ai defined in step 1. 3. The accumulation matrix A is analyzed. Each significant peak A(a1 , . . . , an ) is a candidate to represent in the spatial domain the curve of origin f (x, y, a1 , . . . , an ) = 0.
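As an illustration of the accumulator-voting scheme just described, the following is a minimal Python/NumPy sketch (a simplified example, not the authors' implementation) of the Hough transform for straight lines in the polar parameterization of Eq. (5.38); the binary edge image, the accumulator resolution and the number of returned peaks are assumptions of the example.

```python
import numpy as np

def hough_lines(edges, n_theta=180, n_rho=200, n_peaks=5):
    """Vote edge pixels into an accumulator A(rho, theta) and return the strongest lines."""
    rows, cols = edges.shape
    diag = np.hypot(rows, cols)                          # rho ranges in [-diag, +diag]
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)

    ys, xs = np.nonzero(edges)                           # coordinates of the edge pixels
    for x, y in zip(xs, ys):
        rho = x * np.cos(thetas) + y * np.sin(thetas)    # Eq. (5.38) for every theta cell
        r_idx = np.round((rho + diag) / (2 * diag) * (n_rho - 1)).astype(int)
        acc[r_idx, np.arange(n_theta)] += 1              # one vote per (rho, theta) cell

    # local maxima of the accumulator correspond to linear structures in the image
    peaks = np.argsort(acc.ravel())[::-1][:n_peaks]
    return [(rhos[i // n_theta], thetas[i % n_theta]) for i in peaks], acc
```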
5.5.5.2 Circle Hough Transform - CHT As a simple case, suppose we want to automatically recognize circular structures in the image. These structures can be approximated by a circle whose analytical description is given by: (x − a)² + (y − b)² = r² (5.40) where (a, b) indicates the center of the circle and r the radius. It is observed that, in this case, a 3D accumulation matrix A(a, b, r) is required, associated with the parametric space defined by the parameters a, b and r. A pixel (x1, y1) in the image plane, element of a circular contour of radius r and center (a1, b1), generates in the parametric space a circle of equation: (a − x1)² + (b − y1)² = r²
(5.41)
which is the locus of the centers of all the circles of radius r passing through (x1, y1) in the image plane. Considering all the points (xi, yi) of the circular structure and repeating the same reasoning in the parametric space, we generate as many circles, all of which intersect at a single point (a1, b1), which coincides with the coordinates of the center of the circle from which the points (xi, yi) are derived.
Fig. 5.21 Image plane and 2D/3D parametric Hough plane for the detection of circular figures. Each pixel in the image plane generates a Hough circle in the parametric space, which can be 2D if the circular figures to be detected have a constant radius, or 3D with a variable radius. The point (a′, b′) in the parametric plane, where the circles generated by the pixels of the image plane intersect, constitutes the center of the circle detected in the image plane
Therefore, also in this case, the set of pixels that describe a circle in the image plane, when transformed into the parametric space with (5.41), produces a single peak at a point of the accumulation matrix (see Fig. 5.21). If there are several circular structures with a constant radius in the image plane, the accumulation matrix will contain different peaks at the positions (ai, bi) that represent, with good reliability, the centers of the circles existing in the image. When the radius r of the circular structures is not known a priori, the parametric space is three-dimensional (see Fig. 5.21), with a consequent considerable increase in the computational load. The computational load grows exponentially with the dimension of the parametric space as the number of parameters describing more complex curves increases. To considerably reduce the computational complexity, the Hough transform is applied to the gradient image, considering also the direction information, if available, for each pixel of the curved structure. Returning to the search for circular structures with known constant radius, the Hough transform is applied to each edge pixel (xi, yi) with the consequent increment, in the accumulation matrix, of all the cells (a, b) belonging to the circle of center (xi, yi) in the parametric space. If the direction of the contour is used for each pixel, the number of cells to be incremented is greatly reduced. For this purpose it is convenient to rewrite the equation of the circle in parametric form in polar coordinates:

x = a + r cos θ    y = b + r sin θ    (5.42)
The direction θ(xi, yi) of the contour at each pixel (xi, yi) is estimated by the edge extraction process (together with the gradient image), with a certain error that can be evaluated in relation to the application context. It follows that, in the parametric space, the cells (a, b) to be incremented are only those calculated by the following functions:

a = x − r cos(ϕ(x, y))    b = y − r sin(ϕ(x, y))    (5.43)

for the angles ϕ(x, y) that meet the following condition:

ϕ(x, y) ∈ [θ(x, y) − Δθ, θ(x, y) + Δθ]    (5.44)
Fig. 5.22 Detection of circles by the Hough transform. In the first image the circles with variable radius 60÷80 pixels were detected while for the second image the coins have the same diameter (about 40 pixels)
where we recall that θ(x, y) is the estimated direction of the edge at the pixel (x, y) and Δθ is its maximum expected error. Figure 5.22 shows the results of the Hough transform for the automatic detection of circles, applied to two images with circular objects of variable size (in the first) and constant size (in the second image).
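The following is a hedged Python/NumPy sketch of the Circle Hough Transform for a known radius; when an edge-direction image is supplied, the vote is restricted to an angular interval around it, in the spirit of Eqs. (5.43)–(5.44). The angular tolerance d_theta and the input arrays are assumptions of the example.

```python
import numpy as np

def hough_circles_fixed_radius(edges, radius, grad_dir=None, d_theta=np.deg2rad(10)):
    """Accumulate circle centers A(a, b) for a known radius.
    If grad_dir (edge direction per pixel) is given, only angles within
    +/- d_theta of it are voted, restricting the cells as in Eqs. (5.43)-(5.44)."""
    rows, cols = edges.shape
    acc = np.zeros((rows, cols), dtype=np.int64)
    angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)

    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        if grad_dir is not None:
            phi = grad_dir[y, x]
            # wrapped angular difference; depending on the gradient sign convention,
            # both phi and phi + pi may need to be considered in practice
            cand = angles[np.abs(np.angle(np.exp(1j * (angles - phi)))) <= d_theta]
        else:
            cand = angles
        a = np.round(x - radius * np.cos(cand)).astype(int)   # candidate centers, Eq. (5.43)
        b = np.round(y - radius * np.sin(cand)).astype(int)
        ok = (a >= 0) & (a < cols) & (b >= 0) & (b < rows)
        np.add.at(acc, (b[ok], a[ok]), 1)                     # vote the candidate centers
    return acc   # peaks of acc are the detected circle centers (a, b)
```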
5.5.5.3 Generalized Hough Transform - GHT When the geometric structure of the contour to be extracted is complex, requiring a considerable number of parameters, or when an analytical description is not easy to find, a solution to the problem is offered by the generalized Hough transform, proposed in 1981 by Ballard [10]. This method detects a contour of complex shape based on the a priori knowledge of its model, which is initially acquired through a learning and coding process. As a first approach, we assume that the model and the contour to be extracted are invariant in size and rotation. Figure 5.23 shows the sample contour (the model). An internal reference point P(x, y) is chosen, preferably in the central area of the region. A half-line with fixed orientation is drawn starting from P(x, y) and intersects the contour at the point C(x1, y1). The direction of the boundary at the point C(x1, y1) is represented by the tangent line t1, whose normal is the direction of the maximum gradient. Now consider the following parameters associated with the selected reference point P(x, y): (a) θ1, the edge orientation at the point C(x1, y1) with respect to the x-axis; (b) α1(θ1), the angle formed by the half-line r1 with the x-axis; (c) R1, the distance of the edge point C(x1, y1) from the reference point P(x, y). Continuing to trace several half-lines, always starting from the reference point P, the boundary is intersected at different points C1, C2, ..., Cn. For each point Ci of the contour, by construction, the values of Ri and αi are calculated, and the direction θi of the tangent is determined. These triples (θi, Ri, αi; i = 1, ..., n) of values are stored in a table, the R-table, which constitutes the encoding of the contour model together with the coordinates of the reference point P(x, y). For convenience, the table can be sorted by increasing value of the edge direction θi. It is also observed (see Fig. 5.23) that at some points the
Fig. 5.23 Image plane of the geometric modeling of the Generalized Hough Transform (GHT). Generic contour and related model construction table
contour can have the same direction, i.e., θ(xi, yi) = θ(xj, yj). These cases are handled by inserting in the ith row of the table the corresponding mi pairs (Rik, αik; k = 1, ..., mi) which have the same edge direction θi. Let us now see how the transform works for contour extraction in real images. From the gradient image, each pixel (x, y) with significant gradient value and direction θ is analyzed to determine a possible location (a, b) of the reference point. This is done with the following equations:

a = x − R(θ) cos(α(θ))    b = y − R(θ) sin(α(θ))    (5.45)
The calculation of the parameters a and b (which feed the accumulation matrix A(a, b)) is possible by selecting in the R-table the list of pairs (Ri, αi) corresponding to the edge direction θ(x, y). The complete algorithm is the following: 1. Generate the R-table of the desired contour. 2. Quantize the parametric space appropriately and initialize the accumulation matrix A(a, b). 3. Examine each pixel (x, y) of the contour of the gradient image with direction θ(x, y). Using θ(xi, yi), select the pairs (Ri, αi) from the R-table and, with the previous equations, calculate the presumed reference point (ai, bi) of the contour to be searched. Increment the accumulation matrix: A(ai, bi) = A(ai, bi) + 1. 4. Examine the accumulation matrix A by selecting the most significant peak or peaks, representing the potential reference points associated with the desired contour. In summary, the generalized Hough transform can be used to extract contours of objects with complex geometric shapes by comparing the object with its previously acquired model in the parametric space. The previous algorithm can be modified in steps (3) and (4) to extract contours whose size has changed by a scale factor s with respect to the model and whose orientation has changed by an angle ψ with respect to the x-axis. This involves the extension of the parametric space with a 4D accumulation
matrix A(a, b, s, ψ). The equations of step (3) of the algorithm are then modified as follows:

ai = x − R(θ) · s · cos(α(θ) + ψ)    (5.46)
bi = y − R(θ) · s · sin(α(θ) + ψ)    (5.47)
A(ai, bi, s, ψ) = A(ai, bi, s, ψ) + 1    (5.48)
If, compared to the model, the object does not change scale (s = 1) and is not rotated (ψ = 0), the operations reduce to those described in the previous algorithm.
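A minimal sketch of the GHT is given below (Python/NumPy, not the authors' implementation, assuming s = 1 and ψ = 0): the first function builds the R-table from the model edges and their directions, the second votes the candidate reference points with Eq. (5.45). The quantization of the edge direction into n_bins is an assumption of the example.

```python
import numpy as np
from collections import defaultdict

def build_r_table(model_edges, model_dirs, ref_point, n_bins=36):
    """R-table: for each quantized edge direction theta, store the pairs (R, alpha)
    relating a boundary point C to the reference point P = (px, py) (Fig. 5.23)."""
    px, py = ref_point
    table = defaultdict(list)
    ys, xs = np.nonzero(model_edges)
    for x, y in zip(xs, ys):
        t_bin = int((model_dirs[y, x] % np.pi) / np.pi * n_bins) % n_bins
        dx, dy = x - px, y - py                           # vector from P to the boundary point
        table[t_bin].append((np.hypot(dx, dy), np.arctan2(dy, dx)))  # (R, alpha)
    return table

def ght_vote(edges, dirs, table, n_bins=36):
    """Vote the candidate reference points A(a, b) using Eq. (5.45)."""
    rows, cols = edges.shape
    acc = np.zeros((rows, cols), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        t_bin = int((dirs[y, x] % np.pi) / np.pi * n_bins) % n_bins
        for R, alpha in table.get(t_bin, []):
            a = int(round(x - R * np.cos(alpha)))         # Eq. (5.45)
            b = int(round(y - R * np.sin(alpha)))
            if 0 <= a < cols and 0 <= b < rows:
                acc[b, a] += 1
    return acc    # the highest peak locates the reference point of the model
```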
5.5.5.4 Final Considerations on the Hough Transform
(a) The Hough transform is a very robust method for the extraction of contours with linear geometric structures and curves that can be described analytically.
(b) It is also used for the extraction of contours with complex shapes that cannot be described analytically, by modeling a priori the shape of the contour in the parametric space through reference points associated with each contour.
(c) The robustness of the method concerns above all its independence from the noise possibly present in the image, which normally causes frequent interruptions in the contours.
(d) The Hough transform, unlike other contour extraction algorithms, is particularly effective for the extraction of contours in very noisy images. We recall its ability to identify elements of interrupted contours (often caused by noise).
(e) It is very effective even in the case of contours interrupted because of objects touching or partially overlapping, or because of uneven lighting conditions.
(f) From the implementation point of view, the Hough transform requires considerable memory capacity for the management of the accumulation matrix, and considerable computational resources for the calculation of the accumulation matrix and for the analysis of the peaks, especially when the number of parameters is high.
(g) A real-time version of the transform can be realized with multi-processor systems, restructuring the algorithm according to a suitable parallelism model.
(h) The Hough transform is also used to validate edge-linking algorithms for the problem of completing contours with broken lines. This is achieved by superimposing the complete contour reconstructed with the Hough transform on the interrupted boundary of the input image.
(i) Finally, for some applications, resolving the interrupted lines of the boundary with the Hough transform gives more effective results than edge tracking algorithms, which do not always perform well.
5.5.5.5 Randomized Hough Transform - RHT The Randomized Hough Transform [11] is a probabilistic variant of the original Hough transform (HT). RHT differs from HT in that it tries to minimize the computationally expensive voting operations performed for each non-zero pixel of the image,
exploiting the geometric properties of analytical curves, thereby improving the efficiency and reducing the memory requirements of the algorithm. Although HT has been widely used for extracting curves in images, it has two limitations. First, for each non-zero pixel of the image, both the parameters of the existing curve and redundant ones are accumulated during the voting procedure. Second, the accumulation matrix is heuristically predefined: greater accuracy in the extraction of a curve implies defining the parameters with high resolution. These two requirements generally involve a large memory occupation for the accumulation matrix, with a consequent increase in computation time and reduction in speed, which becomes critical for real applications. RHT attempts to mitigate these limitations. RHT exploits the fact that some analytic curves can be completely determined by a certain number of points on the curve. For example, a straight line can be determined by two points, and an ellipse (or a circle) can be determined by three points. RHT randomly selects n pixels from the image and attempts to fit them with a parameterized curve. If these pixels fit the curve within a predefined tolerance, the curve is added to the accumulator by assigning it a vote. Once a set of pixels has been selected, the curves with the best score are chosen from the accumulator and their parameters are used to represent a curve in the image. Since only a small random subset of pixels is selected, this method reduces the memory and computation time required to detect the curves of an image. With RHT, if a curve in the accumulator is similar to the curve being tested, the curve parameters are averaged and the new averaged curve replaces the one in the accumulator. This reduces the difficulty of finding local maxima in the Hough space, since a single point in the Hough space represents a curve, instead of a local maximum having to be determined. RHT has been successfully tested in various applications for the extraction of objects with elliptical and non-elliptical contours. As an alternative to RHT, the RANSAC (RANdom SAmple Consensus) algorithm [12] can be used. In this case, the pixels that must satisfy the type of curve to be extracted are randomly selected, and then the other pixels falling on that curve are tested. Curves with a high number of inliers (i.e., pixels whose distribution can be characterized by the parameters of the hypothesized curve) are selected as detected and accepted, as opposed to those characterized by outliers, i.e., pixels that do not fit the hypothesized curve. The approach is not deterministic, but it produces a probabilistically acceptable result as the number of iterations grows. RANSAC has the advantage of not needing accumulation memory, even if the problem remains of formulating the different model hypotheses of the curves to be detected, since it is not possible to choose a peak in the Hough space.
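The following is a minimal RANSAC sketch for the simplest case of a straight line fitted to a set of 2D edge points; the number of iterations and the inlier distance threshold are assumed parameters, to be tuned for the application.

```python
import numpy as np

def ransac_line(points, n_iter=500, dist_thresh=2.0, rng=None):
    """Fit a line to 2D edge points (an (N, 2) array) by random sampling and consensus.
    Returns (a, b, c) for the line a*x + b*y + c = 0 and the boolean inlier mask."""
    rng = np.random.default_rng(rng)
    best_inliers, best_line = None, None
    for _ in range(n_iter):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        d = p2 - p1
        norm = np.hypot(*d)
        if norm == 0:
            continue
        # unit normal form of the line through p1 and p2
        a, b = d[1] / norm, -d[0] / norm
        c = -(a * p1[0] + b * p1[1])
        dist = np.abs(points @ np.array([a, b]) + c)      # point-to-line distances
        inliers = dist < dist_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_line = inliers, (a, b, c)  # keep the largest consensus set
    return best_line, best_inliers
```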
5.6 Region Based Segmentation The algorithms studied above had the objective of segmenting an image by directly extracting the contours associated with the objects of the scene, or by determining optimal thresholds from the analysis of the histogram. The use of these algorithms is limited by the presence of noise, which influences the extraction of the edges and the determination of the thresholds. Furthermore, no spatial information about the pixels was used. Region-based segmentation is an alternative approach, also suitable in the context of very noisy images. In Sect. 5.3 we have already described the general approach of region-based segmentation, namely how to partition an image into homogeneous groups of pixels representing a particular property. Region-based segmentation is characterized by the way in which similar pixels are aggregated. The segmentation methods are classified into 3 distinct categories: region growing, region splitting, and region split-and-merge. The region growing methods start from single pixels of the image and trigger a process of aggregation of neighboring pixels based on similarity properties. The region splitting methods instead use the opposite approach, i.e., they start by dividing the image into large regions and then trigger a process of subdivision based on the diversity of the pixels. The split-and-merge methods are a combination of the two previous methods and, in theory, can benefit from the advantages of both.
5.6.1 Region-Growing Segmentation The basic idea of this approach is to initially divide the image into different regions (seed regions) that fully satisfy a given homogeneity criterion. The criterion (predicate) of homogeneity can be based on one or more characteristics of the regions: gray level, texture, color, statistical information, geometric model, and so on. Subsequently, the initial homogeneous regions are analyzed by applying a homogeneity criterion to decide whether or not to include the pixels of the neighboring area between regions. A general outline of the region growing algorithm is the following:

1. The input image f(i, j) is subdivided into n small regions Rk which satisfy a predefined homogeneity predicate: P(Rk) = True, k = 1, ..., n.
2. A heuristic criterion is used to merge two adjacent regions.
3. Merge all adjacent regions that meet the criterion defined in (2). If all the regions satisfy the following conditions:

P(Rk ∪ Rl) = False with k ≠ l and Rk adjacent to Rl    (5.49)
P(Rk) = True, k = 1, ..., n    (5.50)

then the segmentation procedure is completed. From this scheme it is evident that the results of the segmentation depend on how the image is initially divided into small regions, on the fusion criterion of the regions, and on
the order in which the regions themselves are analyzed for their growth. In fact, due to the sequential nature of the algorithm, it can happen that two almost homogeneous adjacent regions Rk and Rl cannot be merged into a single region if another region was previously merged into Rk, significantly modifying its characteristics and thus preventing the merger with the region Rl. Generally, the initial image is divided into small regions of 2 × 2, 4 × 4 or 8 × 8 pixels, in relation to the type of application. The homogeneity criteria can be different, even if the most widespread is the one based on the statistical information of the gray levels. In the latter case, the histogram of each region is calculated. The characteristics of a region are compared with those of the neighboring regions. If, within certain limits, these characteristics are comparable with those of adjacent regions, the adjacent regions are merged into a single region for which the new features are calculated. Otherwise, the regions are coded as not homogeneous and are marked with appropriate labels. This fusion process is repeated for all adjacent regions, including the new ones, until all the regions are labeled. Let us now analyze some of the possible region growing methods based on the aggregation or fusion of elementary regions. The previous algorithm would be extended as follows: 1. Choice of the seed pixels. In relation to the type of image, it is decided into how many regions to partition it, that is to say the number n of seed pixels and their positions Si(x, y) in the corresponding regions. 2. Check of the similarity criterion between pixels. A similarity criterion P(Ri) applied to the region Ri can be based on the difference between the value of the seed pixel Si and its 8-connected neighbors:

P(Ri) = True if |zk − Si| < T or |μ(Ri) − Si| < T
(5.51)
where T is a threshold chosen on a heuristic basis, for example considering the difference in intensity between neighboring pixels and seed pixels (T = c(max z − min z), 0 < c < 1), or the difference between the average value μi of the region Ri and the value of the seed pixel. If the predicate (5.51) is satisfied, the neighboring pixels are aggregated to the region Ri. 3. Redefinition of the seed pixel Si(x, y), on the basis of the region Ri modified with the newly aggregated pixels; the average μi of the modified region Ri is also recalculated. 4. Repeat steps 2 and 3. The iteration continues until all the pixels of the image have been assigned to the regions, that is, until no other pixel satisfies the predicate (5.51). The approach described depends on how the user chooses the threshold and the seed pixels, which should be representative of reasonably uniform regions (evaluating color or gray levels). Figure 5.24 shows the result of this segmentation algorithm (a minimal code sketch of the procedure is given at the end of this section). To mitigate possible fragmentation problems, it is possible to use criteria to merge adjacent regions, in particular when the boundary between them is not well evident. An algorithm that merges adjacent regions with weak contours, also in relation to the length of the contour itself, is the following:
Fig. 5.24 Segmentation through the growth of the regions. From left to right: image to be segmented with the 4 selected seeds, followed by the 4 extracted regions corresponding to the 4 seeds using thresholds with T = 0.05–0.2, and final result
1. Calculation of the elementary regions. These are the starting conditions of the region growing algorithm. The elementary regions are formed by aggregating pixels with a constant gray value according to 4-connectivity or 8-connectivity.
2. Fusion of adjacent regions. This is accomplished by applying the following two heuristic fusion criteria:
(a) Merge the adjacent regions Ri and Rj if

W / min(Pi, Pj) ≥ S2    (5.52)

where W represents the length of the weak part of the border between the two regions, i.e., the part made up of pixels with insignificant differences with respect to a predefined threshold S1 (|f(xi) − f(xj)| < S1), Pi and Pj are the perimeters of the regions Ri and Rj, and S2 is another predefined threshold that controls the size of the regions to be merged. S2 is normally chosen with a value of 0.5; this prevents the merging of regions of comparable size, while favoring the inclusion of small regions into larger ones.
(b) Merge two adjacent regions Ri and Rj if

W / P ≥ S3    (5.53)

where P is the total length of the common border between the two regions, W is the length of its weak part as defined above, and S3 is a third predefined threshold, typically 0.75. The fusion of the two regions takes place only if the common border is essentially made up of pixels with insignificant differences.
This region growing method produces acceptable results for images with little texture and not very complex scenes. Although region growing algorithms are conceptually simple, they have some disadvantages: the choice of the threshold happens by trial and error, the starting seeds must be defined, the assumption of homogeneity is not always guaranteed, and they cannot be applied to images whose regions have highly variable intensity (due to shadows or abundant texture).
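The following is a minimal Python sketch of the seeded region growing procedure of steps 1–4 above, using the predicate (5.51) in the form |I(x, y) − μ(Ri)| < T with 4-connectivity; the seeds and the threshold T are assumed to be given.

```python
import numpy as np
from collections import deque

def region_growing(image, seeds, T):
    """Seeded region growing with 4-connectivity.
    A neighbor is aggregated while |I(y, x) - mean(region)| < T, as in predicate (5.51)."""
    rows, cols = image.shape
    labels = np.zeros((rows, cols), dtype=np.int32)
    for k, (sy, sx) in enumerate(seeds, start=1):
        region_sum, region_count = float(image[sy, sx]), 1
        labels[sy, sx] = k
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols and labels[ny, nx] == 0:
                    mean = region_sum / region_count          # running region mean
                    if abs(float(image[ny, nx]) - mean) < T:  # homogeneity predicate
                        labels[ny, nx] = k
                        region_sum += float(image[ny, nx])
                        region_count += 1
                        queue.append((ny, nx))
    return labels
```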
5.6.2 Region-Splitting Segmentation The algorithms based on the splitting of non-homogeneous regions operate in the opposite way to the region growing ones. These algorithms initially consider the entire image as a single region, for which the condition P(R) = True is not satisfied. Subsequently, each region is sequentially divided so as to satisfy the following conditions:

1. f = R = R1 ∪ R2 ∪ ... ∪ Rn
2. Ri ∩ Rj = ∅, 1 ≤ i, j ≤ n and i ≠ j
3. P(Ri) = True, i = 1, ..., n
4. P(Ri ∪ Rj) = False, 1 ≤ i, j ≤ n and i ≠ j, with Ri adjacent to Rj
This simple segmentation method satisfies conditions 1 and 2 above, while the third condition is not generally satisfied; the algorithm therefore continues the decomposition process to satisfy condition 3, and ends when condition 4 is also realized. Even if the same homogeneity criterion P is used, the merging and splitting algorithms can produce different results. As an example, consider a "chessboard" image consisting of a regular set of black and white square regions. If the average gray level is adopted as the homogeneity criterion, applying a segmentation process based on the subdivision of the regions the image is not segmented at all, because the condition P(R) = True is immediately satisfied when applied to the entire image, whose average gray level is assumed as the representative value of every pixel. A segmentation based on fusion, on the other hand, produces correct results, since the same homogeneity criterion, applied starting from the single pixels, leads to the growth of the regions as black or white squares, and the process ends when the segmentation coincides with the chessboard. From this it follows that the merging and splitting algorithms are not always dual. The appropriate choice of the segmentation method to be adopted depends on the type of application, considering the level of complexity of the image (amount of texture, overlapping objects, presence of shadows that alter the intrinsic uniformity of the objects). In summary, a segmentation algorithm based on the iterated decomposition into small regions is the following: 1. Initialize the starting conditions considering the whole image as a single region. 2. Process each region recursively as follows: 2.1 Decompose the current region into equal parts, for example into four subregions; 2.2 Check whether condition 3 (see above) is satisfied, i.e., whether the estimated predicate is true for each of the new regions created in 2.1; 2.3 Continue until all the regions satisfy the predicate P, that is, condition 3; 2.4 Terminate the algorithm when all 4 conditions have been met.
For different applications, and also to handle the previous chessboard example, the variance of the intensity values of the regions can be used as a homogeneity criterion, or the difference between the maximum and minimum pixel values of a region, or other homogeneity criteria. If the variance is above a certain threshold value, the region being processed is divided appropriately. Another, more complex criterion is to compare the gray-level histogram of each region with a predefined distribution and produce a similarity measure. In general, splitting-based segmentation approaches are more complex than growing-based ones. A representation of the images using a quadtree (see Chap. 7 Vol. I) is a good data organization to implement region splitting segmentation algorithms. A quadtree is a tree whose root is the whole image and in which each node has four descendants, that is, it decomposes the image into four quadrants (regions) if the homogeneity criterion gives P(Ri) = False. Each quadrant is in turn recursively decomposed into four quadrants, until the homogeneity criterion is satisfied for each terminal node (leaf). The terminal nodes of the quadtree consist of square sub-images of different sizes. A segmentation of the image obtained by organizing the regions in a quadtree can produce a blocking effect, as already seen for the cosine transform (DCT) in image compression. Finally, one drawback of the region splitting methods is that the final segmentation of the image can consist of adjacent regions with identical characteristics, thus obtaining an unwanted fragmentation of the image. In this case it is necessary to merge these regions into a single region by introducing a merging procedure into the splitting algorithm. Figure 5.25 shows the results of the splitting algorithm applied to a complex image, organizing the regions with a quadtree hierarchy. The homogeneity criterion used is the absolute value of the difference between the maximum and minimum values of the image region.
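A minimal sketch of recursive (quadtree-style) region splitting is given below, using the max–min homogeneity criterion mentioned above; the threshold and the minimum block size are assumptions of the example.

```python
import numpy as np

def split_regions(image, threshold, min_size=2):
    """Recursive quadtree splitting: a block is split into four quadrants
    while the max - min of its intensities exceeds the threshold."""
    leaves = []                                     # list of homogeneous blocks (y, x, h, w)

    def split(y, x, h, w):
        block = image[y:y + h, x:x + w]
        if (float(block.max()) - float(block.min()) <= threshold) or h <= min_size or w <= min_size:
            leaves.append((y, x, h, w))             # homogeneous (or minimal) leaf node
            return
        h2, w2 = h // 2, w // 2                     # decompose into four quadrants
        split(y, x, h2, w2)
        split(y, x + w2, h2, w - w2)
        split(y + h2, x, h - h2, w2)
        split(y + h2, x + w2, h - h2, w - w2)

    split(0, 0, *image.shape)
    return leaves
```

A merging pass over adjacent leaves with comparable statistics can then remove the residual fragmentation, as discussed above.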
5.6.3 Split-and-Merge Image Segmentation This segmentation method combines the advantages of the region splitting and region merging techniques described above. The split-and-merge segmentation method eliminates the
Fig. 5.25 Example of region splitting segmentation: a Original image; b Quadtree hierarchical organization to support segmentation; c Result of the region-splitting algorithm; d Result of the merging algorithm of the homogeneous regions applied to the intermediate result (c)
Fig. 5.26 Segmentation by splitting and merging algorithm: a Original image; b Hierarchical organization quadtree to support segmentation; c Contours of the image after the intermediate result of region-splitting segmentation; d Result of the merging algorithm of the homogeneous regions applied to the intermediate result (c)
drawback of the identical adjacent regions obtained with the splitting method. In fact, during the segmentation process, the splitting and merging activities can cooperate. The merging activity is performed after the splitting one, to merge adjacent regions into a single region whose pixels satisfy a homogeneity criterion: P(Ri ∪ Rk) = True. The complete sequence of a split-and-merge segmentation algorithm is as follows: 1. Define a homogeneity criterion P and consider the entire image as the single starting region R. 2. Split into four disjoint quadrants (regions) any region Ri for which P(Ri) = False. 3. Merge all adjacent regions Ri and Rj for which P(Ri ∪ Rj) = True. 4. Terminate the algorithm when no further splitting (step 2) or merging (step 3) occurs. Figure 5.25d shows the result of the split-and-merge algorithm for the image (a). Figure 5.26 shows the result of the segmentation for the peppers image, the hierarchical quadtree structure and the contours of the image, highlighting the complexity of the homogeneous regions to be segmented. The homogeneity criterion used is the absolute value of the difference between the maximum and minimum values of the image region. In the literature there are several segmentation algorithms with variants of the above scheme. The homogeneity criteria Pi used are different; the most common are listed below (a sketch of the corresponding predicate functions follows the list):
(a) P1(Ri) = True if |zj − μi| ≤ 2σi for all zj ∈ Ri (or for a percentage S of the set of pixels of Ri), where zj is the intensity level of the jth pixel of the region Ri, μi is the average of the gray levels of the region Ri, and σi is the standard deviation of the gray levels of Ri.
(b) P2(Ri) = True if max(Ri) − min(Ri) ≤ S1.
(c) P3(Ri) = True if E[(zj − μi)²] ≤ S2.
(d) P4(Ri) = True if the gray-level histogram Hi of Ri is unimodal, and P4(Ri) = False if Hi is multimodal.
(e) P5(Ri) = True if σ² < S3, where σ² is the variance of Ri.
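The criteria (a), (b), (c) and (e) map directly onto simple predicate functions; a hedged sketch follows (criterion (d), based on the modality of the histogram, would additionally require a mode-detection step and is omitted). The threshold names s1, s2, s3 and the fraction frac are assumptions of the example.

```python
import numpy as np

def p1(region, frac=1.0):
    """P1: |z - mean| <= 2*std for (at least a fraction 'frac' of) the pixels of the region."""
    z = region.astype(float).ravel()
    return np.mean(np.abs(z - z.mean()) <= 2 * z.std()) >= frac

def p2(region, s1):
    """P2: max - min of the region does not exceed the threshold S1."""
    return float(region.max()) - float(region.min()) <= s1

def p3(region, s2):
    """P3: mean squared deviation from the region mean below the threshold S2."""
    z = region.astype(float)
    return np.mean((z - z.mean()) ** 2) <= s2

def p5(region, s3):
    """P5: variance of the region below the threshold S3."""
    return region.astype(float).var() < s3
```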
5.7 Segmentation by Watershed Transform The watershed transform can be considered a region-based segmentation method whose foundations lie in mathematical morphology. It was introduced by Digabel and Lantuejoul [13] and then improved by Beucher and Lantuejoul [14]. Subsequently, several watershed algorithms were developed with different variants, both in terms of definition and of implementation, in order to reduce the computational complexity [15,16]. At an intuitive level, this method recalls the phenomenological aspects of river dynamics, or catchment basin dynamics. When a topographic surface is flooded by water, basins are formed starting from the regions of minimum. As the flooding continues, dams are built to prevent the merging of the basins formed from different regions of minimum. At the end of the flooding process, only the catchment basins and the dams (watershed lines) will emerge from the surface. In particular, the watershed lines separate the catchment basins, each of which contains one and only one of the minimum points. In simulating this process of fluvial dynamics for the segmentation of an image, two strategies can be adopted: one that finds the catchment basins and one that finds the watershed lines. In other words, the watershed transform, to segment an image, either labels the set of points belonging to the different basins or detects the lines of separation (the dams) between adjacent basins. There are many algorithms that implement the watershed transform, but, in general, they can be classified into two categories: the first based on the recursive algorithm that simulates the flooding, due to Vincent and Soille [17], the second based on Meyer's topological distance function [15].
5.7.1 Watershed Algorithm Based on Flooding Simulation A gray-level image can be seen as a landscape, that is, a topographic surface Z(x, y), where (x, y) indicates the spatial coordinates of the points on the surface and Z represents the value of the pixel, which in this case simulates the elevation. The flooding process is simulated by imagining that water rises from the bottom of the valleys (the water of each valley has its own label) through holes pierced at the corresponding minima and that, before the waters of different valleys can mix, a dam is built; the dam is a watershed line, which defines the contour between different regions. The image, seen as a 3D topographic surface (see Fig. 5.27), contains three types of points:
Fig. 5.27 Watershed segmentation simulating the flooding of a topographic surface. Imagining a hole at every local minimum, the water floods its catchment basin at uniform speed; as the water level rises, a dam is built to prevent the transfer between adjacent basins. The crests of the dams correspond to the watershed lines
1. Points that belong to a region of minimum. With M1, M2, ..., MR we indicate the sets of coordinates (x, y) of such points, with gray level Z(x, y). 2. Points where a falling drop of water would, with high probability, reach a single region of minimum (catchment basin). With C(Mi) we indicate the coordinates of the points of the catchment basin associated with the region of minimum Mi. 3. Points where a falling drop of water would fall with equal probability into more than one region of minimum. These points make up the lines called watershed lines, i.e., the lines of separation between regions (the crests of the topographic surface). In the flooding algorithm we also use T[n] to denote the set of coordinates (s, t) lying below the flood level n, that is, satisfying the following relation: T[n] = {(s, t) | Z(s, t) < n}
(5.54)
where n indicates the flood level, i.e., an intensity level of the image Z that varies from a minimum value min to a maximum value max. The objective of a segmentation algorithm based on the simulation of the flooding of the topographic surface (represented by the image Z(x, y)) from the minimum points is to detect the watershed lines separating the basins. The essential steps of the watershed algorithm are summarized as follows: 1. Find the minimum value min and the maximum value max of the image Z(x, y). Assign the coordinates (x, y) of the minima to the sets Mi. The topographic surface is flooded with unitary increments from the stage n = min + 1 to n = max + 1. At the flood stage n we denote by Cn(Mi) the set of coordinates (x, y) of the points of the catchment basin associated with the minimum Mi. At each flood stage the topographic surface is seen as a binary image. 2. Calculate

Cn(Mi) = C(Mi) ∩ T[n]    (5.55)

If (x, y) ∈ C(Mi) and (x, y) ∈ T[n], then Cn(Mi) = 1 at the position (x, y); otherwise Cn(Mi) = 0. At stage n the flooded catchment basins C[n] are given by the union of the basins Cn(Mi), that is:

C[n] = Cn(M1) ∪ Cn(M2) ∪ ... ∪ Cn(MR)    (5.56)
Increase the level n = n + 1. Initially we have C[min + 1] = T[min + 1]. At each flood step n, it is assumed that the set of coordinates C[n − 1] has been constructed; the goal is to obtain C[n] from that of the previous stage, C[n − 1]. 3. Find the set Q[n] of connected components of T[n]. For each connected component q ∈ Q[n], there are three possible conditions: (a) If q ∩ C[n − 1] is empty: i. a new minimum has been reached; ii. the connected component q is incorporated into C[n − 1] to form the new flooded basin C[n]. (b) If q ∩ C[n − 1] contains exactly one connected component of C[n − 1], the connected component q is included in C[n − 1] to form C[n], meaning that q lies inside the flooded basin of a single region of minimum. (c) If q ∩ C[n − 1] contains more than one connected component of C[n − 1]: i. a ridge separating two or more basins has been reached; ii. a dam must therefore be built inside the connected component q to prevent the transfer between the catchment basins. 4. Build C[n] according to (5.55) and (5.56) and set n = n + 1. 5. Repeat steps 3 and 4 until n reaches the value max + 1. The algorithm considered can simulate the process of flooding caused by water falling from above. Furthermore, the watershed algorithm can be applied not only directly to the image, but also to its gradient or to its distance transform. The watershed algorithm presented has the advantage of producing continuous watershed lines (connected paths are formed), that is, the segmentation produces regions whose contours are continuous, unlike other algorithms described above which produce interrupted contours. The disadvantages of the watershed algorithm are the over-segmentation problem (especially in the presence of noise) and the computational complexity. Figure 5.28c shows the result of the watershed transform applied to the gradient image (b) of the original image (a).
5.7.2 Watershed Algorithm Using Markers The problem of over-segmentation can be limited through an initial procedure that specifies, using markers, the regions of interest. The markers can be assigned manually or automatically, and have a function similar to the seeds used in the region growing algorithms. In this case a marker is a small connected component belonging to the image. Two types of markers are used:
Fig. 5.28 Segmentation by the watershed transform: a Original image; b Gradient image of (a); c Watershed transformation of the gradient image (b); d Points of the minimum regions extracted from the result (c) of the watershed transform; e Internal markers defined on the gradient image (b); f External markers obtained using the watershed transform to the binary image (d); g Modified gradient image obtained from the image of the internal markers (e) and external (f); h Final result of the segmentation obtained by applying the watershed transform to the image (g) and superimposing the original image
1. Internal markers, associated with a region (object) of interest. 2. External markers, associated with the image background. The selection of the markers is normally carried out in two phases: one of pre-processing and one of definition of the constraints that the markers must satisfy. The pre-processing involves filtering the image with a suitable smoothing filter, whose purpose is to reduce the irrelevant details of the image and consequently the number of potential minima that are the cause of the over-segmentation. An internal marker is defined with the following characteristics: (a) it represents a region surrounded by points of higher altitude; (b) the points of the region form a connected component; (c) all the points of the connected component have the same intensity. The watershed algorithm described above is applied to the filtered (smoothed) image, with the constraint that the minimum points are exclusively those corresponding to the internal markers selected with the characteristics described above. Figure 5.28h shows the results of the watershed algorithm using the markers, applied to the result of the watershed transform of Fig. 5.28c obtained previously. The internal markers are the lightest regions (Fig. 5.28e), while the watershed lines obtained are in fact the
external markers (Fig. 5.28f). The points of a watershed line lie along the highest points between neighboring markers. The external markers partition the image into different regions, each of which contains a single internal marker. The points of the external markers belong to the background. The goal of watershed segmentation with markers is then to reduce each of these regions to the pixels belonging to a single object (containing the internal marker) and to the background (delimited by the external markers). Figure 5.28g shows the gradient image (b) modified using the internal and external markers. Finally, applying the watershed transform to the image (g), we obtain the final result of the segmentation shown in Fig. 5.28h.
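In practice, marker-controlled watershed segmentation is often performed with library implementations; the following is a hedged sketch using SciPy and scikit-image (assumed to be available), which mirrors the scheme of Fig. 5.28: internal markers derived from a rough object mask, an external background marker, and the watershed applied to the gradient image. The distance threshold used for the background marker is an arbitrary assumption of the example.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import sobel
from skimage.segmentation import watershed

def marker_watershed(image, object_mask):
    """Marker-controlled watershed: 'object_mask' is a rough binary mask of the
    objects (e.g. from a threshold); the markers limit the over-segmentation."""
    gradient = sobel(image.astype(float))              # relief on which the flooding runs

    # internal markers: one label per connected component of the object mask
    internal, n_obj = ndi.label(object_mask)

    # external marker: background pixels far from every object (threshold is an assumption)
    background = ndi.distance_transform_edt(~object_mask.astype(bool)) > 10
    markers = internal.copy()
    markers[background] = n_obj + 1                    # single label for the background

    labels = watershed(gradient, markers)              # flood the gradient from the markers
    return labels
```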
5.8 Segmentation Using Clustering Algorithms Clustering algorithms (for example, K-means and mean-shift) are among the region-based segmentation methods. So far the segmentation process has taken place in the spatial domain by analyzing the pixel attributes to assess their level of homogeneity. With clustering algorithms, the level of pixel homogeneity is evaluated in the attribute space (that is, the space of the characteristics, in the literature often called features). An example is given by color and multispectral images, where the feature space is given by the color components (normally a 3D space) and by the N spectral components, with N > 3. Operating in the feature space, the goal of segmentation is to analyze how homogeneous pixels aggregate in this space (clusters) and how to evaluate the cluster separation, in order to label homogeneous pixels belonging to the same cluster. The advantage of operating in the feature space is immediate if the pixel attributes are discriminating (poorly correlated), generating clusters that are distant from one another in that space. The disadvantage is the loss of the original spatial information, that is, the information on the proximity of the pixels, useful for generating connected regions. In reality, with the threshold-based segmentation algorithms we have already worked in a feature space: for a gray-level image the attribute used is precisely the gray value, and the 1D feature space was constituted by the histogram of the gray-level frequencies. If we consider the simple image of Fig. 5.29, we can observe how the three objects are characterized exclusively by the gray level, and in the feature space it is possible to analyze how the gray levels representing the two objects (black and gray) and the background (white) aggregate. It is therefore necessary to determine, in the feature space, the center of each aggregation (the gray level) representative of the three objects, and to label each pixel of the image plane by evaluating which of the centers representative of the three objects is closest. The cluster centers are normally chosen as those that minimize the Euclidean distance, i.e., the sum of squared
Fig. 5.29 Segmentation using the algorithm of clustering K-means. The objects in the image are represented by three groups of pixels: with intensity 0 the black frame, with intensity 127 the gray frame and the background with light color of intensity 230. The segmentation in this example is based on the gray level attribute choosing the three centers (0, 127 and 230) as representative of the three objects, respectively the black, gray and light background
distances (SSD)² between all the pixels p and the center c_j of their nearest cluster:

SSD({p_i}, {c_j}) = Σ_{j=1}^{K} Σ_{i∈C_j} ‖p_i^{(j)} − c_j‖²     (5.57)

where p_i^{(j)} is the ith pixel belonging to the jth cluster C_j and c_j is the centroid of the jth cluster C_j. The centroid of a cluster can be taken as the average of the pixel vectors it contains:

c_j = (1/|C_j|) Σ_{i∈C_j} p_i     (5.58)
where we recall that |C_j| indicates the number of pixels in the jth cluster. In the context of image data, the vector p represents the attributes of the pixel, which can include intensity values, color (in the RGB, HSV, HSI, etc. spaces), spectral components, and so on, which are the features that characterize the pixels. Equation (5.57) suggests that pixel aggregation can be done either by knowing a priori the aggregation centers c_j, or by knowing a priori significant samples of each object, from which the relative center can be estimated by averaging each group of samples. In general, clusters are identified hierarchically (an approach that we will describe in Sect. 1.17 of Vol. III) or by partitioning. In a hierarchical (agglomerative or divisive) clustering algorithm the set of clusters is organized according to a hierarchical tree. A partitional clustering algorithm realizes a separation of the objects into non-overlapping subsets (clusters) in such a way that each object belongs to a single subset.
² This SSD of squared Euclidean distances coincides with the usual measure of match (similarity) formulated with the Sum of Squared Differences—SSD.
5.8.1 Segmentation Using K-Means Algorithm The most used partitional clustering algorithm is K-means clustering, which works well when the functional (5.57) is used as a convergence criterion with compact and isolated clusters. The key steps of the K-means algorithm are as follows: 1. Initialize the iteration counter: t = 1. 2. Randomly choose a set of K centroids, i.e., means c_j^t, j = 1, . . . , K. Alternatively, the centroids can be chosen randomly among the pixel vectors. 3. Compute with (5.57), for each pixel p_i, the squared distance from each centroid j = 1, . . . , K and assign the pixel p_i to the cluster C_j whose mean is closest. 4. Increase the counter t = t + 1 and update the centroids, thus obtaining a new set of means:

c_j^{(t)} = (1/|C_j^{(t)}|) Σ_{i∈C_j^{(t)}} p_i     (5.59)
5. Repeat steps 3 and 4 until the convergence criterion is reached (the functional (5.57) is minimized), i.e., no pixel is reassigned from one cluster to another, which corresponds to C_j(t) = C_j(t + 1) for each jth centroid. The advantage of this algorithm is its easy implementation and limited computational complexity O(n), where n is the number of pixel vectors. The disadvantage is the dependence of the results on the initial values of the centroids. Furthermore, even if convergence is guaranteed, reaching the global minimum in the least-squares sense is not. The convergence of the algorithm can be anticipated by checking in step 5 whether a certain (small) percentage of vectors migrating between clusters has been reached. Figure 5.30 shows the results of the K-means algorithm applied to a black and white image. In the literature there are several variants of the K-means algorithm, for example guiding the choice of the initial centroids or using other convergence criteria. For some applications it is also useful to apply splitting and merging algorithms downstream of the K-means algorithm.
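The steps above translate almost directly into code. The following is a minimal NumPy sketch (not an optimized or reference implementation) of K-means on the gray values of an image; the values of K, the number of iterations, and the synthetic test image are arbitrary choices for illustration.

# Minimal K-means sketch on pixel gray values (NumPy only).
import numpy as np

def kmeans_gray(image, K=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 1).astype(float)            # feature space: gray level
    centroids = rng.choice(pixels[:, 0], size=K)[:, None]  # step 2: random means
    for _ in range(n_iter):                                 # steps 3-5
        d2 = (pixels - centroids.T) ** 2                    # squared distances (Eq. 5.57)
        labels = np.argmin(d2, axis=1)                      # assign to the nearest mean
        new_centroids = np.array([pixels[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(K)])       # Eq. 5.59 update
        if np.allclose(new_centroids, centroids):           # convergence check
            break
        centroids = new_centroids
    return labels.reshape(image.shape), centroids

# Example on a synthetic image with three gray levels (cf. Fig. 5.29).
img = np.zeros((100, 100)) + 230
img[20:80, 20:80] = 127
img[40:60, 40:60] = 0
segmented, centers = kmeans_gray(img, K=3)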
5.8.2 Segmentation Using Mean-Shift Algorithm The K-means approach described in the previous paragraph uses a parametric model, that is, it assumes that the density to be estimated is represented by the superposition of elementary (Gaussian) distributions whose positions (centers) and shapes (covariances) are estimated iteratively. The mean-shift algorithm, instead, smooths the distribution of the data and finds its peaks and the corresponding regions. Mean-shift therefore models the distribution in a non-parametric way, that is, it makes no assumptions about the form of the distribution being analyzed.
Fig. 5.30 Application of the K-means algorithm to a color and a monochrome image. The images are partitioned into K = 3 and K = 5 clusters defined a priori
Let x_i, i = 1, . . . , n, be the points of a dataset in a d-dimensional space ℝ^d. In essence, we assume to have n observations (in ℝ^d), independent and identically distributed (IID), characterized by an unknown probability distribution f(x). In other words, we are assuming that the data derive from a stochastic process of which we have observations but whose probability distribution is unknown. The mean-shift algorithm aims to estimate a model of this distribution, useful for determining the groupings (clusters) of the observed data. We are therefore interested in identifying the prototypes of the possible clusters present in the multivariate data observed in real cases. Before this, it is necessary to estimate the density, and then identify the modes present. Figure 5.31b shows an example of grouping the pixels of the image of Fig. 5.31a in the RGB color space. The displayed distribution is in fact the graphical representation of the 3D histogram, which shows how the pixel population is distributed in the RGB space. The histogram estimates the PDF through the population of the pixels (the observed data) by computing their frequency in relation to the width h of the bin associated with each color component. The histogram that estimates the PDF can be seen as a non-parametric estimate, as it is defined only by the observations and the bin width. When instead we assume that the observations are modeled by a known distribution, such as the Gaussian, the PDF can be estimated by computing the Gaussian parameters, i.e., the mean μ̂ and the covariance matrix Σ̂, obtaining f̂ = f(μ̂, Σ̂) (parametric estimation).
Fig. 5.31 Segmentation using the mean-shift algorithm. a Original RGB image; b Distribution of pixels in RGB space before applying the mean-shift algorithm; c Pixel distribution in RGB space after applying the mean-shift algorithm; d The final result of the segmentation after 37 iterations with the parameters h_s = 30 and h_r = 32
Images, which normally derive from stochastic processes, are difficult to model, and a non-parametric estimate of the PDF is useful to approximate distributions of different shapes. This is especially necessary when the observable data are numerous (as in the case of multispectral images) and strongly dependent on the drift of the sensor physics. The best known non-parametric density estimation technique is that of Parzen³ windows [18]. The probability P that a vector x falls in a region R is given by:

P = ∫_R f(x′) dx′     (5.60)
If instead we have n samples as indicated above, the probability that k of the n fall in the region R is given by the binomial law,⁴ whose expected value is:

E[k] = nP     (5.61)
where the binomial distribution for k is strongly peaked around the mean value, with the consequence that the ratio k/n is an accurate estimate of P for n → ∞. If we now assume that f(x) is continuous and that the region R is so small that f does not vary inside it, we can write:

∫_R f(x′) dx′ ≈ f(x) V     (5.62)
with V the volume enclosed by R. Combining (5.60), (5.61) and (5.62) we arrive at the following estimate:

f(x) ≈ (k/n) / V     (5.63)

³ See par. 1.9.4 Vol. III for a complete description of the non-parametric classifier based on Parzen windows.
⁴ Under the condition that {x_1, . . . , x_n} are independent and identically distributed random variables.
There are, however, some practical and theoretical problems that lead to the use of Parzen windows. If we fix V and take more and more training samples, the ratio k/n converges in probability as desired, but in this case we would obtain only a space-averaged estimate of the density f(x):

P/V = ( ∫_R f(x′) dx′ ) / ( ∫_R dx′ )     (5.64)

If we want to obtain f(x) rather than its averaged version, we have to let V → 0. With this constraint it may happen that, if we fix the number of samples n and let V tend to zero, the region becomes so small that it contains no sample, producing the useless estimate f(x) ≈ 0. In the same way, it may happen that one or more of the samples coincide with x, making the estimate diverge to infinity. To overcome these drawbacks, we form a sequence of partitions of our feature space into regions R_1, R_2, . . . containing one or more samples; that is, the subscript identifies the number of samples that fall in the respective region, so R_1 has only one sample, R_2 has two, and so on. We indicate with V_n the volume enclosed by R_n, with k_n the number of samples falling in R_n, and with f_n(x) the nth estimate of f(x):

f_n(x) = (k_n/n) / V_n     (5.65)

For f_n(x) → f(x) three conditions must be verified:

lim_{n→∞} V_n = 0,     lim_{n→∞} k_n = ∞,     lim_{n→∞} k_n/n = 0

The first condition assures us that P/V converges to f(x), the second that the frequency ratio of (5.65) converges to P (for f(x) ≠ 0), and the third is necessary for the convergence of f_n(x). Parzen windows provide a way of obtaining sequences of regions satisfying these conditions, namely by reducing the volume as a function of the number of samples, V_n = 1/√n. Considering that in the feature space ℝ^d the volume V_n enclosing the region R_n is a hypercube of side h_n, we have:

V_n = h_n^d     (5.66)

The number k_n of samples that fall into the hypercube is obtained through the following window function:

φ(u) = { 1  if |u_j| ≤ 1/2,  j = 1, . . . , d
         0  otherwise }     (5.67)

which represents a window defined on the unit hypercube and centered at the origin. If a sample x_i falls into the hypercube centered at x, then φ((x − x_i)/h_n) = 1, otherwise it is zero. So the number of samples that fall into the hypercube is:

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n)     (5.68)
Substituting into (5.65) we get:

f_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n) = (1/n) Σ_{i=1}^{n} (1/h_n^d) φ((x − x_i)/h_n)     (5.69)
Equation (5.69) expresses an estimate of f, which we denote by f_n (or, equivalently, by f̂), as a mean of functions of x and of the samples x_i. The window function φ, indicating the kernel⁵ function, is used to interpolate the data distribution, i.e., each sample contributes to the estimate of f according to its distance from x. The kernel functions used in [19] are the Epanechnikov and the Gaussian kernels. The Epanechnikov kernel has the following profile:

k_E(x) = { 1 − x   if 0 ≤ x ≤ 1
           0       if x > 1 }     (5.70)

and its multivariate version in ℝ^d is given by:

K_E(x) = { (1/2) c_d^{-1} (d + 2)(1 − ‖x‖²)   if ‖x‖ ≤ 1
           0                                   otherwise }     (5.71)

with c_d the volume of the d-dimensional unit sphere⁶ and ‖•‖ the Euclidean norm. The normal profile is given by:

k_N(x) = exp(−x/2),   x ≥ 0     (5.72)

and its multivariate version in ℝ^d is given by:

K_N(x) = (2π)^{−d/2} exp(−‖x‖²/2)     (5.73)
So, rewriting (5.69) in terms of a generic profile k(•), we have:

f̂_{h,K}(x) = (c_{k,d}/(n h^d)) Σ_{i=1}^{n} k(‖(x − x_i)/h‖²)     (5.74)

which is represented in Fig. 5.32, where an estimate of the density f is reconstructed for a bimodal distribution. We are interested in searching for the modes of the distribution f, and therefore we evaluate where the gradient ∇f vanishes, hence:

∇̂f_{h,K}(x) = ∇f̂(x) = (2c_{k,d}/(n h^{d+2})) Σ_{i=1}^{n} (x − x_i) k′(‖(x − x_i)/h‖²)     (5.75)
We define the function

g(x) = −k′(x)
⁵ Hence the name of kernel density estimation (KDE). Often these basic kernel functions (indicating a profile) are indicated by the lowercase letter k(•).
⁶ In our experiments c_d = 1.
Fig. 5.32 Graphic representation in 1D space of the mean-shift algorithm. Given a set of points generated by two Gaussian distributions (black bars), the density function was approximated with (5.74) using a Gaussian kernel (5.73) with window width h = 1. Selecting any point from the set of samples (red cross), the mean-shift estimates the next position in the direction in which f increases, up to the value at which the gradient vanishes. Note that to find all the modes, the samples already visited must be removed and a new mode-search procedure must be initialized with a random sample never visited; the mean-shift ends when all the samples have been visited. Note also that in the graph the density f has been calculated for a better understanding of the mean-shift operation; in real situations the density f is unknown
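The setup described in the caption of Fig. 5.32 can be reproduced with a few lines. The sketch below is only an illustration, with synthetic bimodal data and the bandwidth h = 1 taken from the caption; it evaluates the kernel density estimate of Eq. (5.74) with the Gaussian profile of (5.72)-(5.73).

# 1D kernel density estimate with a Gaussian kernel (cf. Eq. 5.74 and Fig. 5.32).
import numpy as np

rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-2.0, 0.7, 200),    # bimodal toy data
                          rng.normal(3.0, 1.0, 300)])
h, d = 1.0, 1                                             # bandwidth and dimension

def kde(x, data, h):
    # f_hat(x) = c/(n h^d) * sum_i k(||(x - x_i)/h||^2), with k(u) = exp(-u/2)
    u = ((x[:, None] - data[None, :]) / h) ** 2
    c = (2 * np.pi) ** (-d / 2)
    return c / (len(data) * h ** d) * np.exp(-0.5 * u).sum(axis=1)

xs = np.linspace(-6, 8, 400)
density = kde(xs, samples, h)
print("location of the highest estimated mode:", xs[np.argmax(density)])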
which turns out to be the profile of the kernel G(x) = c_{g,d} g(‖x‖²), with c_{g,d} the normalization constant ensuring that the kernel integrates to 1. The kernel K(x) is called the shadow of G(x). The Epanechnikov kernel is the shadow of the uniform kernel,⁷ while the shadow of the normal kernel is again a normal kernel. Then, introducing g(x) into (5.75) and rearranging, we have:

∇̂f_{h,K}(x) = (2c_{k,d}/(n h^{d+2})) Σ_{i=1}^{n} (x_i − x) g(‖(x − x_i)/h‖²)
            = (2c_{k,d}/(n h^{d+2})) [ Σ_{i=1}^{n} g(‖(x − x_i)/h‖²) ] · [ ( Σ_{i=1}^{n} x_i g(‖(x − x_i)/h‖²) ) / ( Σ_{i=1}^{n} g(‖(x − x_i)/h‖²) ) − x ]     (5.76)

where the last factor in square brackets is the mean-shift vector m_{h,G}(x).
Considering (5.74), the first term is proportional to the estimate of the density f̂ at x computed with the kernel G:

f̂_{h,G}(x) = (c_{g,d}/(n h^d)) Σ_{i=1}^{n} g(‖(x − x_i)/h‖²)     (5.77)
⁷ In this case g(x) = 1.
The second term, instead, is the mean-shift vector m_{h,G}(x), which is the difference between the weighted average (using the kernel G as weights) and the vector x, center of the kernel (window):

m_{h,G}(x) = ( Σ_{i=1}^{n} x_i g(‖(x − x_i)/h‖²) ) / ( Σ_{i=1}^{n} g(‖(x − x_i)/h‖²) ) − x     (5.78)

Considering this, we can substitute (5.77) and (5.78) into (5.76), obtaining:

∇̂f_{h,K}(x) = f̂_{h,G}(x) (2c_{k,d}/(h² c_{g,d})) m_{h,G}(x)     (5.79)

from which the mean-shift vector (5.78) can be rewritten as follows:

m_{h,G}(x) = (1/2) h² c ∇̂f_{h,K}(x) / f̂_{h,G}(x)     (5.80)

Equation (5.80) shows that the mean-shift vector, calculated at position x with the kernel G, is proportional to the estimate of the normalized density gradient obtained with the Epanechnikov kernel K. It follows that it is possible to obtain a normalized gradient estimate by calculating the mean-shift with a Gaussian kernel centered at x. In other words, it has been shown that the modes of the PDF can be computed with mean-shift. The mean-shift vector is always oriented in the direction of maximum density increase (being aligned with the direction of the estimated local gradient); this makes it possible to define a path that leads to a local density maximum, that is, to a density mode (stationary point). Furthermore, as indicated above, the normalized gradient given by (5.80) introduces an adaptive behavior in the approach towards the stationary point, since the magnitude of the mean-shift vector is large in low-density regions (in the valleys) and decreases as x approaches the modal point. Figure 5.33 gives a graphical representation of the approach towards the highest density point of the data distribution (modal point) in the mean-shift algorithm. It can be observed that already the first mean-shift vector computed is oriented towards the region of high data density, and that the magnitude of the mean-shift vector decreases as the modal point is approached. The mean-shift procedure, based on this adaptive gradient-ascent method, can be summarized with Algorithm 1 for a Gaussian kernel. A sufficient condition for the convergence of the algorithm is given by the following theorem.

Theorem 5.1 If the kernel K has a convex and monotonically decreasing profile, the sequence of kernel centers {y_j}_{j=1,2,...} and the corresponding values of the density function {f̂(y_j)}_{j=1,2,...} converge; moreover, the sequence {f̂(y_j)}_{j=1,2,...} is monotonically increasing.

The proof of this theorem can be found in [19]. Of the two kernels mentioned, the one that produces the smoother trajectory is the Gaussian, at the price of a higher computational cost.
Fig. 5.33 The search phases for the local maximum in the regions of highest density of a data distribution. Having chosen the size h of the search window W_h(x) and the initial location x_t where the window is centered, the mean-shift vector m_h(x_t) is computed with the data included in the window; the search window W_h(x_t) is then translated and centered at the location x_{t+1} = x_t + m_h(x_t). The procedure is iterated until the vector m_h(x) assumes very small values and the final location coincides with the modal point
5.8.2.1 Application of Mean-Shift for Segmentation The mean-shift algorithm can be applied considering a set of d-dimensional input data {x_i}, i = 1, . . . , n, which can also represent the pixels of an image. The set of convergence points (modal points) is indicated with {z_i}_{i=1,...,n}, while {L_i}_{i=1,...,n} denotes the set of labels to be assigned to the points of the homogeneous regions. With these notations, the mean-shift segmentation procedure involves the following steps [19]: 1. For each index i = 1, . . . , n the mean-shift algorithm is applied and the convergence point is saved in z_i. 2. Identify the clusters {C_p}_{p=1,...,m} associated with the modal points by linking together those modal points that are closer than a given distance in the data domain. 3. For each point i = 1, . . . , n assign the label L_i = {p | z_i ∈ C_p}. In essence, all the points whose search windows converge to the same modal point are grouped into a cluster. 4. Optionally, regions that contain fewer than M points can be eliminated. To segment a color image, the color information and the position of the pixel in the image are used jointly. In particular, the spatial information x^s = (x, y), referred to as the spatial domain, is concatenated with the color information x^r, which is often a triple of values of a color space (see Chap. 3 on Color, Vol. I). The mean-shift algorithm is then applied in the resulting 5D domain. Since there may be differences in scale between
Algorithm 1 Mean-shift algorithm for a normal kernel
1: The samples from which the modes are computed are given: x_i, i = 1, . . . , n
2: repeat
3:    y_old ← random(x_i)        ⊳ Randomly select a starting point from the dataset
4:    repeat
5:        Compute the next kernel center:
              y_new = ( Σ_{i=1}^{n} x_i exp(−‖(y_old − x_i)/h‖²/2) ) / ( Σ_{i=1}^{n} exp(−‖(y_old − x_i)/h‖²/2) )
6:        if ‖y_new − y_old‖ < ε then
7:            break
8:        else
9:            y_old = y_new
10:       end if
11:   until ‖y_new − y_old‖ < ε
12: until all the samples x_i have been visited
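The update in step 5 can be transcribed directly into NumPy. The sketch below is illustrative only; the bandwidth, tolerance, and toy dataset are arbitrary choices, and each starting point is simply moved to the Gaussian-weighted mean of the samples until the shift becomes negligible.

# Sketch of the mean-shift iteration of Algorithm 1 (Gaussian weights).
import numpy as np

def mean_shift_mode(start, data, h, eps=1e-3, max_iter=100):
    y = start.astype(float)
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum(((y - data) / h) ** 2, axis=1))  # kernel weights g(.)
        y_new = (w[:, None] * data).sum(axis=0) / w.sum()          # weighted mean
        if np.linalg.norm(y_new - y) < eps:                        # convergence test
            return y_new
        y = y_new
    return y

# Toy 2D data with two clusters; every sample converges to one of two modes.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 0.5, (150, 2)),
                  rng.normal([4, 4], 0.5, (150, 2))])
modes = np.array([mean_shift_mode(x, data, h=1.0) for x in data])
print(np.round(modes, 1)[:5])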
color and spatial domains, the resulting kernel is modified appropriately [19]:

K_{h_s,h_r}(x) = (C/(h_s² h_r²)) k(‖x^s/h_s‖²) k(‖x^r/h_r‖²)     (5.81)

where k(•) is the common profile used for both domains, h_s and h_r are the kernel sizes for the two domains that control the behavior of the composite filter, and C is a normalization constant. The spatial parameter h_s influences the smoothing action of the mean-shift and the connectivity of the homogeneous regions, while the parameter h_r, linked to the nature of the data (feature domain), influences the number of homogeneous regions. The noise in the data affects the minimum region size to be accepted, even if the latter also depends on the geometry of the objects. Figure 5.31d shows the result of the segmentation with the mean-shift applied to the image of Fig. 5.31a, with the parameters (h_r, h_s) = (32, 20). The number of required iterations was 37 and all the regions detected were retained. The distribution in the RGB color space after applying the mean-shift algorithm is shown in Fig. 5.31c: the mean-shift algorithm has identified the significant clusters in the RGB color space. Better results can be obtained by using a space that better represents the characteristics of the nature of the data. For example, RGB color images have three non-linear components and the Euclidean metric may not be appropriate for assessing data homogeneity. To extract homogeneous regions in
Fig. 5.34 Application of the mean-shift algorithm to a color and a monochrome image; the panels show, for each image, the segmentations into 3, 5 and 7 classes, obtained with window sizes h_s = 50, 30, 20 and h_s = 55, 35, 20, respectively
color images, the most appropriate spaces are HSI, L*a*b* and L*u*v*, described in Chap. 3 Vol. I. Figure 5.34 shows the mean-shift results applied also to the monochrome image; for a qualitative comparison with the K-means algorithm, the segmentation results are displayed by extracting 3, 5 and 7 classes for both the color and the monochrome image. The h_r parameter is used to control the number of classes to be extracted. In the case of multispectral images the pixels are considered as d-dimensional vectors (d = s + p), representing the spatial domain with s dimensions and the spectral domain with p bands.
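One common way to imitate the joint kernel (5.81) is to rescale the spatial and range coordinates by their respective bandwidths before clustering with a single unit bandwidth. The sketch below is only an illustration of this idea: it assumes scikit-learn's MeanShift as the clustering backend, the values of h_s and h_r are arbitrary, and the toy image is random.

# Joint spatial-range mean-shift segmentation sketch (scikit-learn assumed).
import numpy as np
from sklearn.cluster import MeanShift

def mean_shift_segmentation(rgb, hs=20.0, hr=30.0):
    H, W, _ = rgb.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # 5D joint feature vectors, each part divided by its own bandwidth so that
    # a single unit bandwidth plays the role of (h_s, h_r) in Eq. (5.81).
    features = np.column_stack([xs.ravel() / hs, ys.ravel() / hs,
                                rgb.reshape(-1, 3) / hr])
    labels = MeanShift(bandwidth=1.0, bin_seeding=True).fit_predict(features)
    return labels.reshape(H, W)

rgb = np.random.default_rng(0).random((40, 40, 3)) * 255   # toy image
segments = mean_shift_segmentation(rgb)
print("regions found:", segments.max() + 1)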
5.8.2.2 Application of Mean-Shift for Filtering The mean-shift algorithm can also be used for image filtering, considering that the first part of the algorithm in fact produces a smoothing, by replacing the central pixel of the window with the weighted average value of the pixels included in the search window. Compared to traditional filtering (for example Gaussian filtering, linear filtering), mean-shift better preserves discontinuities, since in high-contrast areas (edges) the smoothing action is attenuated (non-linear filtering). If we denote the d-dimensional input data with {x_i}_{i=1,...,n} and with {z_i}_{i=1,...,n} the filtered points in the joint spatial-feature domain, the mean-shift filtering procedure is as follows [19]:
1. Initialize k = 1 and y_k = x_j.
2. Compute

      y_{k+1} = (1/n_k) Σ_{x_i ∈ S(y_k)} x_i,     k ← k + 1

   where S(y_k) is the search window centered at y_k containing n_k samples, until convergence (‖y_{k+1} − y_k‖ smaller than a threshold value; the convergence point is denoted y_conv).
3. Assign z_j = (x_j^s, y_conv^r).
5.8.2.3 Application of the Mean-Shift for Tracking Recently, the mean-shift algorithm has also been applied to the tracking (visual tracking) of objects (even non-rigid ones) moving in the context of robotics and remote surveillance [20–22]. The developed algorithms are based on the knowledge of a model of the target object in terms of features (for example color) and of its PDF (in feature space). Once defined, the target model is searched for in the sequence of time-varying images by evaluating, with a similarity function, the modal point of maximum similarity, using a suitably adapted mean-shift, considering that the target model can vary over time and also in scale. This requires an extension of the mean-shift algorithm to manage the size of the window h when the scale of the tracked object changes. An adaptive scale strategy is defined to manage the change of scale of the target over time: the mean-shift algorithm is modified by adapting the kernel scale h and selecting the value of h that reaches the maximum similarity. For visual tracking, the use of the mean-shift with scale-space is considered, using the Lindeberg theory [23] described in Chap. 6. 5.8.2.4 Conclusions The mean-shift algorithm has several strong points. It is transversal to different applications and to different types of data, even multidimensional ones. It does not use a priori knowledge of the data and from this point of view it can be considered an unsupervised clustering method (it does not require any knowledge of the cluster distribution). This feature distinguishes it from K-means, which needs to know beforehand the number of clusters to be extracted. The only parameter used in the mean-shift, h (the kernel size), has a physical meaning. The choice of h is not easy: inappropriate values of h can generate modal points that are insignificant or too numerous, which then have to be grouped. A solution could be the use of a window with adaptive dimensions. Mean-shift is robust in the management of outliers. The computational load of the original mean-shift is considerable, its complexity being O(T n²), where n indicates the number of points and T the number of iterations. K-means, on the other hand, has a lower computational load, with a complexity of O(knT), where k is the number of clusters.
References
1. K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys. 37, 1–79 (1947)
2. M. Loève, Probability Theory I, vol. I, 4th edn. (Springer, Berlin, 1977)
3. D.E. Knuth, Art of Computer Programming, Volume 1: Fundamental Algorithms, 7th edn. (Addison-Wesley, Boston, 1997). ISBN 0201896834
4. M.S.H. Khayal, A. Khan, S. Bashir, F.H. Khan, S. Aslam, Modified new algorithm for seed filling. Theor. Appl. Inf. Technol. 26(1) (2011)
5. T. Pavlidis, Contour filling in raster graphics. ACM Comput. Graph. 15(3), 29–36 (1981)
6. A. Distante, N. Veneziani, A two-pass algorithm for raster graphics. Comput. Graph. Image Process. 20, 288–295 (1982)
7. J. Illingworth, J. Kittler, A survey of efficient Hough transform methods. Comput. Vis., Graph., Image Process. 44(1), 87–116 (1988)
8. P.V.C. Hough, A method and means for recognising complex patterns. US Patent US3069654 (1962)
9. R.O. Duda, P.E. Hart, Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972)
10. D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. (Elsevier) 13(2), 111–122 (1981)
11. L. Xu, E. Oja, P. Kultanen, A new curve detection method: randomized Hough transform (RHT). Pattern Recognit. Lett. 11, 331–338 (1990)
12. M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
13. C. Lantuejoul, H. Digabel, Iterative algorithms, in Actes du Second Symposium Europeen d'Analyse Quantitative des Microstructures en Sciences des Materiaux, Biologie et Medecine, pp. 85–99 (1978)
14. S. Beucher, C. Lantuejoul, Use of watersheds in contour detection, in Proceedings of the Workshop on Image Processing, Real-Time Edge and Motion Detection (1979)
15. S. Beucher, F. Meyer, The morphological approach to segmentation: the watershed transformation, in Mathematical Morphology in Image Processing, pp. 433–481 (1993)
16. J. Serra, Image Analysis and Mathematical Morphology (Academic Press, Cambridge, 1982)
17. L. Vincent, P. Soille, Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13(6), 583–598 (1991)
18. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edn. (Wiley, Hoboken, 2001). ISBN 0471056693
19. D. Comaniciu, P. Meer, Mean-shift: a robust approach towards feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
20. G.R. Bradski, Computer vision face tracking for use in a perceptual user interface. Intel Technol. J. (1998)
21. P. Meer, D. Comaniciu, V. Ramesh, Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564–575 (2003)
22. J.G. Allen, R.Y.D. Xu, J.S. Jin, Object tracking using camshift algorithm and multiple quantized feature spaces, in Proceedings of the Pan-Sydney Area Workshop on Visual Information Processing, pp. 3–7 (2004)
23. T. Lindeberg, Scale-space theory: a basic tool for analysing structures at different scales. J. Appl. Stat. 21(2), 224–270 (1994)
6 Detectors and Descriptors of Interest Points
6.1 Introduction In Chap. 3 of Geometric Transformations, we introduced the concept of fiducial points (or control points) of an image, with respect to which an image is transformed geometrically and radiometrically. This was necessary because the geometric transformation function was not known, while the coordinates of the control points in the image to be transformed and of the corresponding control points in the transformed image were known. In general, in artificial vision applications, it is useful to automatically define points of interest—PoI in an image that can be well localized and detected. A point of interest, depending on the type of application, can be represented by an isolated point characterized by the local structure of its surroundings, whose information content can be expressed in terms of minimum or maximum intensity or of texture; by a point on a contour that has a maximum value of curvature; or by the end of a line. A point of interest must be stable in the image domain even under variations of the lighting conditions and of the scene observation point (with a possible change of the scale factor). In essence, its reproducibility must be robust. In the literature, the terms point of interest, key point, corner, and feature are often used interchangeably. Historically, for satellite or aerial image registration, small image windows (also known as blobs) have been used as points of interest, and the search algorithms have been named interest point operators. A mobile vehicle, for self-localization, extracts from the image of the scene points of interest adapted to the context, which normally correspond to junctions of lines or corners; in this case algorithms are developed for their automatic detection, also called corner detectors. In the latter case, the corners are detected by extracting the contours of objects present in the scene. These edge points, or corners of an object in the captured scene, represent structural elements for a multitude of applications ranging from the tracking of the same object in successive frames (tracking); for the determination of the correspondences between stereo images; for the location of a prototype object within a complex scene;
Fig. 6.1 Examples of detection of points of interest in some applications: a Combine multiple images of parts of a scene and create a panoramic image; b Autolocalize an autonomous vehicle during navigation; c Align images of the same scene that present distortions as they shoot from mobile platforms and at different times; d Detecting homologous points of interest in the pair of stereo images
to align scene images taken from different points of view or at different times; for the 3D reconstruction of the scene; to create panoramas from image sequences that have an adequate overlap between pairs of adjacent images (see Fig. 6.1); and for several other applications. These detectors of points of interest of an object must be robust, and therefore invariant, to geometric and radiometric transformations such as translation, scale, orientation and illumination variations, as well as to changes of the observation point. Several algorithms have been developed for the automatic detection of points of interest. Most corner detectors are based on the following principle: while an edge point is defined as a pixel of the image that exhibits a high gradient value in a preferential direction and a low one in the orthogonal direction, a corner is instead defined as an area that exhibits a high gradient in several directions simultaneously. Some techniques are based on this assumption and analyze the first- and second-order derivatives of the image in the x- and y-directions [1], or differences of Gaussians [2]. The richness of the various artificial vision applications has led to the development of various algorithms to solve the general problem of automatic detection and matching of points of interest. For simplicity, these algorithms can be grouped into three categories: (a) Detectors of significant points (Feature Detectors), which are points of interest, also called key points, corners, or features.
(b) Descriptors of significant points (Feature Descriptors), built from the pixels around the significant points, i.e., considering the pixels included in the windows (patches) centered on the significant points. (c) Algorithms to evaluate the correspondence of points of interest (Feature Matchers) based on their location and description. These algorithms must be repeatable, that is, they must detect the same set of significant points regardless of the geometric/radiometric relationship that relates the images. The points detected must exhibit a good distinguishable characteristic (distinctiveness) between them. Points must be detected with a good localization resolution, minimizing the number of false matches among the points detected (i.e., minimizing the number of points that appear similar but correspond to different physical points of the scene). The number of points detected must be appropriate to the application context, their location accurate in spatial terms, and the calculation times efficient, especially in the context of dynamic scenes. Finally, a good detector must be robust with respect to noise.
6.2 Point of Interest Detector—Moravec In 1977, Moravec [3], in the context of the autonomous navigation of the Stanford vehicle, introduced the concept of Points of Interest—PoI, intended as distinct regions in a sequence of images, to be used to then find the correspondence of these regions in the various images of the sequence. The objective was to locate the vehicle in the environment by referring to these points of interest. In fact, the Moravec operator is considered a corner detector, since points of interest are defined as those that show a high intensity variation in every direction around them. This is indeed the typical situation at corners, even though Moravec's intention was to find distinct regions in an image, search for them in consecutive images of a sequence, and register them. Let us now see how the local intensity variation at a point of the image is evaluated. Let P be a point of the image; a window W with a size of at least 3 × 3 is centered on it and is then translated by one pixel in the eight possible cardinal directions D = {North−N, NorthEast−NE, East−E, SouthEast−SE, South−S, SouthWest−SW, West−W, NorthWest−NW}, thus generating a new window (shifting window). The local intensity variation for a given translation, i.e., a similarity measure, is calculated by considering the Sum of Squared Differences—SSD of the pixels corresponding to these two windows. If we denote by S this similarity measure, and with A_{i,j} and B_{i,j}, respectively, the pixels of the window under examination and those of the window translated in one of the eight directions, we obtain:

S = Σ_{i,j} (A_{i,j} − B_{i,j})²     (6.1)
Fig. 6.2 Detection of points of interest: calculation of the local intensity variation in the 8 directions D. In the figure on the left, the similarity measure S_{NE} for the blue pixel is calculated with the translation of the shifting window by 1 pixel in the direction NE, while in the figure on the right the translation is in the direction SE and the measure S_{SE} is calculated
Fig. 6.3 Detection of points of interest in three types of regions: a corner, where there are significant variations in intensity; b edge, where there are significant variations of intensity perpendicular to the edge and almost zero along the same edge; c uniform, where there are only weak intensity variations in all directions
where i, j index the pixels of the window under examination and of the translated window. Figure 6.2 displays the window under examination (blue frame) and the window translated diagonally by 1 pixel (red frame), for two examples of the calculation of the similarity value. Each point P considered is assigned a similarity value, i.e., a measure of cornerness, equal to the minimum of all the values S calculated with (6.1) between the window under examination A_{i,j} and the window B_{i,j} translated in the 8 directions D. As a final result, we get a cornerness map. The cornerness measure is simply a number indicating the degree to which the corner detector evaluates that the pixel is a corner. Figure 6.3 shows the Moravec operator in action in 3 different contexts, highlighting that it actually acts as a corner detector. In figure (a), the window in question is centered on a corner, and translating the shifting window there are significant variations in any direction. In (b), the window is centered on a vertical edge (similarly it would result for a horizontal edge) and in this case there are no significant intensity variations in the direction of the edge, while there are high variations
with translations perpendicular to the edge. In (c), instead, the window in question is in a homogeneous region and consequently there are no significant intensity variations when moving the shifting window in any direction. The Moravec algorithm for the detection of points of interest can be summarized as follows: 1. For each pixel (x, y) in the input image I, the intensity variation (i.e., the similarity value) is computed between the region under examination W (normally a square window with a minimum size of at least 3 pixels) and the shifting window translated in each of the cardinal directions (u, v) ∈ D:

S(x, y; u, v) = Σ_{(i,j)∈W} [I(x + u + i, y + v + j) − I(x + i, y + j)]²     (6.2)
where the cardinal directions, written explicitly, are D = {(1, 0), (1, 1), (0, 1), (−1, 1), (−1, 0), (−1, −1), (0, −1), (1, −1)}. 2. Generation of the cornerness map. For each pixel (x, y) a measure of cornerness C(x, y) is evaluated, considering the similarity measures obtained with (6.2):

C(x, y) = min_{(u,v)∈D} S(x, y; u, v)     (6.3)
3. Application of a threshold T to the cornerness map C(x, y), setting to zero all the pixels with a cornerness measure below an appropriate threshold T, obtaining the thresholded map C_T(x, y):

C_T(x, y) = { 0          if C(x, y) ≤ T
              C(x, y)    otherwise }     (6.4)

The corner points in the cornerness map are those with a cornerness measure corresponding to a local maximum. The local maxima are normally numerous, with relatively low cornerness values, and not all of them correspond to physically real corners. One way to reduce these false corners is to apply a threshold T and set to zero the points of the map below the threshold value. The threshold T is chosen in relation to the application context, following a trial-and-error approach. The threshold must be high enough to remove local maxima that are not real corners, but low enough to keep the local maxima corresponding to real corners. In practice, it is rare to find a threshold value that removes all the false corners and keeps all the real ones; in essence, a compromise value must be chosen in relation to the type of application. 4. Application of the non-maximum suppression (NMS) procedure. The thresholded cornerness map C_T(x, y), obtained with the previous step, contains nonzero values associated with the candidate local-maximum corners. The NMS procedure processes each point of the map C_T(x, y), setting to zero those points whose cornerness measure is not larger than the cornerness measures of the points that fall within a certain radius r. With this procedure we obtain the C_NMS(x, y)
Fig. 6.4 Results of the Moravec operator applied to a synthetic test image and to the real image houses
cornerness map in which all the nonzero points are considered local-maximum corners:

C_NMS(x, y) = { 1   if C_T(x, y) ≥ C_T(p, q), ∀(p, q) ∈ W(x, y)
                0   otherwise }     (6.5)

where W indicates a window centered at the pixel under examination (x, y) and (p, q) are the pixels examined in this window within a radius r. Figure 6.4 shows the results of the Moravec algorithm applied to the synthetic test image and to the real image houses. To assess the level of similarity a 3 × 3 window was considered, and a threshold was used to highlight the most significant corners. The number of highlighted corners depends on the manually chosen threshold T. As highlighted above, low threshold values detect points of interest with less chance of repeatability in subsequent images, while high threshold values result in fewer corners but with the advantage of better repeatability.
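The four steps can be prototyped in a few lines of NumPy. The sketch below is illustrative only; the window size, threshold, and NMS radius are arbitrary choices, and the image shifts wrap around at the borders for simplicity.

# Sketch of the Moravec corner detector (Eqs. 6.2-6.5).
import numpy as np

def moravec(img, window=3, threshold=500.0, nms_radius=5):
    img = img.astype(float)
    H, W = img.shape
    r = window // 2
    shifts = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
    C = np.full((H, W), np.inf)
    for u, v in shifts:                                   # Eq. 6.2 for each shift
        diff = (np.roll(img, (-v, -u), axis=(0, 1)) - img) ** 2  # wraps at borders
        S = np.zeros_like(img)
        for dy in range(-r, r + 1):                       # shift-and-add box sum over W
            for dx in range(-r, r + 1):
                S += np.roll(diff, (dy, dx), axis=(0, 1))
        C = np.minimum(C, S)                              # Eq. 6.3: minimum over shifts
    C[C <= threshold] = 0                                 # Eq. 6.4: thresholding
    corners = []
    for y, x in zip(*np.nonzero(C)):                      # Eq. 6.5: simple NMS
        y0, y1 = max(0, y - nms_radius), y + nms_radius + 1
        x0, x1 = max(0, x - nms_radius), x + nms_radius + 1
        if C[y, x] >= C[y0:y1, x0:x1].max():
            corners.append((y, x))
    return C, corners

C, pts = moravec(np.random.default_rng(0).random((64, 64)) * 255)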
6.2.1 Limitations of the Moravec Operator 6.2.1.1 Operator with an Anisotropic Response One of the limitations of this approach is the anisotropic response due to the calculation of the intensity variation (similarity measure) only for limited directional translations (2 horizontal, 2 vertical, and 4 diagonal). It follows that the operator is not invariant with respect to the rotation (see Fig. 6.5). This results in the nonrepeatability of the corners in subsequent images.
6.2.1.2 Operator with Noisy Response Another drawback is given by the square, binary window used, which assumes a constant Euclidean distance between the pixel under examination and those at the edges of the window, when in reality the actual distances are different (in particular,
Fig. 6.5 The anisotropy of the Moravec operator generates the non-repeatability of the points of interest with the rotation of the image. The example shows the results of the operator applied to the test image rotated by 35° and by 45°
the distances along the diagonals). A circular window would reduce the differences in Euclidean distance. The binary window also imposes that pixels at different distances from the center have equal weight when, in reality, pixels farther from the center should normally have less influence in the calculation of the intensity variation. This limitation increases with the size of the square window. To mitigate it, under the assumption of attributing less weight to the pixels farthest from the center of the window, the local intensity variation is calculated using a Gaussian window. In this way, the pixels closest to the center of the window have a greater importance (according to the standard deviation of the Gaussian) in the estimation of the local intensity variation; moreover, with the Gaussian function the window acquires the desired property of circular symmetry (at the extreme edges of the Gaussian window the weights are very small, see Sect. 9.12.6 Vol. I and Fig. 6.6). The intensity variation, previously calculated with (6.2), is instead calculated as:

S(x, y; u, v) = Σ_{(i,j)∈W} w(i, j) [I(x + u + i, y + v + j) − I(x + i, y + j)]²     (6.6)
where w(i, j) is the matrix of the weights modeled according to the Gaussian function.
6.2.1.3 Operator Very Sensitive to False Edges Edges are normally detected at image pixels with a high gradient. In the presence of noise, the Moravec operator, relying only on the gradient and choosing points of interest by considering only the minimum value of S(x, y), can actually detect false corners. Furthermore, isolated pixels due to noise are erroneously detected as corners. On the contrary, real corners that are poorly contrasted in the input image may not be detected. Finally, points on edges oriented in a direction different from the considered directions D could be detected as corners.
Fig. 6.6 The weight function w(i, j) of the intensity variations: binary and Gaussian
6.3 Point of Interest Detector—Harris–Stephens The Harris and Stephens operator [1], developed in 1988, tries to overcome the limits, described in the previous paragraph, of the Moravec operator (currently considered obsolete). In particular, it improves the repeatability of the detected corners, needed to carry out their tracking in sequences of consecutive images. The Harris and Stephens operator is based on the fact that the points of interest (e.g., key points) exhibit a high value of the gradient in an isotropic manner (in more than one direction). In addition, it does not consider only the classic edge points, as they exhibit a high gradient value only in limited directions. These limits are removed by introducing a function that allows the local intensity variation to be evaluated in any direction. Harris and Stephens develop an analytical formulation of the Moravec operator to derive this function. Instead of calculating the similarity function only in the directions considered by the Moravec operator, we consider the first-order Taylor expansion of the first term in square brackets of (6.6), which we rewrite in a more compact way:

S(x, y) = Σ_{(u,v)∈W} w(u, v) [I(x + u, y + v) − I(u, v)]²     (6.7)
where the intensity variation S(x, y) is calculated for small shifts (x, y) with respect to the point under examination (u, v) of the image I. Let us now see how (6.7) can be formalized, for small shifts around the point under examination, to detect significant local variations in the image I. We consider the Taylor series expansion, approximated to the first order, of the term I(x + u, y + v), as follows:

I(u + x, v + y) = I(u, v) + x I_x(u, v) + y I_y(u, v)     (6.8)
where I_x and I_y are the partial derivatives with respect to the x- and y-direction, respectively. Substituting the approximation given by (6.8) into the SSD function (6.7), we obtain:

S(x, y) = Σ_{(u,v)∈W} w(u, v) [I(x + u, y + v) − I(u, v)]²
        ≈ Σ_{(u,v)∈W} w(u, v) [I(u, v) + x I_x(u, v) + y I_y(u, v) − I(u, v)]²
        = Σ_{(u,v)∈W} w(u, v) [x I_x(u, v) + y I_y(u, v)]²
        = Σ_{(u,v)∈W} w(u, v) [x² I_x²(u, v) + 2xy I_x(u, v) I_y(u, v) + y² I_y²(u, v)]
        = Σ_{(u,v)∈W} [x y] [[I_x²(u, v), I_x(u, v) I_y(u, v)], [I_x(u, v) I_y(u, v), I_y²(u, v)]] [x y]^T
        = [x y] ( Σ_{(u,v)∈W} [[I_x²(u, v), I_x(u, v) I_y(u, v)], [I_x(u, v) I_y(u, v), I_y²(u, v)]] ) [x y]^T     (6.9)
The approximate bilinear form of the SSD, for small translations (x, y), rewritten in compact matrix form, is the following:

S(x, y) ≈ [x y] A [x y]^T     (6.10)

where A is the 2 × 2 matrix calculated with the first partial derivatives of the image I along the horizontal and vertical directions:

A = [[ Σ_{(u,v)∈W} I_x²(u, v),  Σ_{(u,v)∈W} I_x(u, v) I_y(u, v) ], [ Σ_{(u,v)∈W} I_x(u, v) I_y(u, v),  Σ_{(u,v)∈W} I_y²(u, v) ]]     (6.11)

where the elements of A are summed locally within the window W centered at the pixel under examination. With the summation we have the advantage of attenuating the noise present in the image. A is the autocorrelation matrix, with its properties (symmetric and positive semi-definite), and through the products of the gradient components I_x and I_y it characterizes the function S(x, y) expressed by (6.10). The latter can describe the geometry of the local surface of the image at a given point (x, y). Indeed, Eq. (6.10) is the quadratic form associated with the symmetric matrix A (A^T = A), which represents a structure tensor¹ or local autocorrelation matrix. The eigenvalues λ1 and λ2 of the matrix A are proportional to the curvatures of the image surface and constitute a rotationally invariant description of A (see Fig. 6.7). With reference to the cases considered in Fig. 6.2, there is a small curvature in all
¹ It should be noted that the structure tensor A is actually an approximation of the covariance matrix of the gradient over the pixels of the window W under examination. Therefore, the properties of the covariance matrix are used to characterize the structure tensor, analyzing how the derivatives vary in the various points of the image through the windows W. The distribution of the data in the window being processed is analyzed through the orthonormal eigenvectors of the covariance matrix, which correspond to the directions of the principal components, while the eigenvalues indicate the variance of the data distribution (in this case, the local gradient values with respect to the pixel being processed) along the directions of the eigenvectors.
Fig. 6.7 Geometric interpretation of the autocorrelation matrix A of the gradients for the pixels within the window W. The size and direction of the axes of the ellipse are determined, respectively, by the eigenvalues λ1, λ2 (which express the magnitude of the intensity variation) and by the eigenvectors associated with A
directions in a uniform region, a small curvature along the edges and a large curvature transversal to the edges, and a large curvature in every direction in the case of corners or of an isolated point. An effective graphic representation is given in Fig. 6.8a to classify the points of interest in the 2D plane of the eigenvalues (λ1, λ2), divided into 3 distinct types of regions: corner, edge, and uniform. Rotational invariance is demonstrated by considering that every quadratic form can be diagonalized through a change of basis realized with an orthogonal matrix, that is, by changing the reference system. If we indicate with R the new basis, that is, the rotation matrix, the new quadratic form is associated with the following matrix:

A = R [[λ1, 0], [0, λ2]] R^T     (6.12)

where the eigenvectors of A are given by the columns of R, and λ1 and λ2 are the associated eigenvalues. It is observed that for orthogonal matrices, that is, R^T = R^{-1}, the transformation given by the change of basis realized with the orthogonal matrix is a similarity; therefore the diagonal matrix that is obtained has the same eigenvalues of A, and this explains the invariance to rotation, considering that the eigenvalues, proportional to the curvature, express the variation of the local intensity of the image. The eigenvectors of A, represented by the columns of R, are the new coordinate axes. The matrix A is represented graphically in Fig. 6.7 as an ellipse²
² The equation of the ellipse, in the basis of the eigenvectors determined by R, representing the new coordinate axes, is derived from (6.10) as follows:

S(x, y) ≈ [x y] A [x y]^T = [x y][v1 v2] [[λ1, 0], [0, λ2]] [v1 v2]^T [x y]^T = [x′ y′] [[λ1, 0], [0, λ2]] [x′ y′]^T = λ1 x′² + λ2 y′²

⟹  x′²/(S(x, y)/λ1) + y′²/(S(x, y)/λ2) = 1

where (x′, y′) are the coordinates along the new axes in the direction of the eigenvectors (v1, v2), and the semi-axes of the ellipse are obtained for a constant value of S(x, y).
Fig. 6.8 Classification of image pixels: a using the analysis of the eigenvalues associated with the matrix A; b using the Harris measure cornerness C
with the lengths of the axes determined by the eigenvalues, while the orientation is aligned with the eigenvectors. For λ2 ≪ λ1, the first eigenvector v1 has the direction of the largest intensity variation (maximum eigenvalue) and the second eigenvector v2 the direction of the smallest variation (minimum eigenvalue). The eigenvector encodes the direction of the edge: in fact, as shown in Fig. 6.7, for λ1 > 0 and λ2 = 0 (perfect vertical edge) the direction of the edge is perpendicular to the eigenvector associated with the eigenvalue λ1, while the maximum variation of the gradient occurs in the direction of the same eigenvector v1. The calculation of the eigenvalues λi of A (given by 6.11) could be obtained from:

λ_{1,2} = ( a_{11} + a_{22} ± √((a_{11} − a_{22})² + 4 a_{12} a_{21}) ) / 2     (6.13)

but it would require a considerable computational load, considering the square-root operation. Harris and Stephens proposed instead to calculate, for each pixel of the image I(x, y), a cornerness measure C(x, y) (called corner response) given by:

C(x, y) = det(A) − k [trace(A)]²     (6.14)

where det(A) and trace(A) are, respectively, the determinant and the trace of the matrix A, easily computable since it can be verified that

det(A) = λ1 λ2     (6.15)
trace(A) = λ1 + λ2     (6.16)

The value of the empirical constant k is generally fixed in the range 0.04 ÷ 0.06, which yields the best experimental results. It should be noted that the cornerness measure thus evaluated is a function of the eigenvalues, with the property of rotational invariance. Figure 6.8b reports the classification of the points of interest in terms of the cornerness measure C in the eigenvalue domain.
we are in the presence of a corner (or of an isolated point), while for C negative with large absolute values we are in the presence of an edge, and for small values of |C| of a uniform region.

(beyond s = 3 the number of stable points of interest decreased). Also for the spatial resolution of the images
Fig. 6.23 Pyramidal organization of scale-space and DoG images. The pyramidal structure on the left represents the sets of images called octaves made from the images convolved with the Gaussian kernels while on the right are the DoG images
the initial value σ0 = 1.6 was experimentally determined, and it is applied to the first image of every octave of the scale-space. As shown in Fig. 6.23, the subsequent octave is obtained by subsampling by a factor of 2 the image of the current octave that has σ_2 = 2σ0, corresponding to the index i = 2 (Eq. 6.37), i.e., the central image of the octave. Normally the subsampling is realized with a bilinear resampling. The process of constructing the octaves continues as long as the resolution in the spatial domain remains acceptable, also in relation to the type of images.
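The octave construction just described can be sketched as follows. This is an illustration under assumed parameters (σ0 = 1.6, s = 3, four octaves) with SciPy's Gaussian filter standing in for the book's kernels, and it applies absolute blurs to each octave base for simplicity rather than incremental ones; each octave holds s + 3 Gaussian images and s + 2 DoG images, and the next octave starts from a subsampled image.

# Sketch of the Gaussian scale-space and DoG pyramid (cf. Fig. 6.23).
import numpy as np
from scipy import ndimage as ndi

def build_pyramids(img, n_octaves=4, s=3, sigma0=1.6):
    k = 2 ** (1.0 / s)                              # scale ratio between adjacent levels
    gaussians, dogs = [], []
    base = img.astype(float)
    for _ in range(n_octaves):
        octave = [ndi.gaussian_filter(base, sigma0 * k ** i) for i in range(s + 3)]
        gaussians.append(octave)
        dogs.append([octave[i + 1] - octave[i] for i in range(s + 2)])
        # next octave: take the image at twice the initial sigma and subsample by 2
        base = octave[s][::2, ::2]
    return gaussians, dogs

gauss, dog = build_pyramids(np.random.default_rng(0).random((128, 128)))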
6.7.2.2 Localization of Points of Interest The detection of points of interest in the scale-space takes place using the DoG kernel given by (6.33) which, convolved with the image I(x, y), generates the Difference-of-Gaussian function g_DoG given by:

g_DoG(x, y, σ) = [h_G(x, y, kσ) − h_G(x, y, σ)] ∗ I(x, y) = h_G(x, y, kσ) ∗ I(x, y) − h_G(x, y, σ) ∗ I(x, y)     (6.38)
and remembering the function that describes the scale-space (6.29), we can rewrite it precisely in terms of differences of Gaussian images:

g_DoG(x, y, σ) = g_G(x, y, kσ) − g_G(x, y, σ)     (6.39)
With the Gaussian pyramidal organization, the function g_DoG(x, y, σ) is easily computable by taking the difference between adjacent Gaussian images, thus obtaining the DoG pyramid (see Fig. 6.23) with s + 2 DoG images per octave. Analyzing the DoG images of the pyramid, the locations of the extrema (minima and maxima) are detected as potential points of interest. A pixel of a DoG image is considered a potential point of interest if it is greater than or smaller than all of its 8 neighboring pixels in the same DoG image and all of the 9 + 9 neighboring pixels belonging, respectively, to the upper and lower DoG images of the scale-space (see Fig. 6.24). For each extreme pixel detected, the location information (x, y) and the scale σ are determined. The search for extrema in the DoG pyramid is made by considering only s DoG images per octave, excluding the first and the last. The points of interest found in the scale-space are many and some of them are unstable. Furthermore, given the discrete nature of the DoG pyramid, the location
Fig. 6.24 Detection and localization of maxima and minima in DoG images of scale-space
of the extreme pixels detected is not accurate. Finally, many extreme points found are caused by noise and low local contrast, or are located along edges and lines. The accuracy of the location of the points of interest can be improved by performing an interpolation around the extremum under examination with the use of a 3D quadratic function.
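A minimal sketch of one octave of the Gaussian/DoG scale-space and of the 26-neighbor extremum test described above is given next. The values s = 3 and σ0 = 1.6 follow the text, while the helper names, the pre-threshold on the response, and the handling of ties at plateaus are simplifications.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def build_octave(img, s=3, sigma0=1.6):
    """Return the s+3 Gaussian images and the s+2 DoG images of one octave."""
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_filter(img.astype(float), sigma0 * k ** i)
                 for i in range(s + 3)]
    dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]
    return gaussians, np.stack(dogs)            # dogs shape: (s+2, H, W)

def dog_extrema(dogs, threshold=0.01):
    """Pixels that are extrema over their 26 neighbours in (scale, y, x)."""
    cube_max = maximum_filter(dogs, size=3)     # max over a 3x3x3 neighbourhood
    cube_min = minimum_filter(dogs, size=3)
    is_max = (dogs == cube_max) & (dogs > threshold)
    is_min = (dogs == cube_min) & (dogs < -threshold)
    # The first and last DoG levels are used only for the comparison
    mask = np.zeros_like(dogs, dtype=bool)
    mask[1:-1] = True
    return np.argwhere((is_max | is_min) & mask)   # rows of (level, y, x)
```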
6.7.2.3 Accuracy Estimation of Extreme Points Location

The 3D interpolation is realized using the quadratic Taylor expansion of the Difference of Gaussian scale-space function (Eq. 6.38). If we denote by D(x) the interpolating 3D function centered in the point under examination x = (x, y, σ), approximating with a Taylor series truncated to the second order, we get the following:

D(\mathbf{x}) = D + \frac{\partial D^T}{\partial \mathbf{x}}\mathbf{x} + \frac{1}{2}\,\mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2}\,\mathbf{x}   (6.40)

where D (recall that it represents the Difference of Gaussian function g_DoG(x, y, σ) given by 6.38) and the derivatives are calculated at the point under examination, while x indicates the offset of the neighboring points with respect to that point. To get the true location of the extremum, we differentiate the previous equation and set it to zero. The offset vector x̂ = (x̂, ŷ, σ̂) at which the derivative of the function vanishes is then

\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}   (6.41)

If the offset vector x̂ found is greater than 0.5 in any dimension, it is assumed that the extreme point found is closer to another potential extreme point. In this case, the point in question is changed and the interpolation operation is repeated with respect to this new point. Otherwise, the offset vector x̂ is added to the location of the extreme point under consideration to obtain a more accurate estimate (at the subpixel level) of the new location. An analogous procedure for accurately locating the extreme points in the scale-space has been proposed by Lindeberg [15], based on hybrid pyramids and operating in real time.
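A possible implementation of one refinement step of (6.40) and (6.41) on the DoG stack of a single octave is sketched below, using central finite differences for the derivatives; the (level, y, x) array layout is an assumption. It also returns D(x̂), anticipating (6.42) used in the next subsection.

```python
import numpy as np

def refine_extremum(dogs, level, y, x):
    """One step of the quadratic refinement (6.40)-(6.41).

    `dogs` is the stack of DoG images of one octave, indexed (level, y, x).
    Returns the offset (d_level, dy, dx) at which the fitted quadratic is
    stationary, and the interpolated value D(x_hat) of (6.42).
    """
    D = dogs.astype(float)
    c = D[level, y, x]
    # Gradient by central finite differences
    g = 0.5 * np.array([D[level + 1, y, x] - D[level - 1, y, x],
                        D[level, y + 1, x] - D[level, y - 1, x],
                        D[level, y, x + 1] - D[level, y, x - 1]])
    # Hessian of D in (scale, y, x)
    Hss = D[level + 1, y, x] + D[level - 1, y, x] - 2 * c
    Hyy = D[level, y + 1, x] + D[level, y - 1, x] - 2 * c
    Hxx = D[level, y, x + 1] + D[level, y, x - 1] - 2 * c
    Hsy = 0.25 * (D[level + 1, y + 1, x] - D[level + 1, y - 1, x]
                  - D[level - 1, y + 1, x] + D[level - 1, y - 1, x])
    Hsx = 0.25 * (D[level + 1, y, x + 1] - D[level + 1, y, x - 1]
                  - D[level - 1, y, x + 1] + D[level - 1, y, x - 1])
    Hyx = 0.25 * (D[level, y + 1, x + 1] - D[level, y + 1, x - 1]
                  - D[level, y - 1, x + 1] + D[level, y - 1, x - 1])
    H = np.array([[Hss, Hsy, Hsx], [Hsy, Hyy, Hyx], [Hsx, Hyx, Hxx]])
    offset = -np.linalg.solve(H, g)            # x_hat of (6.41)
    contrast = c + 0.5 * g.dot(offset)         # D(x_hat) of (6.42)
    return offset, contrast
```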
6.7.2.4 Extreme Points Filtering with Low Contrast

In order to eliminate extreme points with low contrast, the value of the function D(x) is calculated (by using 6.40), for the offset vector x̂, as follows:

D(\hat{\mathbf{x}}) = D + \frac{1}{2}\frac{\partial D^T}{\partial \mathbf{x}}\hat{\mathbf{x}}   (6.42)

Lowe experimentally evaluated that all potential extreme points such that |D(x̂)| < 0.03 are discarded (unstable extreme points), assuming the pixel values of the images normalized in the interval [0, 1].
6.7.2.5 Filtering of Extreme Points Along the Edges

Lowe has evaluated a further improvement of the stability of the extreme points by eliminating the potential points found along the edges, which present a high value of the DoG function but a location that is not well determined. In particular, extremes with a high value of the DoG function in only one direction (the one perpendicular to the direction of the edge) are eliminated. In analogy to the approach used by Harris, the principal curvatures at the extreme point are analyzed to determine whether it is an extreme point located on an edge. The filtering of these extreme points along the edges is carried out using the Hessian matrix:

\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}   (6.43)

Indeed, the eigenvalues of this matrix are proportional to the principal curvatures of the function D. It turns out that the ratio r of the two eigenvalues λ1 and λ2, with λ1 > λ2 and r = λ1/λ2, is sufficient to analyze and discard extreme points along the edges. Let's see how this is possible without calculating the eigenvalues. The trace Tr(H) and determinant Det(H) of the Hessian matrix give, respectively, the sum and the product of the eigenvalues:

Tr(\mathbf{H}) = D_{xx} + D_{yy} = \lambda_1 + \lambda_2   (6.44)

Det(\mathbf{H}) = D_{xx} D_{yy} - D_{xy}^2 = \lambda_1 \lambda_2   (6.45)
Let’s now calculate the following ratio R=
T r (H)2 (r λ2 + λ2 )2 (r + 1)2 (λ1 + λ2 )2 = = = 2 Det (H) λ1 λ2 r r λ2
(6.46)
from which it emerges that it depends only on r, that is, on the ratio of the two eigenvalues (r = λ1/λ2). If the two eigenvalues are almost identical, the ratio R reaches its minimum value. The value of R instead increases with the ratio r between the eigenvalues, and this corresponds to high values of the absolute differences between the two principal curvatures of the DoG function. Lowe [2] has experimentally evaluated that a threshold rs = 10 on the ratio between the eigenvalues can be used, with the following relation:

R = \frac{Tr(\mathbf{H})^2}{Det(\mathbf{H})} < \frac{(r_s + 1)^2}{r_s}   (6.47)
so it can be verified whether the ratio R, for the potential point of interest under examination, is greater than the bound determined by the threshold rs. If this occurs, the extreme point under consideration must be eliminated, since its curvature ratio would be r > 10. It should be noted that this procedure for eliminating points of interest along the edges is analogous to that adopted for the Harris corner detector, which contemplates the curvature analysis based on the eigenvalues of the local autocorrelation function.
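A hedged sketch of the two rejection tests follows: low contrast (|D(x̂)| < 0.03) and edge response (6.47) with r_s = 10. The finite-difference Hessian and the helper signature are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def keep_keypoint(dog, y, x, contrast, contrast_thr=0.03, r_s=10.0):
    """Return True if a refined extremum passes both rejection tests.

    `dog` is the DoG image at the keypoint scale; `contrast` is D(x_hat)
    from the quadratic fit (6.42), assuming pixel values normalised to [0, 1].
    """
    if abs(contrast) < contrast_thr:          # low-contrast test
        return False
    # 2x2 Hessian of the DoG image (6.43), by finite differences
    Dxx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    Dyy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    Dxy = 0.25 * (dog[y + 1, x + 1] - dog[y + 1, x - 1]
                  - dog[y - 1, x + 1] + dog[y - 1, x - 1])
    tr = Dxx + Dyy
    det = Dxx * Dyy - Dxy * Dxy
    if det <= 0:                              # curvatures of opposite sign
        return False
    return tr * tr / det < (r_s + 1) ** 2 / r_s   # edge-response test (6.47)
```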
Fig. 6.25 Intermediate results of the SIFT algorithm related to the detection of the extreme points in the scale-space realized with pyramidal structure consisting of sets of Gaussian images (4 octave levels) and a number of intervals s = 2. For each level of the pyramid, the number of Gaussian images is s + 3 = 5 from which s + 2 = 4 DoG images are generated
Fig. 6.26 Final results of the SIFT algorithm related to the detection and localization of points of interest in the scale-space. a Extreme points (1529) found in the DoG image; b Potential points of interest found (315) after eliminating extreme points with low contrast using the threshold of 0.03; c Points of interest remaining (126) after eliminating points along the edges with the ratio between the principal curvatures of r = 10; d Orientation of the points of interest reported in c
6.7.2.6 Points of Interest Detector Results with the SIFT Algorithm

We now present some results of the SIFT algorithm, applied to the image of Fig. 6.25, relative to the detection of local extrema in the scale-space and to the localization of the points of interest. Figure 6.25 shows the intermediate results of the SIFT algorithm related to the detection of extreme points in the scale-space. The latter is realized with a pyramidal structure consisting of sets of Gaussian images (4 octave levels) and a number of intervals s = 2. For each level of the pyramid the number of Gaussian images is s + 3 = 5, from which s + 2 = 4 DoG images are generated containing the set of potential points of interest. In Fig. 6.26a the 1529 extreme points detected and localized are shown on the original image, while figures (b) and (c) report, respectively, the extreme points after the elimination of those with low contrast, applying the threshold of 0.03, and after the elimination of the extreme points considered edge elements, i.e., those with a ratio greater than 10 (r > 10) between the principal curvatures.
6.7.3 SIFT Descriptor Component

With the detector component of the SIFT algorithm, for each point of interest a scale value and a position in the image were calculated. It is now necessary to determine, for each point of interest found, a local descriptor with the characteristic of being invariant to rotation (with planar approximation) and invariant to limited changes in viewpoint and illumination. This is achieved with the descriptor component of the SIFT algorithm, divided into two subcomponents: Determination of the dominant orientation and Generation of the Descriptor.
6.7.3.1 Detection of the Dominant Orientation

A first step is to assign a robust dominant direction to each point of interest. For this purpose, we consider the σ scale of a point of interest to select the Gaussian image gG(x, y, σ) (of the scale-space function, Eq. 6.29) with the scale closest to that of the point under examination; in this way all the calculations are made invariant to the scale. For the points in a small window around the point considered, at the selected scale σ, we calculate two quantities, approximated by finite differences: the gradient magnitude m(x, y) and the orientation θ(x, y):

m(x, y) = \sqrt{[g_G(x+1, y) - g_G(x-1, y)]^2 + [g_G(x, y+1) - g_G(x, y-1)]^2}   (6.48)

\theta(x, y) = \tan^{-1}\!\big([g_G(x, y+1) - g_G(x, y-1)]\,/\,[g_G(x+1, y) - g_G(x-1, y)]\big)   (6.49)
With the information on the direction of the gradient at the points included in a window centered at the point of interest in question, a histogram of the local orientations at the selected scale is generated. The angular resolution of the orientation histogram is 10°; to cover the entire interval of 360°, the local orientation information is accumulated in 36 cells (bins). Each sample added to the histogram is weighted by the magnitude of its gradient and by a Gaussian function (discretized with a circular window) with a σ that is 1.5 times larger than the scale of the point of interest under consideration, on which the circular window is centered (see Fig. 6.27). The dominant orientation assigned is the one corresponding to the peak of the accumulated histogram. If there are other local peaks that exceed 80% of the absolute maximum, additional points of interest are created with the same scale and position as the point of interest in question and with those orientations. Finally, for a more accurate estimate of the peak position, a parabolic interpolation is performed between the peak value and the values of the adjacent points in the histogram. Figure 6.26d shows the corresponding dominant orientations of the points of interest shown in Fig. 6.26c.
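The following sketch accumulates the 36-bin weighted orientation histogram described above and returns the dominant peaks; the window radius and the omission of the parabolic peak interpolation are simplifications of the procedure.

```python
import numpy as np

def dominant_orientations(gauss, y, x, sigma, num_bins=36, peak_ratio=0.8):
    """36-bin histogram of local gradient orientations around (y, x) in the
    Gaussian image `gauss` at the keypoint scale; returns the dominant
    angles in degrees (main peak plus peaks above 80% of it)."""
    weight_sigma = 1.5 * sigma
    radius = int(round(3 * weight_sigma))
    bin_width = 360.0 / num_bins              # 10 degrees
    hist = np.zeros(num_bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < gauss.shape[0] - 1 and 0 < xx < gauss.shape[1] - 1):
                continue
            gx = gauss[yy, xx + 1] - gauss[yy, xx - 1]
            gy = gauss[yy + 1, xx] - gauss[yy - 1, xx]
            m = np.hypot(gx, gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360.0
            w = np.exp(-(dx * dx + dy * dy) / (2 * weight_sigma ** 2))
            hist[int(theta // bin_width) % num_bins] += w * m
    peak = hist.max()
    return [b * bin_width for b, v in enumerate(hist) if v >= peak_ratio * peak]
```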
6.7.3.2 Descriptor Generation

At this level of the SIFT algorithm, each point of interest found is characterized by the following information: the position (x, y) in the image, the characteristic scale σ, and
Fig. 6.27 Generation of the directional histogram of the local gradient from which the dominant direction of the window centered at the point of interest in question is determined. In the example in the figure, a second dominant direction is highlighted (it exceeds 80% of the maximum peak) and a second point of interest is created with the same location and scale of the point in question
the dominant local orientation θ (with possible secondary dominant directions). The next step is to create for each point of interest a descriptor that includes additional information based on the local characteristics, previously calculated, in the window (of adequate size, as we will see below) centered at each point of interest. This information to be included in the descriptor must be such as to generate points of interest with the following attributes: (a) as invariant as possible to changes in the point of view; (b) invariant to changes in illumination (also nonlinear); (c) highly distinguishable and repeatable. To obtain descriptors with the characteristics described above, Lowe [2] obtained the best experimental results by selecting the Gaussian image gG(x, y, σ) with the scale corresponding to that of the point under consideration and considering the gradient-orientation information, calculated in the previous step, included in the neighborhood of the point of interest; in particular, a window of size 16 × 16 centered at the point of interest in question is considered. To obtain invariance to rotation, the coordinates of the descriptor and the orientations of the local gradient, associated with this window, are rotated with respect to the dominant direction of the point of interest in question. The gradient magnitude of each sample of the considered window is weighted with a Gaussian function with σ equal to half the width of the window. The purpose of the Gaussian function is to give greater weight to the samples closer to the position of the point of interest with respect to which the descriptor is centered. The 16 × 16 window is partitioned into 4 × 4 sub-windows (see Fig. 6.28), each comprising 4 × 4 samples, from which the orientation histograms are calculated. In the histograms accumulated for each sub-window, the orientations are discretized in 8 intervals (8 corresponding bins), each with a resolution of 45°. Each element of these histograms is represented by a vector characterized by the direction (eight possible orientations) and by the module corresponding to the value of the
Fig. 6.28 Generation of the descriptor vector for each point of interest. The figure shows how, starting from the gradient image chosen in the scale-space in relation to the σ of the point of interest in question, a window of 16 × 16 samples centered at the point of interest is considered; it is then partitioned into 4 × 4 sub-windows, from which 16 local orientation histograms are created, each accumulated with an angular resolution of 8 orientations (8 bins, each covering an angle of 45°). A descriptor vector of 16 × 8 = 128 elements follows, characterized by 128 local orientations
gradient. To improve the accuracy of these local histograms, a trilinear interpolation is performed to distribute the gradient value of each point also into the adjacent bins. In particular, the value of each bin is multiplied by an additional weight coefficient of 1 − d, where d is the distance between the sample and the central position of the bin (expressed in units of the bin width of the histogram). The resulting descriptor thus turns out to be a vector that includes the set of bins of all the histograms of the sub-windows, i.e., 4 × 4 = 16 histograms each of 8 bins. It follows that each point of interest is associated with a descriptor characterized by 4 × 4 × 8 = 128 elements, each representing the direction and magnitude information of the gradient.
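A simplified sketch of the 4 × 4 × 8 layout follows: it accumulates the 128 elements from a 16 × 16 window but omits the rotation to the dominant orientation and the trilinear interpolation described above, so it only illustrates how the histogram grid is organized.

```python
import numpy as np

def sift_like_descriptor(gauss, y, x):
    """Simplified 4x4x8 = 128-element descriptor around (y, x).

    Gradients of the 16x16 window are accumulated into 16 sub-window
    histograms of 8 orientation bins (45 degrees each); rotation to the
    dominant orientation and trilinear interpolation are omitted here.
    """
    desc = np.zeros((4, 4, 8))
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            yy, xx = y + dy, x + dx
            if not (0 < yy < gauss.shape[0] - 1 and 0 < xx < gauss.shape[1] - 1):
                continue
            gx = gauss[yy, xx + 1] - gauss[yy, xx - 1]
            gy = gauss[yy + 1, xx] - gauss[yy - 1, xx]
            m = np.hypot(gx, gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360.0
            w = np.exp(-(dx * dx + dy * dy) / (2 * 8.0 ** 2))   # sigma = half window
            cell_y, cell_x = (dy + 8) // 4, (dx + 8) // 4        # which 4x4 sub-window
            desc[cell_y, cell_x, int(theta // 45.0) % 8] += w * m
    return desc.ravel()                                          # 128 elements
```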
6.7.3.3 Contrast Invariance

The SIFT descriptor thus obtained is made more robust with respect to changes in lighting conditions by performing two normalizations of the descriptor vector. With the first, the descriptor vector is normalized to unit length; in this way the gradient information becomes invariant to local transformations related to the intensity of the image around each point of interest. In fact, if the pixel intensities are changed by a constant multiplicative factor (due to a contrast variation), the normalization process removes the consequent variation of the gradient by the same constant factor (gain). Analogously, we have invariance to a brightness change, when the intensity of the pixels is modified by the addition of a constant, since the value of the gradient is not influenced, as it is calculated with finite differences of the pixel intensities.
The second normalization limits the effect of nonlinear lighting changes, generally caused by sensor saturation or by changes in lighting due to the 3D nature of the objects and to (even limited) changes in viewpoint. Such changes in illumination substantially alter the gradient magnitudes in the descriptor vector and much less the local orientations. Therefore, the gradient values are thresholded by clipping to 0.2 all the elements of the vector that exceed this value, and then a new normalization of the vector to unit length is performed.
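A minimal sketch of the two normalizations (unit length, clipping at 0.2, unit length again) is the following:

```python
import numpy as np

def normalize_descriptor(v, clip=0.2):
    """Unit-normalise, clip large gradient entries, then re-normalise."""
    v = v / (np.linalg.norm(v) + 1e-12)   # invariance to a multiplicative gain
    v = np.minimum(v, clip)               # damp nonlinear illumination effects
    return v / (np.linalg.norm(v) + 1e-12)
```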
6.7.3.4 Biological Motivation of the SIFT Descriptors

The SIFT descriptors, based on local information depending on the position and on the histograms of the gradient orientations, can be regarded as measures that mimic the receptive fields of the complex cells in the primary visual cortex. From the studies of Hubel and Wiesel [16] it emerges that these cells are sensitive to the orientation of the edges and almost insensitive to their position. Furthermore, it is known from studies on biological vision how the receptive fields of the retina and of the visual cortex are modeled in terms of Gaussian derivatives, as proposed in works concerning artificial vision [17,18]. These studies have shown that the receptive fields capture information concerning the reflectance characteristics of the surface of the objects (see Sect. 2.3 Vol. I), useful for their recognition. The invariance of the SIFT descriptors to scale, rotation, and lighting, and to moderate changes of viewpoint, allows their use in various applications of artificial vision, in particular for the recognition of objects (each object in the scene is characterized by the set of descriptors associated with the points of interest detected), for stereo vision, and for motion analysis, for which it is necessary to match the descriptors (correspondence problem) of the points of interest found in sequences of space-time varying images. A distance-based functional (for example, the Euclidean one) is used to evaluate the similarity of the descriptor vectors for the correspondence problem and the object recognition problem.
6.7.3.5 Dimensional Adequacy and Computational Complexity of the SIFT Descriptors

The SIFT descriptor vectors, defined with a dimensionality of 128 elements, seem a fair compromise between computational complexity and results achieved, in particular in stereo vision and motion analysis, and in the recognition of an object in a vast archive of objects, where the calculation of the correspondence is required for a large number of points of interest. Moreover, it has been experimentally verified that by reducing the dimensions of the descriptor vector the results of the correspondences are compromised: there is a significant decrease in the number of corresponding points. On the other hand, by increasing the length of the descriptor vector, there are no substantial improvements, while performance deteriorates in the presence of distortions and occlusions. Despite the length of 128 elements, good performances are obtained also in applications operating in real time, thanks to the containment of the computational complexity achieved with the BBF (Best Bin First) algorithm for the optimal selection of the corresponding points [19].
Fig. 6.29 Application of the SIFT algorithm using the Lowe software to find the correspondence of points between pairs of images. a Corresponding points in the context of recognizing plane objects; note the property of the invariance of the SIFT points on rotation and scale. b Corresponding points found between stereo image pair
To minimize the number of ambiguous correspondences, considering the diversity of the images, a correspondence (match) between two descriptors is accepted when the ratio between the distances of the nearest and the second-nearest neighbor is less than 0.8. This criterion has proved to be very effective, particularly in object recognition applications. With several experiments, it was found that the accuracy of the correspondence is around 50% if the images are acquired with differences in viewpoint of 50°. In the presence of images with moderate affine transformations, the accuracy of the matches also reaches 80%. Figure 6.29 highlights the test results of the SIFT algorithm to solve the problem of the correspondence of the points of interest, applied in the context of plane object recognition and in the context of stereo vision. In figure (a), the points of interest are found first, independently in the two images, and then the correspondences are found to identify an object characterized by the SIFT descriptors (Lowe's software was used). Figure (b) shows the matches found in a pair of stereo images. The scale-space was constructed with s = 2 intervals to significantly reduce calculation times. In the left image (of size 398 × 300), 2520 points were found, while in the right one, 2493 points were found. The number of corresponding points found is 1471 and of these only 241 are displayed for visibility reasons. The pairs of points that have exceeded the match criterion indicated above are displayed by drawing a line. The score of the corresponding points found is calculated as the ratio between the number of correct matches and the minimum number of points of interest found in the pair of images. In the example considered the ratio is 1471/2493 = 0.59. At the graphical level, the correct correspondence is observed with the nonintersection of the lines. In Fig. 6.30 are shown instead the results of the correspondence of the SIFT points between images with scales differing by a factor of two. In the smaller image (with dimensions 199 × 150), 695 points were detected, while in the larger one 2526 points were detected. The number of corresponding points found that exceeded the match criterion was 329 (all displayed with red lines). The score of the corresponding points found is calculated as the ratio between the number of correct matches and
Fig. 6.30 Application of the SIFT algorithm using the Lowe software to find the correspondence of points between a pair of images scaled between them by a factor of two
the number of points found in the smaller image, resulting in 329/695 = 0.47. The lower score, compared to the stereo pair, is due to the lower number of points found in the smaller image.
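A brute-force sketch of the nearest/second-nearest ratio test with threshold 0.8 is given below; the BBF search mentioned above is replaced here by exhaustive Euclidean distances for clarity.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Match rows of desc1 to rows of desc2 (n1 x 128 and n2 x 128 arrays).

    A match is accepted when the nearest neighbour is closer than `ratio`
    times the second nearest (brute force, not the BBF search)."""
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dist)[:2]
        if dist[j1] < ratio * dist[j2]:
            matches.append((i, j1))
    return matches
```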
6.7.4 GLOH Descriptor

This descriptor, with the acronym of Gradient Location and Orientation Histograms [20], is an extension of the SIFT descriptor. Like SIFT, it is composed of a set of orientation histograms of the local gradients. The local region, instead of being divided into square subregions, is subdivided into circular regions with 17 spatial bins, each consisting of 16 orientation intervals, thus generating a descriptor vector of 272 elements (see Fig. 6.31). To create a GLOH descriptor, we consider a circular window of 15 pixel radius centered at a point of interest. This window is divided into three concentric circular sectors of radius 6, 11, and 15 pixels. The circular sectors of radius 11 and 15 pixels are angularly subdivided into 8 radial sectors, thus realizing a partition of the entire circular window into 2 × 8 = 16 spatial bins which, added to the central sector, give a total of 17 spatial bins. For each of the 17 spatial bins, the gradient orientations are accumulated into 16 directions. The pixels of each spatial bin contribute to each of the 16 possible directions, thus obtaining a vector of 17 × 16 = 272 elements. By applying principal component analysis (PCA), the descriptor vectors are normally reduced to 128 significant elements as in SIFT. GLOH is more robust than SIFT in structured scenes and in particular shows a high distinguishability, while SIFT shows good performance with highly textured images. In some object recognition applications, it shows sufficient performance even reducing the descriptor vector to 42 elements, in analogy to PCA-SIFT. A detailed comparative evaluation of GLOH with other methods and in various application contexts is reported in [21].
Fig. 6.31 GLOH descriptor, an extension of the SIFT descriptor, is organized in a grid in logarithmic-polar locations with a radial resolution of 3 bins (with a radius of 6, 11, and 15) and an angular resolution of 8 intervals, resulting partitioned into 2 × 8 + 1 = 17 spatial bins. Each spatial bin resolves 16 orientations of the gradient with the particularity that the central one is not divided into angular directions. The descriptor develops a total of 16 × 17 = 272 elements for each point of interest
6.8 SURF Detector and Descriptor

Speeded-Up Robust Features (SURF) [22,23] is a detector and descriptor of interest points based on the same principles as the SIFT detector. It consists of two components, a detector and a descriptor of points of interest, with the characteristic of being invariant to rotation, translation, and scale change. The two components, detector and descriptor, can operate independently.
6.8.1 SURF Detector Component

SURF, like SIFT, organizes the scale-space in octaves using the Gaussian function with scale parameter σ. The images of each octave (called maps, containing the detector responses) are spatially scaled uniformly in levels that cover a range of 2σ. SURF, like SIFT, constructs the pyramidal scale-space with different octave levels, and the points of interest are extracted by considering the extrema among the 8-neighbors of the point under examination in the current level and its 2 × 9 neighbors in the adjacent upper and lower levels (see Fig. 6.24).
6.8.1.1 Integral Image

A substantial difference of the SURF detector with respect to SIFT regards the approximation of the Hessian matrix and the use of Integral Images, the latter introduced in [23–25] (in Computer Vision applications) to reduce the calculation time considerably and to have a detector operating in real time.
Fig. 6.32 Intensity sum calculation with integral image. a Integral image construction I from the original image I ; b Calculation of the sum S of the intensities within an arbitrary rectangular region; c Graphical representation of the calculation (with only 3 operations) of the summation of the intensity of the arbitrary region known only the 4 vertices of the rectangle and using the corresponding 4 values of the integral image
The integral image is obtained with an algorithm that reorganizes the original image so as to make the calculation of the sum of the intensity values of arbitrary rectangular portions very fast (i.e., to calculate the rectangular area). Given a generic image I, the value of the integral image \mathcal{I}(x, y) represents the sum of the pixel values of the original image contained within the rectangle, constructed on the image I, whose main diagonal joins the point of origin (0, 0) and the point under consideration (x, y):

\mathcal{I}(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j)   (6.50)
where the ends of the rectangle are included in the summation (see Fig. 6.32a). The integral image \mathcal{I}(x, y) can be calculated with a single pass over the original image I using the following expressions:

S(x, y) = S(x, y - 1) + I(x, y) \qquad \mathcal{I}(x, y) = \mathcal{I}(x - 1, y) + S(x, y)   (6.51)
where S(x, y) represents the cumulative sum per row, I(x, y) the value of the original pixel, and it is assumed that S(x, −1) = 0 and \mathcal{I}(−1, y) = 0.
With the computational artifice of the integral image \mathcal{I}, it is possible to compute, at the cost of only 3 addition operations, the summation Σ of the intensity of all the pixels of an arbitrary rectangular window of the original image I, as follows:

\Sigma = \mathcal{I}(D) + \mathcal{I}(A) - \mathcal{I}(B) - \mathcal{I}(C)   (6.52)

where for simplicity the vertices of the rectangle on the integral image are indicated with A, B, C, and D (see Fig. 6.32b). At these points, the integral values are already pre-calculated, according to (6.50), and for the calculation of the summation Σ these vertices are used as look-up table indexes (see Fig. 6.32c) to take the integral values to be used in (6.52). From this last consideration, an important property of the integral image follows, that is, the calculation time of the summation Σ is independent
of the dimensions of the rectangular window. This property is well exploited in SURF in the calculation of convolutions with appropriate kernels: despite their different dimensions, the calculation time of the relative convolutions remains constant (as we will see later).
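A sketch of (6.50)-(6.52) follows. An extra zero row and column are added to the integral image (an implementation convenience, not part of the text) so that the border conditions of (6.51) need no special handling.

```python
import numpy as np

def integral_image(img):
    """Integral image with an extra zero row/column, so that the sum over
    any rectangle requires only 3 additions as in (6.52)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img.astype(float), axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of the original pixels in rows y0..y1-1 and columns x0..x1-1,
    independent of the rectangle size: I(D) + I(A) - I(B) - I(C)."""
    return ii[y1, x1] + ii[y0, x0] - ii[y0, x1] - ii[y1, x0]
```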
6.8.1.2 Fast Hessian—FH Detector

The second important choice of the SURF detector is the use of the Hessian matrix to accurately determine the location of the points of interest, in correspondence of which the determinant assumes a maximum value. Recall that the second derivatives used in the Hessian matrix are useful for characterizing the local geometry around points of interest. In fact, the determinant of the Hessian matrix has the property of coinciding with the product of its eigenvalues (see Sect. 6.5) and can be used to evaluate the extreme point. If the determinant assumes a negative value, then the eigenvalues have a different sign and the point in question is not an extremum; if positive, then both eigenvalues are either positive or negative and in both situations the point is detected as an extreme point. SURF also uses the determinant of the Hessian matrix H(x, σ) as a basis for selecting the characteristic scale of the points of interest, which we rewrite here as follows:

H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}   (6.53)

where the point under consideration is indicated with x = (x, y), and L_{xx}(x, σ), L_{yy}(x, σ) and L_{xy}(x, σ) are the second derivatives of the image I(x) convolved with a Gaussian kernel with standard deviation σ, given by (6.61). The Gaussian function in the different scale-invariant detectors is used as a multiscale function, and the convolution kernel must be adequately discretized and its support window sized in relation to the scale parameter σ (see Fig. 6.33a). This discretization involves the introduction of artifacts in the convolved image, in particular for small Gaussian kernels (see Sect. 9.12.6 Vol. I for a quantitative measure of such artifacts). SURF introduces a further approximation using a kernel based on a rectangular function (box) as shown in Fig. 6.33b. The approximated convolutions L_{xx}, L_{yy}, and L_{xy} are evaluated with an optimized computational cost thanks to the use of the integral image, with the advantage of being also independent of the kernel size. Experimentally, it is shown that, despite these approximations, the performances of the detector in determining the points of interest are comparable with those obtained using discretized Gaussian kernels. The 9 × 9 kernels shown in Fig. 6.33b are obtained from approximate Gaussians with σ = 1.2 and constitute the smallest basic scale (i.e., the one with the best resolution) to calculate the map of the potential points of interest. The second-order derivatives of the Gaussian approximated with the rectangular function are indicated with D_{xx}, D_{yy} and D_{xy}. To simplify the calculation of the convolutions, the weights of the relative kernels are chosen as simple integers, producing an approximate value of the
Fig. 6.33 Kernel of the Laplacian of Gaussian. a The first row shows the discretized versions of the second-order Gaussian derivatives L x x , L yy and L x y , respectively, in the direction of x, y, and diagonal x y; b The second line shows the corresponding approximate values indicated with Dx x , D yy and Dx y ; c Construction of kernels scaled from 9 × 9 to 15 × 15 (displayed only D yy and Dx y ) while maintaining the geometry of the lobes in proportion
normalized determinant of the Hessian matrix:

Det(H_{approx}) = \frac{D_{xx}(\mathbf{x}, \sigma)\,D_{yy}(\mathbf{x}, \sigma) - (w\,D_{xy}(\mathbf{x}, \sigma))^2}{\sigma^2}   (6.54)

where the weight w is calculated with the following ratio between the approximate filters and the corresponding discrete Gaussians:

w = \frac{|L_{xy}(1.2)|_F\,|D_{yy}(9)|_F}{|L_{yy}(1.2)|_F\,|D_{xy}(9)|_F} \simeq 0.9   (6.55)

This weight is used to further balance the weights of the kernel boxes in the expression (6.54) of the determinant of the Hessian matrix. This is necessary to maintain the information content between the Gaussian kernels and the approximate Gaussian kernels, that is, to reduce the error introduced with the approximation. The terms |•|_F represent the Frobenius matrix norm:

|A(\mathbf{x}, \sigma)|_F \equiv \sqrt{\sum_{\mathbf{x}} |A(\mathbf{x}, \sigma)|^2}
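The sketch below builds the 9 × 9 box kernels of Fig. 6.33b and evaluates the approximate determinant (6.54) with w = 0.9. The exact lobe offsets follow the commonly published filter layout and should be treated as indicative; moreover, the box responses are computed here by ordinary convolution for clarity, whereas a real SURF implementation obtains each box sum in constant time from the integral image.

```python
import numpy as np
from scipy.ndimage import convolve

def surf_box_kernels_9x9():
    """Box-filter approximations of the 9x9 second Gaussian derivatives
    (sigma = 1.2); Dxx is the transpose of Dyy.  The sign convention of
    Dxy is irrelevant for (6.54) since Dxy appears squared."""
    Dyy = np.zeros((9, 9))
    Dyy[0:3, 2:7] = 1.0      # positive lobe
    Dyy[3:6, 2:7] = -2.0     # negative central lobe
    Dyy[6:9, 2:7] = 1.0      # positive lobe
    Dxy = np.zeros((9, 9))
    Dxy[1:4, 1:4] = 1.0
    Dxy[1:4, 5:8] = -1.0
    Dxy[5:8, 1:4] = -1.0
    Dxy[5:8, 5:8] = 1.0
    return Dyy.T, Dyy, Dxy   # Dxx, Dyy, Dxy

def approx_hessian_determinant(img, w=0.9, sigma=1.2):
    """Response map of Eq. (6.54) at the base 9x9 scale."""
    kxx, kyy, kxy = surf_box_kernels_9x9()
    img = img.astype(float)
    Dxx = convolve(img, kxx)
    Dyy = convolve(img, kyy)
    Dxy = convolve(img, kxy)
    return (Dxx * Dyy - (w * Dxy) ** 2) / sigma ** 2   # normalisation as in (6.54)
```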
The weight w is assumed to be constant, although from a theoretical point of view it depends on the scale parameter σ, as results from (6.55); a minimal impact on the results has been experimentally verified. The approximate value of the determinant expressed by (6.54) is used as a representative measure of the points of interest (blobs in this case) at the location x. The responses obtained with (6.54) produce the multiscale maps of the blobs, and the local maxima are determined in a way similar to the SIFT approach, which consists in analyzing the scale-space, but considering the use of the integral images. Figure 6.34 shows the pyramidal representation of the scale-space used in SIFT and SURF. It should be noted that in SIFT the original image is directly convolved with the Gaussian kernel and repeatedly resized (to generate
Fig. 6.34 Construction of the scale-space. In SURF, it is realized with integral images and using approximate LoG filters with arbitrary and incremental dimensions while traditionally, as in SIFT, the image is subsampled in every octave and an increasingly larger Gaussian kernel is used
octave images) to obtain smoothed images and then derive the multiscale DoG (Difference of Gaussian) images from which to extract potential points of interest. In SURF instead, with the use of integral images and box kernels, it is not necessary to apply the same filter to the output of different levels of previously filtered images: box kernels of different sizes (keeping the filter lobes in proportion) can be applied directly to the original image. The calculation time of each convolution, as the kernel size changes, remains constant, and the convolution process can also be parallelized. Therefore, the scale-space is analyzed by increasing the size of the base filter instead of iteratively subsampling the original image (see Fig. 6.34). Figure 6.33c shows how increasing the size of the kernels preserves the filter structures with the lobes scaled adequately in proportion. Returning to the representation of the scale-space, each octave will have associated the results of the convolutions of the original image filtered with the box kernels of increasing dimensions. The first octave will have associated the results of the filtering relative to the initial scale using the approximate Gaussian derivative filters (Fig. 6.35) with scale parameter σ = 1.2. Subsequent levels are obtained by always filtering the original image with gradually growing box kernel sizes. An octave covers a scale interval equal to a factor of two, and this implies increasing the kernel size to at least twice the initial value. This scale interval, in each octave, is uniformly divided and, given the discrete nature of the integral image, the minimum scale difference between two contiguous scale levels depends on the size of the positive and negative lobes of the second approximate Gaussian derivative in the direction along the x- or y-axis, which is set to 1/3 of the kernel size. Starting with the 9 × 9 kernel, the length of the lobe is 3 pixels. It follows that for two successive levels the size of the lobe must increase by 1 pixel per side, and therefore the size of the kernel increases by 6 pixels. Therefore, in the first octave the 4 kernels have dimensions of 9, 15, 21, and 27. The scale value s (to be associated with each approximate point of interest (x, y)) corresponding to each octave level is, therefore, related to the size of the current filter Dim_Filter_Curr
Fig. 6.35 Graphical representation of the size of kernels in relation to the scale parameter σ (in logarithmic scale) for 3 different octaves (optional the fourth). Note the overlap of the octaves to ensure complete continuity coverage of each scale
and is calculated as follows:

s = Dim\_Filter\_Curr \cdot \frac{Scale\_Filter\_Base}{Dim\_Filter\_Base} = Dim\_Filter\_Curr \cdot \frac{1.2}{9}

In applying the NMS algorithm, in the 3D scale-space context, the points of interest extracted can be those of the second and third level of each octave, since the extreme levels are used only to interpolate and compare the potential points of interest in the spatial and scale domains (see Fig. 6.24). Therefore, considering a possible interpolation between the first two levels of the first octave, with filters of size 9 and 15, respectively, the minimum possible scale, imagining an intermediate filtering with a filter of dimension Dim_Filter_Curr = 12, would result in s = 12 · (1.2/9) = 1.6. With analogous reasoning we calculate the largest scale, relative to the first octave, by interpolating halfway between the images filtered with filters of dimensions 21 and 27, obtaining the coarser scale value of s = 24 · (1.2/9) = 3.2. It follows, after 3D interpolation, that the finer scale is s = 1.6 and the coarser one is s = 3.2, covering a scale interval of a factor of 2 for the whole octave. The same reasoning is applicable to the other octaves. For each subsequent octave, the kernel size increment doubles, going from 6 to 12, 24 and up to 48 (see Fig. 6.35). At the same time, the sampling intervals for the extraction of the points of interest can also be doubled for each new octave. This involves a reduction of the calculation times with a consequent loss in accuracy which is, however, comparable with that of the methods based on the subsampling of the image. The number of octaves used in SURF is 3, with optionally a fourth. Figure 6.35 shows the 4 kernel sizes of the second and third octaves. The octaves partially overlap so as to cover all possible scales with continuity and to improve the quality of the interpolation results.
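A small sketch of the filter sizes and of the corresponding scales s = L · 1.2/9 for the first octaves follows, using the doubling of the size increment described above; the rule that each octave starts from the second filter size of the previous one is an assumption consistent with Fig. 6.35.

```python
def surf_filter_sizes(num_octaves=3):
    """Per-octave (filter_size, scale) pairs: 9-15-21-27, 15-27-39-51, ...
    with scale s = size * 1.2 / 9."""
    octaves, first = [], 9
    for o in range(num_octaves):
        step = 6 * 2 ** o                  # 6, 12, 24, ...
        sizes = [first + i * step for i in range(4)]
        octaves.append([(L, L * 1.2 / 9.0) for L in sizes])
        first = sizes[1]                   # next octave starts at the 2nd size
    return octaves

# Example: surf_filter_sizes()[0] -> [(9, 1.2), (15, 2.0), (21, 2.8), (27, 3.6)]
```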
6.8.1.3 Interest Point Localization

Having constructed the scale-space with the octaves containing the maps of the potential points of interest (x, y, s) at the various scale levels, it is now necessary to select the significant ones. A first filtering is done by eliminating the points whose response falls
below a certain threshold, as they are normally generated by noise. Then the NMS algorithm is applied by analyzing the 3D scale-space as in SIFT (see Fig. 6.24). Each point of the two central maps of each octave is analyzed by comparing it with its 8 neighbors at the same level and the 9 + 9 neighbors of the corresponding adjacent upper and lower levels. The estimation of the accuracy of the location of the points of interest is realized by considering the responses of the Hessian function H(x) (that is, the determinant of the approximate Hessian matrix), where x = (x, y, s) is the point under consideration, centered on the 3 × 3 × 3 points mentioned above. The 3D interpolating function H, centered in the point under examination x = (x, y, s), is obtained by approximating with the Taylor series development truncated to the second order, as described for SIFT (Eqs. (6.40) and (6.41)). With this 3D quadratic function, we arrive at subpixel and subscale accuracy, useful in particular when interpolating in the coarse area of the scale-space (considerable difference in scale between one octave and the next). An analogous procedure for accurately locating the extreme points in the scale-space has been proposed by Lindeberg [15], based on hybrid pyramids operating in real time.
6.8.1.4 Fast-Hessian Detector Results

Figure 6.36 shows the results obtained with the SURF detector using the Fast-Hessian version obtained with the initial filter of size 9 × 9. Compared with the Harris–Laplace, Hessian–Laplace and DoG detectors, on average the same number of detected points and the same repeatability can be found. In terms of efficiency, it is 3 times faster than the DoG and 6 times faster than the Hessian–Laplace. Figure 6.37 shows the results obtained on images with affine deformations (composed of a rotation of 15°, a shear of 15° in the direction of x, and a reduction in scale by a factor of 2). Acceptable results are found even in the presence of such deformations, although the SURF algorithm is invariant only to scale and rotation.
Fig.6.36 Results of the SURF Fast Hessian detector. Points of interest determined for three different types of images
Fig.6.37 SURF detector results in the presence of affine deformations. a The left image is deformed with a 15◦ rotation and a 15◦ shear in the x-direction; b Result of the registration on the basis of homologous points found in a; c The left image is deformed as in a but also scaled by a factor of two; d Result of the image registration of the case (c)
6.8.2 SURF Descriptor Component

With the detector component of the SURF algorithm, for each point of interest the scale value s and its position (x, y) in the image were calculated. Now, for each point of interest found, a local descriptor is defined with the characteristic of being invariant to rotation (with planar approximation) and invariant to limited changes in viewpoint and illumination. The goal is to define a unique and robust description of each point of interest through the local information of its surroundings. The SURF descriptor describes the intensity distribution of the local region centered at the point of interest based on the Haar transform, while SIFT and other approaches are based on local gradient information. In particular, SURF uses the Haar transform (see Sect. 2.9) to calculate the dominant direction of the point of interest by exploiting computationally efficient integral images. The SURF descriptor is calculated in two steps. In the first, the dominant direction of each point of interest is estimated and in the second, for each of these, a descriptor vector of 64 elements is generated which describes the local information.
6.8.2.1 Calculation of the Dominant Direction

The dominant direction is calculated using Haar's wavelet transform in the directions x and y, having as support a circular window of radius 6·s (where s is the scale at which the point of interest was determined) centered at the point of interest (see Fig. 6.38a). The Haar transform is calculated for each point of this circular window at the current scale s. This filtering operation is made fast by using the integral image property. The wavelet filters used are shown in Fig. 6.38b. The transform is realized with 6 operations to obtain the filter response in the directions x and y, whatever the current scale. The size of the wavelet filters is 4·s. The filter responses are then weighted with a Gaussian function (with σ = 2·s) centered at the point of interest. The responses are shown in Fig. 6.38c as points in the Cartesian plane (x, y), reporting the value of the transform along x on the homologous axis and similarly for the response along the y-axis. The points closest to the point of interest have the greater weight due to the Gaussian function.
Fig. 6.38 Determination of the dominant direction. a For each point of interest the pixels included in the circular window with a radius of 6 · s are analyzed. b For each of these pixels, the Haar transform is applied to calculate the response in the direction of x and y using the Haar box filters of size 2 · s; c The dominant direction is determined through a conical window with an opening of π/3 that weighs with a Gaussian function of 2σ Haar’s responses included in this conical window that slides around the point of interest. The dominant direction is the resulting one with a greater total weight (purple arrow)
Fig. 6.39 Construction of the SURF descriptor vector. a Display of some windows of the descriptor at different scales centered in the point of interest and oriented in the dominant direction calculated in the first step; b The descriptor of each point of interest is defined by centering on each of them a square-oriented window of size 20·s divided into 4 subregions each of 5×5 pixels to which the Haar transform is applied. Each subregion is characterized by Haar’s wavelet responses accumulated in 4 summations of the terms dx , |dx |, d y , and |d y | obtaining a total of 64 terms for each descriptor considered for each point of interest (shown in the final result, for simplicity, subdivisions of 2 × 2)
The estimated dominant direction is given by the sum of the responses computed within a window (circular sector) that rotates, covering an angle of π/3. The responses within the circular sector along the x-axis are summed, and similarly those along the y-axis. These two resulting sums give a local orientation vector. The vector with the largest module (that is, the longest one) is chosen as the vector that defines the dominant direction of the point of interest in question. The angular opening of the circular sector must be suitably chosen: small values associate the direction of the gradient as the dominant direction, while large values produce meaningless directions.
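A sketch of the sliding-sector estimate of the dominant direction follows; dx and dy are assumed to hold the already Gaussian-weighted Haar responses of the samples in the circular neighborhood, and the number of sector positions is an arbitrary discretization.

```python
import numpy as np

def dominant_direction(dx, dy, window=np.pi / 3, steps=42):
    """Sliding-sector estimate of the dominant orientation.

    dx, dy: 1D arrays of (Gaussian-weighted) Haar responses of the samples
    in the circular neighbourhood of radius 6s.  A sector of opening pi/3
    is rotated around the point; the orientation of the longest summed
    response vector wins."""
    angles = np.arctan2(dy, dx)
    best_angle, best_len = 0.0, -1.0
    for start in np.linspace(-np.pi, np.pi, steps, endpoint=False):
        # responses whose angle falls inside the sector [start, start + pi/3)
        diff = (angles - start) % (2 * np.pi)
        inside = diff < window
        sx, sy = dx[inside].sum(), dy[inside].sum()
        length = sx * sx + sy * sy
        if length > best_len:
            best_len, best_angle = length, np.arctan2(sy, sx)
    return best_angle
```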
Fig.6.40 Descriptor behavior for 3 local intensity variation configurations considering a subregion. In the case of a homogeneous region (left), all 4 summation values are low, in the presence of texture (frequencies) in the direction of x (central figure) the value of the sum |dx | is high while all other values are low; if the intensity gradually increases in the direction of x (right), both summation values dx and |dx | are high while the other two are low
6.8.2.2 Calculation of the Descriptor Vector

The descriptor vector is constructed by initially considering a square window of size 20·s centered at the point of interest and oriented in the dominant direction previously calculated (see Fig. 6.39a). Subsequently, it is divided into 16 square subregions and for each of these the Haar wavelet transform (of size 2·s) is applied at 5 × 5 uniformly spaced points. We indicate with d_x and d_y the responses of the transform (defined with respect to the point of interest under examination with the oriented window) along the respective axes x and y. The integral image is not considered in this transformation. To improve the accuracy of localization and the robustness with respect to geometric transformations, the responses d_x and d_y (relative to the 25 points of each subregion) are weighted with a Gaussian function (with σ = 3.3·s) centered on the point of interest in question. For each of the 16 subregions, the horizontal and vertical wavelet responses and their absolute values are summed (see Fig. 6.39b), thus producing a descriptor vector of 4 elements:

v = \left( \sum d_x,\; \sum |d_x|,\; \sum d_y,\; \sum |d_y| \right)   (6.56)

Equation (6.56) is applied for all 16 subregions, and concatenating all the elements together we obtain a descriptor vector of 4 × 16 = 64 elements that characterize the point of interest in question. The SURF descriptor, based on the wavelet responses, is invariant to a translation (additive offset) of the intensities, while the invariance to contrast (change of intensity up to a scale factor) is achieved by normalizing the descriptor vector to unit length. Figure 6.40 shows the descriptor behavior for 3 local intensity variation configurations within a subregion. In the case of a homogeneous region, all 4 summation values are low; in the presence of frequencies in the direction of x, the value of the sum Σ|d_x| is high while all other values are low. If, on the other hand, the intensity gradually increases in the direction of x, both summation values Σd_x and Σ|d_x| are high while the others are low. More complex local configurations would be distinguishable thanks to the information captured in the descriptor vector of 64 elements.
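A sketch of (6.56) applied to all 16 subregions follows, assuming the (rotated, weighted) Haar responses are available as 20 × 20 arrays of samples dx and dy; the final unit normalization provides the contrast invariance mentioned above.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """64-element SURF-style descriptor from 20x20 arrays of (weighted,
    rotated) Haar responses dx, dy: for each of the 4x4 subregions of 5x5
    samples accumulate (sum dx, sum |dx|, sum dy, sum |dy|) as in (6.56)."""
    v = []
    for i in range(4):
        for j in range(4):
            bx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            by = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            v += [bx.sum(), np.abs(bx).sum(), by.sum(), np.abs(by).sum()]
    v = np.array(v)
    return v / (np.linalg.norm(v) + 1e-12)    # contrast invariance
```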
Compared to the SIFT descriptor, this descriptor is more robust to noise due to its peculiarity of not considering the direction of the gradient individually. The SURF descriptor can be extended to vectors of up to 128 elements, even if experimental results show limited advantages with the disadvantage of an increased computational load. A version of SURF called U-SURF is used without calculating the dominant direction of each point of interest. Experimentally, U-SURF has proved to perform well for rotations within 15° in object recognition applications. SURF was also tested with descriptors of 36 elements, corresponding to the use of 9 subregions. In this case, a degradation of the performances is noted, with the benefit of a lower computational load, in particular in the calculation of the correspondences due to the halved length of the descriptors. The authors of SURF also tested a version in which the Σd_x and Σ|d_x| summations were calculated separately for d_y < 0 and d_y > 0. Similarly, the summations Σd_y and Σ|d_y| were subdivided according to the sign of d_x, thus doubling the number of features of the descriptor vector. The results of this version have shown a better distinctiveness of the points and a good computational speed, but a smaller number of correspondences due to the high dimensionality of the descriptor. Color images can be used in SURF with a factor-of-3 increase in the length of the descriptor vector. The dimensionality of the descriptors, in the case of color images, could be reduced by applying the principal component analysis transform (PCA transform, see Sect. 2.10.1). The use of the PCA instead of the normal SURF descriptors does not produce any advantage.
6.8.2.3 Fast Addressing for the Correspondence Problem

We recall that finding a correspondence between two descriptors is a matter of testing if the distance between the two vectors is sufficiently small. An important improvement can be obtained, in support of the correspondence problem, by using the sign of the Laplacian, which can be calculated during the detection of the points of interest. The sign of the Laplacian corresponds to the trace of the Hessian matrix:

\nabla^2 L = tr(H) = L_{xx}(\mathbf{x}, \sigma) + L_{yy}(\mathbf{x}, \sigma)   (6.57)
Therefore, since the Hessian matrix is calculated for each point by the SURF detector, preserving at this stage the sign of the Laplacian has no additional cost. The sign of the Laplacian distinguishes locally whether the region centered at the point of interest in question is bright on a dark background or, vice versa, dark on a bright background (see Fig. 6.41). This information, used during the correspondence evaluation phase, is important to verify first whether the potential corresponding points under examination have the same contrast or not. With this simple information, the comparison of the descriptors to evaluate the correspondence is made only if they have concordant sign, with the considerable advantage of reducing the computational cost and without affecting the ability to distinguish the descriptors.
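A sketch of a matcher that uses the Laplacian sign as a pre-filter before comparing descriptors is shown below; the ratio test is the same one used for SIFT and the array layout is assumed.

```python
import numpy as np

def match_with_sign(desc1, lap_sign1, desc2, lap_sign2, ratio=0.8):
    """Nearest-neighbour matching that compares two descriptors only when
    their Laplacian signs agree (same bright/dark local contrast)."""
    matches = []
    for i, (d, s) in enumerate(zip(desc1, lap_sign1)):
        candidates = np.where(lap_sign2 == s)[0]   # same-contrast keypoints only
        if candidates.size < 2:
            continue
        dist = np.linalg.norm(desc2[candidates] - d, axis=1)
        order = np.argsort(dist)
        if dist[order[0]] < ratio * dist[order[1]]:
            matches.append((i, candidates[order[0]]))
    return matches
```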
Fig. 6.41 Pre-filtration of potentially nonhomologous pairs of points of interest. The verification of their correspondence is avoided if their contrast is inconsistent: clear region on a dark background against dark region on a light background
6.8.2.4 Results of the SURF Detector-Descriptor

SURF and SIFT have comparable overall performance, although SURF is faster (by at least a factor of 3). SURF achieves good results on blurred and rotated images but is less efficient under changes of lighting and viewpoint. The evaluation of the SURF descriptor is based on a recall-precision criterion which considers the correct and false correspondences found between potential homologous points of interest in two images. The evaluation is represented graphically by plotting the values of recall against precision. Recall is the ratio between the number of correct matches and the total number of possible correspondences; the precision axis actually reports 1 − precision, that is, the ratio between the number of false matches and the sum of false and correct matches. The metric to evaluate the similarity of the descriptor vectors is based on the Euclidean distance, adopting an adequate threshold. Another methodology (nearest-neighbor ratio) for assessing similarity is based on the distances of a point of interest from its potential corresponding points: a pair is considered corresponding if the ratio between the distance of the nearest candidate and that of the second nearest is less than a certain threshold. The latter method generates a single correct match for each point of interest, resulting more precise than the first method. The evaluations are always to be considered not absolute, since they depend a great deal on the experimental data. SURF has been tested in various applications. The FH detector has been tested for the self-calibration of cameras in 3D reconstruction applications. The SURF descriptor has given good results in object recognition applications. Figure 6.42 shows some results of the SURF algorithm for the detection of corresponding points between pairs of stereo images of identical size and a pair of images with different sizes. In the stereo pairs of figures (a), (b), and (c), respectively, (624, 603), (216, 206), and (657, 660) points are detected, of which between 75 and 80% are correct. For graphic clarity, 373, 110, and 250 lines are displayed, respectively, to indicate the corresponding points. Transversal lines occur when the correspondences are not correct. In figure (d), the stereo pair has different dimensions (left image reduced by a factor of 2) for the scene of figure (a). (199, 603) points are determined, of which the first 82 matches of the 199 candidates are displayed. Some incorrect matches are observed (intersecting lines).
Fig. 6.42 Results of the SURF algorithm applied to determine the correspondence between stereo image pairs for different scenes (a), (b), and (c); In d, instead the correspondence is determined between the image of the left reduced in scale of a factor two while that of the right remains with the original dimensions
6.8.3 Harris–Laplace Detector

In analogy to the SIFT detector of points of interest (invariant to scale and rotation), described in Sect. 6.7.1, also with the Harris–Laplace detector [14,26] the points of interest are detected in the scale-space structure. The Harris function is used to localize the potential points of interest in the multiscale images, while subsequently those points are selected for which the Laplacian simultaneously reaches a maximum value across the various scales. The combined approach of using the Harris function, to locate the points through a measure of cornerness in the image domain, and the multiscale Laplacian function, to simultaneously select those that reach a local maximum in the scale-space domain, allows the detection of points of interest that are more robust with respect to scale, lighting, rotation, and noise. The scale-space images \hat{g}_{LoG}(x, y, σ) are obtained by applying to the initial image I(x, y) the convolution with the normalized kernel of the Laplacian of Gaussian \hat{h}_{LoG}(x, y, σ) given by (6.31), varying the scale parameter σ:

\hat{g}_{LoG}(x, y, \sigma) = \hat{h}_{LoG}(x, y, \sigma) \ast I(x, y)   (6.58)

The potential points of interest are instead located, at the various scales, with the Harris function based on the moments of the second order, defined by (6.14), which in the multiscale formulation is rewritten as follows:

C(x, y; \sigma_I, \sigma_D) = \det[A(x, y; \sigma_I, \sigma_D)] - \alpha\,[tr(A(x, y; \sigma_I, \sigma_D))]^2   (6.59)
where the auto-correlation matrix A, given by (6.11), in the multiscale context [14,26] is rewritten as follows:

A(x, y; \sigma_I, \sigma_D) = \sigma_D^2\; h_G(x, y; \sigma_I) \ast \begin{bmatrix} \sum_{x,y \in W} L_x^2(x, y; \sigma_D) & \sum_{x,y \in W} L_x(x, y; \sigma_D) L_y(x, y; \sigma_D) \\ \sum_{x,y \in W} L_x(x, y; \sigma_D) L_y(x, y; \sigma_D) & \sum_{x,y \in W} L_y^2(x, y; \sigma_D) \end{bmatrix}   (6.60)

where

(a) the partial derivatives of the image I_x, I_y have been replaced with the partial derivatives of the Gaussian images at the derivation scale σ_D, which for graphic simplicity are indicated with

L_x(x, y; \sigma_D) = h_{G_x}(x, y; \sigma_D) \ast I(x, y) \qquad L_y(x, y; \sigma_D) = h_{G_y}(x, y; \sigma_D) \ast I(x, y)   (6.61)

where h_{G_x}(x, y; σ_D) and h_{G_y}(x, y; σ_D) indicate the kernels of the partial derivatives of the Gaussian function (with scale parameter σ_D) with which the initial image is convolved. σ_D in this context constitutes the scale parameter called derivation scale, at which the derivatives of the Gaussian images are calculated for the search of the points of interest on various scales; in essence, the resolution of the information content of the image depends on the derivation scale: as this scale parameter increases, the smoothing process tends to permanently eliminate the intrinsic details of the image.

(b) h_G(x, y; σ_I) is the Gaussian weight function with standard deviation σ_I which defines the window size W.

(c) σ_I indicates the standard deviation of the Gaussian window W over which the information of the moments of the second order is accumulated. In this multiscale context, in fact, the size of the window W must harmonize with the derivation scale determined by σ_D. The scale parameter σ_I is called the scale of integration. The relationship between the two scale parameters can be calculated automatically [14,26] but is normally defined by σ_I = γ σ_D with the value of γ between 2 and 4.

(d) The scale parameter σ_I determines the current scale at which the points of interest with the Harris cornerness measure are detected in the Gaussian scale-space.

(e) σ_D^2 represents the normalization factor to ensure consistency between the different scales.
The Harris–Laplace approach, therefore, uses the multiscale Harris function to detect potential points of interest at all levels of the scale-space constructed with Laplacian images (unlike the SIFT method, which uses DoG images) and then selects, among these points, those that simultaneously reach a maximum value at the various scales in the normalized LoG images (determination of the characteristic scale for each final point of interest). The scale-space resolution is predefined by the scale parameter σ_n = k^n σ, n = 1, 2, ..., where k = 1/γ is the scale factor between two successive levels of the scale-space. The localization of the points of interest in the LoG images of the scale-space is carried out by selecting the local maxima with the approach illustrated in Fig. 6.24.
Fig. 6.43 Results of the Harris–Laplace and Harris multiscale detectors. a Result on the image resized by a factor of 2; b result on the original 256 × 256 image; c result on the reduced image rotated by 45°; d result on the original image rotated by 45°; and e result of the Harris multiscale detector applied to the original image
This approach compares the point in question with the 8 neighboring points in the same LoG image and with the 9 + 9 neighboring points belonging to the adjacent upper and lower LoG images (in this case the scale involved is the integration scale). To filter unstable points, thresholds appropriate to the application context are determined experimentally: very low values of the cornerness measure are discarded, and it is then verified whether the selected point falls below a predefined threshold in the normalized LoG image. Figure 6.43 shows some results of the Harris–Laplace detector applied to the original image, to the image resized by a factor of 2, and to both rotated by 45°. It should be noted that with this approach fewer points of interest are detected, because of the greater selectivity in seeking more significant points. This is essentially due to the fact that each point of interest must satisfy two conditions: it must have a maximum value of cornerness in the image domain and, simultaneously, it must be a local maximum in the scale-space characterized by the derivation scale σ_D. This can be verified, for example, by simply using the Harris multiscale detector (see Fig. 6.43e), which searches for points of interest at different scales and consequently detects the same point several times, precisely because it lacks the filtering obtained with the selection of the characteristic scale in the Harris–Laplace detector. Some variants of the Harris–Laplace detector have been implemented so as not to reduce the points of interest excessively, which can be a problem in some object recognition and stereo vision applications for the reconstruction of the depth map.
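As a complement, the following hedged sketch shows only the scale-selection step: candidate locations found at each level are kept if the scale-normalized LoG response at that location is a local maximum with respect to the adjacent scales. The data layout and the threshold value are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scales(image, candidates, sigmas, log_threshold=10.0):
    """Keep (row, col, sigma) candidates whose normalized LoG, sigma^2 * |LoG|,
    is a local maximum along the scale axis (Harris-Laplace selection step).

    candidates: dict mapping scale index n -> list of (row, col) points
    sigmas: increasing list of scale values sigma_n = k^n * sigma_0
    """
    image = image.astype(float)
    # Scale-normalized LoG stack, one image per scale (cf. (6.58))
    log_stack = np.array([s**2 * np.abs(gaussian_laplace(image, s)) for s in sigmas])
    keypoints = []
    for n, points in candidates.items():
        if n == 0 or n == len(sigmas) - 1:
            continue  # both neighboring scales are needed for the comparison
        for (r, c) in points:
            v = log_stack[n, r, c]
            if v > log_threshold and v > log_stack[n - 1, r, c] and v > log_stack[n + 1, r, c]:
                keypoints.append((r, c, sigmas[n]))  # characteristic scale found
    return keypoints
```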
6.8.4 Hessian–Laplace Detector

In analogy with the Harris–Laplace detector, applying the same basic scheme but using the Hessian detector (described in Sect. 6.5) to determine the potential points of interest, we obtain the Hessian–Laplace detector [27]. The detector uses the Hessian matrix to
locate points in the image domain, and the Laplacian function to calculate their characteristic scale. In this context, it is necessary to redefine the Hessian matrix (6.27) for a multiscale Hessian detector, given by

Hessian(I) = \begin{bmatrix} L_{xx}(x, y, \sigma_D) & L_{xy}(x, y, \sigma_D) \\ L_{xy}(x, y, \sigma_D) & L_{yy}(x, y, \sigma_D) \end{bmatrix}   (6.62)

where L_{xx}(x, y, σ_D), L_{yy}(x, y, σ_D), and L_{xy}(x, y, σ_D), given by (6.61), are the second derivatives of the image I(x, y) convolved with a Gaussian kernel with standard deviation σ_D. The determinant of the Hessian matrix is given by

\mathrm{Det}(Hessian(L(x, y, \sigma_D))) = DoH(L(x, y, \sigma_D)) = L_{xx}(x, y, \sigma_D) L_{yy}(x, y, \sigma_D) - L_{xy}^2(x, y, \sigma_D)   (6.63)
Recall that the second derivatives used in the Hessian matrix are useful for characterizing the local geometry around the points of interest. In fact, the eigenvectors of the Hessian matrix give the directions of the principal curvatures (minimum and maximum), while the associated eigenvalues give the magnitude of the curvatures in the directions of the eigenvectors. The scale-space is constructed in a way similar to that of the Harris–Laplace detector, based on the Laplacian images defined by the derivation scale parameter σ_D, which differs by a certain scale ratio (for example, 1.4) between successive images. Since the Hessian matrix must detect the extreme points in the scale-space images (i.e., at the various scales), the matrix must be invariant to scale. It follows that the Hessian matrix must be normalized with respect to the scale σ_D at which the image is represented. The potential points of interest are determined at the various scales with the following normalized function, based on the determinant of the Hessian of the Laplacian scale-space function:

\hat{DoH}(L(x, y, \sigma_D)) = \sigma_D^4\,[L_{xx}(x, y, \sigma_D) L_{yy}(x, y, \sigma_D) - L_{xy}^2(x, y, \sigma_D)]   (6.64)

where σ_D^4 is the normalization factor that makes the cornerness measure expressed by (6.64) scale invariant. For each image of the scale-space, the points are extracted by comparing the cornerness measure given by \hat{DoH}(L(x, y, \sigma_D)) at the point under examination with its 8-neighbors. The point in question is considered a potential point of interest (its location in the image plane is saved) if its cornerness value is greater than that of its 8 neighbors and exceeds a certain minimum threshold, used to filter points with very low maximum values (regions with little significant contrast). A potential point of interest can be detected several times in the scale-space. So far we have described the Hessian multiscale detector (see Fig. 6.44). Once the potential points of interest in the 2D domain have been determined for each image of the scale-space, the next step is to determine their characteristic scale, which solves the problem of multiple point selection. The latter is the scale at which the image region centered at the point of interest yields a maximum (or minimum) of the information content. To determine this scale, for each detected point (x, y, σ_D), we could select the local maxima of the cornerness function expressed by (6.64). Experiments have shown, however, that the most effective approach is the one used for the Harris–Laplace detector.
Fig. 6.44 Results of the Hessian multiscale detector (first row) and Hessian–Laplace detector (second row). Application of the Hessian multiscale detector: a original image; b original image reduced by a factor of 2; c original image rotated by 45°; and d rotated and reduced image. The results of the Hessian–Laplace detector, shown in e, f, g, and h, are applied to the same images as the Hessian multiscale detector
Namely, the potential points of interest, previously detected independently at each scale, are filtered by analyzing the normalized LoG images in the scale-space and choosing the points that reach extreme values along the axis of the scale parameter. For a point located at (x, y) in a scale-space image with a local cornerness maximum, the value of the Laplacian is analyzed along the vertical axis of the σ_D scale, and the scale at which the Laplacian reaches a local maximum is assigned as the characteristic scale for that point. The local maximum is established by comparison with the adjacent scales, and it must also exceed a predetermined minimum threshold. If several local maxima are found for some points, several characteristic scales are associated with them. The normalized multiscale Laplacian function used to select the characteristic scale of each point is given by (6.58), which, with the current notation, we
can rewrite as

\hat{g}_{LoG}(x, y, \sigma_D) = \hat{h}_{LoG}(x, y, \sigma_D) * I(x, y) = \sigma_D^2\,|L_{xx} + L_{yy}| = \sigma_D^2\,|\mathrm{tr}(Hessian(L(x, y, \sigma_D)))|   (6.65)
where we see the advantage of using the Hessian matrix in this case: the Laplacian function can be computed directly from the trace of the Hessian matrix. The accuracy of the localization of the points of interest determined in the scale-space images is compromised by the smoothing introduced by the Gaussian kernel at the different scales. In the literature, iterative algorithms are reported [2] that reduce the localization uncertainty using a 3D quadratic functional, but this is possible only when the localization of the point is performed by operating simultaneously in the spatial and scale domains. For the detector described here this is not possible, considering that the two functions used, the determinant of the Hessian matrix (which localizes the point in the image domain) and the normalized multiscale Laplacian function (which selects the characteristic scale), are separate. Figure 6.44 shows, at a qualitative level, the results of the Hessian–Laplace detector applied to the same image (scaled down by a factor of 2 and rotated by 45°) used for the multiscale Hessian detector. It highlights how significant regions are well detected in the scaled and rotated images, demonstrating the invariance capabilities of the detector. It can also be observed that the Hessian–Laplace detector, compared to the simple multiscale Hessian detector, localizes the regions of interest at the various scales more accurately and filters redundant regions. Finally, the Hessian–Laplace detector yields a greater number of points of interest than Harris–Laplace, but with lower repeatability.
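As an illustration only (not the authors' implementation), the following sketch computes, for one derivation scale, the two quantities that drive the Hessian–Laplace detector: the scale-normalized determinant of the Hessian (6.64) for localization, and the scale-normalized Laplacian obtained from the trace of the Hessian (6.65) for scale selection.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_laplace_responses(image, sigma_d):
    """Return (normalized DoH, normalized |Laplacian|) at scale sigma_d."""
    image = image.astype(float)
    # Second derivatives of the Gaussian-smoothed image at scale sigma_d
    Lxx = gaussian_filter(image, sigma_d, order=(0, 2))
    Lyy = gaussian_filter(image, sigma_d, order=(2, 0))
    Lxy = gaussian_filter(image, sigma_d, order=(1, 1))
    # (6.64): sigma_d^4 makes the determinant of the Hessian scale invariant
    doh = sigma_d**4 * (Lxx * Lyy - Lxy**2)
    # (6.65): the Laplacian is the trace of the Hessian, normalized by sigma_d^2
    log_norm = sigma_d**2 * np.abs(Lxx + Lyy)
    return doh, log_norm
```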
6.9 Affine-Invariant Interest Point Detectors

The point of interest detectors described in the previous paragraphs are invariant to translation, rotation, and scale. In several applications, for example when objects are observed from very different points of view, the geometry of the local surface in the images cannot be described considering only changes of scale and rotation. Moreover, the local characteristics of the regions centered on the points of interest can present a complex, strongly anisotropic texture, with structures oriented in different directions and at various scales. In essence, every elementary region (patch) around a potential point of interest should be correctly treated in terms of perspective geometry. This operation is complex and computationally expensive. One solution is to simplify the perspective correction by approximating it with affine geometric transformations (described in Chap. 3, Sect. 3.4). With the affine transformation, the geometric model of rotation and scale invariance, represented by a circular region, is transformed into an elliptical region characterized by the local properties of the image (see Fig. 6.45).
Fig. 6.45 The need for a detector with affine invariance to compare image regions related by affine transformations. a Part of the image in d deformed with an affine transformation (obtained with scale reduction and a shear of 25° along the x axis). The circular region, centered at the point of interest, can be approximated and geometrically characterized, after the affine transformation, by an ellipsoidal region in the transformed image b. A detector is proposed in which the local affine deformation is modeled through an iterative process that transforms the initial circular region into an ellipse (figures b and e) and then re-transforms this approximate affine region into a circular, scaled (normalized) form, used as a reference shape for evaluating the correspondence of the points (figures c and f)
6.9.1 Harris-Affine Detector

Mikolajczyk and Schmid [14,26] proposed an interest point detector based on the multiscale Harris algorithm and invariant with respect to affine transformations. In the literature it is known as the Harris-Affine detector, and it can be considered a combination of the Harris detector, the scale-space approach based on the Gaussian function, and an iterative algorithm for adapting affine shapes to isotropic regions of the image (circular shapes, see Fig. 6.45c, f).
6.9.1.1 Affine Adaptation

Let us now see the mathematical foundations needed to model a generic affine-invariant transformation starting from the available information, i.e., the matrix of second-order moments calculated on the region centered at each point of interest. An anisotropic region is modeled locally, through the matrix A of the second-order moments, which in its more general form, in the affine Gaussian scale-space, can be defined as follows:

A(\mathbf{x}; \Sigma_I, \Sigma_D) = \det(\Sigma_D)\, h_G(\mathbf{x}; \Sigma_I) * [\nabla L(\mathbf{x}; \Sigma_D) \nabla L(\mathbf{x}; \Sigma_D)^T]   (6.66)

where

h_G(\mathbf{x}; \Sigma_I) = \frac{1}{2\pi \sqrt{\det \Sigma_I}}\, e^{-\frac{\mathbf{x}^T \Sigma_I^{-1} \mathbf{x}}{2}}   (6.67)

\nabla L(\mathbf{x}; \Sigma_D) = \nabla h_G(\mathbf{x}; \Sigma_D) * I(\mathbf{x})   (6.68)
The derivation scale σ_D and the integration scale σ_I are no longer expressed in scalar form but, respectively, by the covariance matrices Σ_D and Σ_I, of size 2 × 2. Note that ∗ indicates the convolution operator and x = (x, y) indicates the coordinates of the points in vector form. Apparently, (6.66) would seem very different from (6.60), used for the Harris–Laplace detector. In reality, the matrix defined by (6.60) is the version that models an isotropic circular region, where the covariance matrices reduce to 2 × 2 identity matrices multiplied by the scalar factors σ_D and σ_I. In the version given by (6.66), the two matrices can be considered as the kernels of a multivariate Gaussian distribution, as opposed to a uniform (isotropic) Gaussian kernel representing a circular region. With the generalized version (6.66), a region can be modeled by an elliptical shape given by

(\mathbf{x} - \mathbf{x}_c)^T A (\mathbf{x} - \mathbf{x}_c) = 1   (6.69)

centered at x_c, with the principal axes corresponding to the eigenvectors of A (which give the orientation of the ellipse). The lengths of the semi-axes are equal to the square roots of the inverses of the respective associated eigenvalues (which define the size of the ellipse). The elliptical shape is, therefore, more adequate for modeling the geometry of the region at the characteristic scale. The goal of an affine-invariant detector is to identify regions of interest in the image whose geometries are related by affine transformations. Let us now consider a region centered at the point of interest x_L and the affine transform of this point, x_R = B x_L, where B is the affine transformation matrix. According to [28], the matrices of the second-order moments, relative to the considered region and to its affine transform, indicated for brevity with

A_L = A(\mathbf{x}_L; \Sigma_{I,L}, \Sigma_{D,L}) \qquad A_R = A(\mathbf{x}_R; \Sigma_{I,R}, \Sigma_{D,R})

are related to each other as follows:

A_L = \mathbf{B}^T A_R\, \mathbf{B} \qquad \Sigma_{I,R} = \mathbf{B}\, \Sigma_{I,L}\, \mathbf{B}^T \qquad \Sigma_{D,R} = \mathbf{B}\, \Sigma_{D,L}\, \mathbf{B}^T   (6.70)

where Σ_{I,L} and Σ_{I,R} are the covariance matrices relative, respectively, to the region considered for the point x_L and to its affine transform. At this point, recalling (6.69), which describes the elliptic geometry of A, we make the following consideration. If for a generic region centered at the point x_c we have an estimate of the matrix A that represents the local geometry of this region, it is possible to normalize such a region, keeping it centered at x_c, through a geometric transformation described by a transformation matrix U with the property that the elliptic geometry associated with A becomes a circular geometry with unit radius. The normalization relation is the following:

\hat{\mathbf{x}} = U\mathbf{x} = A^{1/2}\mathbf{x}   (6.71)

having considered the center of the region at the origin of the coordinate system. A^{1/2} is any matrix such that A = (A^{1/2})^T (A^{1/2}). We know that the matrix A is symmetric and positive definite, and therefore diagonalizable, obtaining A = Q^T D Q, where Q
is an orthogonal matrix and D a diagonal matrix with positive values. It follows that the normalization matrix can be given by U = A^{1/2} = Q^T D^{1/2} Q. Let us now return to the two regions with elliptic geometry, related by the affine transformation x_R = B x_L and described, by virtue of (6.69), by the two ellipses

\mathbf{x}_L^T A_L\, \mathbf{x}_L = 1 \qquad \text{and} \qquad \mathbf{x}_R^T A_R\, \mathbf{x}_R = 1

Being able to calculate the respective matrices of the second-order moments, A_L and A_R, by virtue of (6.70) these regions can be normalized, transforming their initial anisotropy into isotropic regions, as follows:

\hat{\mathbf{x}}_L = A_L^{1/2}\,\mathbf{x}_L \qquad \hat{\mathbf{x}}_R = A_R^{1/2}\,\mathbf{x}_R   (6.72)
In reality, the affine transformation matrix B that relates the two regions associated with the points x_R and x_L is not known. In fact, in the equation A_L = B^T A_R B from (6.70), A_L and A_R are known while the affine matrix B must be determined. It is shown in [29] that this equation does not have a unique solution:

\mathbf{B} = A_R^{-1/2}\, R\, A_L^{1/2}   (6.73)

where R is an arbitrary orthogonal transformation matrix, a combination of rotation and reflection. Equation (6.73) can be rewritten as follows:

A_R^{1/2}\, \mathbf{B}\, \mathbf{x}_L = R\, A_L^{1/2}\, \mathbf{x}_L   (6.74)

and, by virtue of (6.72), we obtain the relation

\hat{\mathbf{x}}_R = R\, \hat{\mathbf{x}}_L   (6.75)
which establishes the relation between the normalized coordinates of the two regions associated, respectively, with the points x̂_R and x̂_L. The following result has thus been obtained (see also in Fig. 6.46 the graphic representation of the normalization/deskewing process): if the coordinates of two corresponding points of interest are related by an affine transformation, then the respective normalized coordinates are related by an arbitrary orthogonal matrix, i.e., a combined transformation of rotation and reflection. This does not yet allow a complete registration between the two regions. In practice, using the gradient information in the normalized coordinate system, the relationship between the two corresponding regions is resolved by computing the pure rotation, as done for the SIFT descriptor, aligning them with the dominant orientation of the gradient. To solve instead the problem of estimating, as accurately as possible, the local geometry of the regions, it is necessary to compute the matrix that describes the local shape. This is done through an iterative approach called Iterative Shape Adaptation, which transforms the image regions into a reference system of normalized coordinates. As described for the Harris detector, we know that the eigenvalues and eigenvectors associated with the matrix of second-order moments A(x, Σ_I, Σ_D) characterize the local geometry (curvature and shape) of an image region.
Fig. 6.46 Scheme of the affine normalization/deskewing process realized using the matrices of the second-order moments, thus reducing the problem to the rotation of the elliptic regions and to their rescaling to obtain the circular reference shape
In fact, the eigenvectors associated with the larger eigenvalues indicate the directions where there are large variations in intensity, while those associated with smaller eigenvalues exhibit less variation. From the geometric point of view, the analysis of the eigenvalues in the 2D context leads to an ellipse described by (6.69). An isotropic region tends to have a circular shape, and from the analysis of the eigenvalues we observe that their values are identical. A measure of the local isotropy Q of a region can be determined as follows:

Q = \frac{\lambda_{min}(A)}{\lambda_{max}(A)}   (6.76)
where λ_min and λ_max are the eigenvalues of A; the isotropy measure lies in the interval between 0 and 1, with 1 corresponding to circular geometry, i.e., perfect isotropy of the local region.
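As a small numerical illustration (an assumption-laden sketch, not the authors' code), the following Python fragment computes, for a given 2 × 2 second-moment matrix A, the isotropy measure Q of (6.76), the normalization matrix U = A^{1/2} of (6.71), and the ellipse parameters implied by (6.69).

```python
import numpy as np

def shape_from_moments(A):
    """Given a 2x2 symmetric positive definite second-moment matrix A,
    return (Q, U, axes): the isotropy measure (6.76), the normalization
    matrix U = A^(1/2) (6.71), and the ellipse semi-axes of (6.69)."""
    eigvals, eigvecs = np.linalg.eigh(A)      # A = V diag(eigvals) V^T
    Q = eigvals.min() / eigvals.max()         # 1 means perfectly isotropic (circular)
    U = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T   # symmetric square root A^(1/2)
    axes = 1.0 / np.sqrt(eigvals)             # semi-axis lengths of x^T A x = 1
    return Q, U, axes

# Example: an anisotropic (elongated) region
A = np.array([[4.0, 1.0], [1.0, 1.0]])
Q, U, axes = shape_from_moments(A)
# After normalization x_hat = U x, the elliptic region becomes the unit circle:
Ui = np.linalg.inv(U)
print(Q, np.allclose(Ui.T @ A @ Ui, np.eye(2)))   # isotropy measure, True
```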
6.9.1.2 Iterative Affine Adaptation Algorithm

From the mathematical analysis described, the Harris-Affine detector is based on an iterative algorithm that computes an accurate second-order moment matrix (indicated with U, the shape adaptation matrix), which transforms an anisotropic region into a normalized reference region, controlling the level of isotropy reached with the measure Q, which should be close to 1. In each iteration, the characteristic parameters (location x, integration scale σ_I, and derivation scale σ_D) of the regions of interest are updated analogously to the Harris–Laplace detector method. The calculation of the matrix U takes place in this normalized domain at each kth iteration, and it is useful to indicate the location of the transformed point with x_w^(k).
The essential steps required by the algorithm are the following:

1. Detect the potential regions of interest with the Harris–Laplace detector and use them to initialize the matrix U^(0) = identity matrix and the characteristic parameters of the regions found, x^(0), σ_I^(0), σ_D^(0).

2. Apply the current matrix, i.e., the matrix of the preceding step U^(k−1), to normalize the window W(x_w) = I(x) centered at U^(k−1) x_w^(k−1) = x^(k−1). The affine (elliptic) region is normalized (transformed) into a circular region.

3. Redetermine the integration scale σ_I^(k) for the region centered at x_w^(k−1) with the same methodology used for the Harris–Laplace detector (i.e., select an appropriate integration scale by analyzing the extrema in the normalized LoG images of the scale-space).

4. Redetermine the derivation scale σ_D^(k) as a fraction of the integration scale: σ_D^(k) = s σ_I^(k), with the constant factor s suitably chosen in the range between 0.5 and 0.75 to keep the two scales balanced. With the scales defined, we tend to maximize the isotropy measure Q = λ_min(A)/λ_max(A), where A = A(x_w^(k−1), σ_I, σ_D).
5. Determine the new location x_w^(k) by selecting the point that maximizes the Harris cornerness measure:

\mathbf{x}_w^{(k)} = \arg\max_{\mathbf{x}_w \in W(\mathbf{x}_w^{(k-1)})} \det(A(\mathbf{x}_w, \sigma_I^{(k)}, \sigma_D^{(k)})) - \alpha\,[\mathrm{tr}(A(\mathbf{x}_w, \sigma_I^{(k)}, \sigma_D^{(k)}))]^2

where A is the autocorrelation matrix defined by (6.66) and the window W(x_w^(k−1)) indicates the set of 8-neighbors of the point of the previous iteration in the normalized system. The new point x_w^(k) must then be projected back into the original image reference frame through the adaptation matrix U:

\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} + U^{(k-1)} \cdot (\mathbf{x}_w^{(k)} - \mathbf{x}_w^{(k-1)})

6. Calculate and save the second-order moment matrix

A_i^{(k)} = A^{-\frac{1}{2}}(\mathbf{x}_w^{(k)}, \sigma_I^{(k)}, \sigma_D^{(k)})
which represents the transformation matrix for the projection into the normalization domain.

7. Update the transformation matrix U^(k) = A_i^(k) · U^(k−1) and normalize U^(k) by setting the maximum eigenvalue λ_max(U^(k)) = 1.

8. Repeat from step 2 until a stop criterion is reached, based on the analysis of the eigenvalues of the adaptation matrix that transforms an anisotropic region into an isotropic one. This is controlled with the isotropy measure Q, which approaches its maximum value of 1 at the stop condition. Experimentally [27], a sufficient stop condition was found to be

1 - \frac{\lambda_{min}(A_i^{(k)})}{\lambda_{max}(A_i^{(k)})} < \varepsilon_C

where a good result is achieved with ε_C = 0.05. A compact numerical sketch of steps 6–8 is given below.
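The following minimal sketch is illustrative only: the per-point moment matrix is supplied by a hypothetical callback `second_moment_matrix`, an assumption of this example, and the location/scale updates of steps 2–5 are omitted for brevity.

```python
import numpy as np

def inv_sqrt(A):
    """A^(-1/2) for a 2x2 symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def adapt_shape(second_moment_matrix, max_iter=10, eps_c=0.05):
    """Core of steps 6-8: update of the shape adaptation matrix U and stop test.
    `second_moment_matrix(U)` is a hypothetical callback returning the 2x2
    moment matrix measured on the patch warped by the current matrix U."""
    U = np.eye(2)                                 # step 1: U^(0) = identity
    for _ in range(max_iter):
        A = second_moment_matrix(U)               # measured in the normalized frame
        Ai = inv_sqrt(A)                          # step 6: A^(-1/2)
        U = Ai @ U                                # step 7: U^(k) = A_i^(k) U^(k-1)
        U /= np.abs(np.linalg.eigvals(U)).max()   # normalize lambda_max(U) = 1
        w = np.linalg.eigvalsh(A)
        if 1.0 - w.min() / w.max() < eps_c:       # step 8: stop when nearly isotropic
            break
    return U
```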
Fig. 6.47 Scheme of the affine normalization/deskewing process realized using the matrices of the second-order moments, thus reducing the problem to the rotation of the elliptic regions and to their change of scale to obtain the circular reference shape (explanation of the figure in the text)
Figure 6.47 shows the iterative process for detecting affine-invariant points of interest in a pair of images. Column (a) shows the initial images with the point of interest detected in both using the Harris multiscale detector, automatically selecting the characteristic scale that identifies the two circular regions. Columns (b) to (d) show instead the result of the adaptation process based on the second-order moment matrix, which in the example is iterated in three steps, obtaining a relocation of the point and an elliptical approximation of the local affine region. Column (e) reports the result of the normalization process (through rotation and change of scale) that transforms the elliptic affine region into a circular one. Subsequently, from the normalized regions, the Lowe descriptors can be computed to validate the correspondences of the points detected between pairs of images.
6.9.1.3 Performance of the Harris-Affine Detector

The convergence of the adaptation algorithm to the local structure of the point of interest normally takes place within 10 iterations. The number of iterations can vary in relation to the estimate of the initial parameters, in particular the integration scale σ_I. In [30] a comparative evaluation is reported between this Harris-Affine detector and other affine-invariant detectors. A measure of repeatability R_s is defined in [27] as the ratio between the number of corresponding points detected and the smaller of the numbers of points determined in the two images:

R_s = \frac{C(I_1, I_2)}{\min(n_1, n_2)}

where C(I_1, I_2) indicates the number of corresponding points in the two images I_1 and I_2, while n_1 and n_2 are the numbers of points determined in the respective images. A further measure is defined based on the practical evaluation of the detector's ability to identify the correspondence of points between different images.
Fig. 6.48 Results of the Harris-Affine detector (first row) and Harris–Laplace detector (second row). The images of the second column are obtained from those of the first column, deformed with a shear of 25° along the y-axis and reduced in scale by a factor of two. The images of the third column differ from those of the second column only by a further reduction in scale by a factor of four. The ellipses in the first row indicate the affine regions characterized by the Harris-Affine detector, unlike those of Harris–Laplace, which are considered circular although invariant to rotation and scale
In [30], a SIFT descriptor is used to identify the corresponding points. Besides being the closest points in the SIFT space, two homologous points must also have a sufficiently small overlap error (as defined in the repeatability measurement). The matching measure M_s is defined as the ratio between the number of corresponding points and the minimum of the total points detected in each image:

M_s = \frac{M(I_1, I_2)}{\min(n_1, n_2)}

where M(I_1, I_2) indicates the number of corresponding points in the two images I_1 and I_2, while n_1 and n_2 are the numbers of points determined in the respective images. Such measures are influenced by the diversity of the images, which depends on the amount of variation of the viewing angle, the change of scale, the level of blurring of the images, the change in the lighting conditions, and the artifacts introduced in the coding (for example, JPEG) of the images. Figure 6.48 shows the results of the
Harris-Affine detector (first row) and the Harris–Laplace detector (second row) applied to the original image house (first column), to the original image deformed with a shear of 25° along the y-axis and scaled by a factor of two (second column), and to the original image deformed with the same shear but reduced by a factor of four (third column). It should be noted that the greater number of points found with the Harris-Affine detector (obtained with the VL_FEAT software) depends on the different scale-space used compared to that of Harris–Laplace: in the first case the scale-space uses DoG images, while in the second it uses LoG images.
6.9.2 Hessian-Affine Detector

The Hessian–Laplace detector can be extended, as was done for the Harris–Laplace detector, to extract affine-covariant regions, thus obtaining the Hessian-Affine detector. The substantial difference between this detector and the Harris-Affine one lies in how the initial points of interest are found, based on the Hessian–Laplace algorithm instead of the Harris function. The iterative adaptation procedure that transforms an anisotropic region into an isotropic one for each point of interest remains exactly the same as that of the Harris-Affine detector described in Sect. 6.9.1.2. A quantitative evaluation of the two detectors is found in [30], which shows better reliability of the results with the Hessian-Affine detector compared to the Harris-Affine one. In particular, in scenes rich in linear texture the Hessian-Affine detector exhibits better performance, with particular regard to the repeatability and correspondence measures. Figure 6.49 reports the results of the Hessian-Affine detector applied to the same image used for the Harris-Affine detector and with the same deformations.
Fig. 6.49 Results of the Hessian-Affine detector. a The original image with ellipses indicating the detected affine regions; b The original image deformed with a shear of 25° along the y-axis and scaled by a factor of two; c Image that differs from image b only in the scaling factor, which is four
6.10 Corner Fast Detectors

The detectors described so far extract points of interest that can be isolated points, edges, contour points, and small homogeneous areas. All the detectors already described define the points of interest through a measure of local, directional variation of the intensity based on autocorrelation and on derivatives up to the second order (Harris and Hessian). This section describes the most widespread corner detectors whose peculiarity is being particularly fast while maintaining high repeatability. They are suitable for real-time applications and capable of processing video sequences of images.
6.10.1 SUSAN—Smallest Univalue Segment Assimilating Nucleus Detector

This point of interest detector, introduced by Smith and Brady [31], differs from the other intensity-based ones in that it does not calculate the local gradient, which is normally sensitive to noise and also requires a higher computational cost, but uses a morphological approach. It can be seen as a low-level image processing process that, in addition to detecting corners, is also used as a contour extraction and noise removal algorithm. The functional scheme is the following. A circular mask of predefined radius is placed on each pixel of the image to be tested (see Fig. 6.50a); the pixel under examination, on which the mask is centered, is called the nucleus (reference pixel). The objective is to compare the nucleus with all the pixels circumscribed by the mask M_xy, to calculate the area of the pixels that have an intensity similar to that of the nucleus located at (x, y). This area of similar pixels is called USAN, an acronym for Univalue Segment Assimilating Nucleus, and it contains local structural information of the image. In essence, the mask's pixels are grouped into two categories: those belonging to the USAN area, which are similar to the nucleus, and the dissimilar ones outside the USAN area. Figure 6.50b shows a black rectangular area and 4 circular masks located at different points. It can be seen that in configuration 1 the nucleus is on a corner and the USAN area is the smallest compared to all the other configurations. After analyzing the other configurations (2, 3, and 4) shown in the figure, we deduce the following:

(a) In uniform zones, the area similar to the nucleus corresponds to almost the entire area of the mask, close to 100% (configuration 3, see also figure d);

(b) In edge zones (configuration 2), the USAN area is lower, falling to around 50%;

(c) With the nucleus in the vicinity of a corner, the USAN area is reduced to around 25%, that is, it corresponds to a minimum area (configuration 1).

In figure (c), the masks of figure (b) are shown with the USAN areas colored in white.
Fig. 6.50 Circular masks in different positions on a simple binary image. a Approximate circular mask, discretized in 37 pixels with a 3.4 pixel radius; b Four different positions of the mask to detect local image structures; c As in b but with the USAN areas colored in white; d The 3 configurations of USAN areas, for corner, edge, and uniform zones, respectively, found as an example on a real image
In essence, the basic idea of SUSAN is to characterize every point of the image by the local area of similar brightness. The USAN area is calculated by comparing each pixel of the mask M_xy with the intensity of the nucleus at position (x, y):

n(x, y) = \sum_{(i,j) \in M_{xy}} e^{-\left(\frac{I(i,j) - I(x,y)}{t}\right)^6}   (6.77)

where I(i, j) is the generic pixel of the mask M_xy centered on the nucleus with position (x, y) and intensity I(x, y), t is the (experimentally evaluated) parameter that controls the sensitivity to noise, i.e., it defines a similarity threshold between the intensity of the mask's pixels and that of the reference one (the nucleus), and the power 6 is used to obtain an optimal cornerness measurement, as demonstrated in [31]. A quantitative measure of cornerness can be derived from the USAN area n(x, y) and from observation (c) above, according to which the corner configuration corresponds to the minimum USAN area. The SUSAN cornerness measure C_S(x, y) can be evaluated with the following function:

C_S(x, y) = \begin{cases} g - n(x, y) & \text{if } n(x, y) < g \\ 0 & \text{otherwise} \end{cases}   (6.78)

where g is a predefined threshold that is set equal to n_max/2 to obtain a corner measure, while setting g = 3 n_max/4 gives an edge measure.
Fig. 6.51 Results of the SUSAN corner detector
The cornerness measure C_S has high values in correspondence with the minimum USAN area, determining points of interest of corner type, while intermediate values determine edge points. The SUSAN image obtained with (6.78) can be improved by applying the non-maximum suppression algorithm. Modified versions of the SUSAN detector, based on morphological operators, have been developed to improve the accuracy of corner localization. Figure 6.51 shows the results of the SUSAN corner detector. This detector can also be used to determine contours. It is computationally fast and has sufficient repeatability.
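The following sketch is a simplified illustration of (6.77)–(6.78); the mask radius and the brightness threshold t are assumptions of this example, and border pixels are handled by wrap-around for brevity rather than by padding.

```python
import numpy as np

def susan_cornerness(image, radius=3.4, t=27.0):
    """USAN area n(x,y) (6.77) and SUSAN corner response C_S(x,y) (6.78)."""
    img = image.astype(float)
    r = int(np.ceil(radius))
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    mask = (xx**2 + yy**2) <= radius**2           # circular mask (37 pixels for radius 3.4)
    offsets = np.argwhere(mask) - r
    n = np.zeros_like(img)
    for dy, dx in offsets:
        # shifted[y, x] = img[y + dy, x + dx]: the mask pixel compared with the nucleus
        shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        n += np.exp(-((shifted - img) / t) ** 6)  # similarity term of (6.77)
    g = mask.sum() / 2.0                          # geometric threshold for corners: n_max / 2
    return np.where(n < g, g - n, 0.0)            # cornerness map (6.78)
```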
6.10.2 Trajkovic–Hedley Segment Test Detector

This detector [32] uses the basic idea of the SUSAN detector. However, instead of evaluating the pixel intensities of the whole circular window, only the pixels on a Bresenham circle of predefined radius are analyzed.³ In this case, the pixels are analyzed in pairs, located diametrically opposite at the extremes of an arbitrary segment passing through the center of the circular window (the nucleus of SUSAN) and having as extremes the pixels P and P′ identified on the circle (see Fig. 6.52a). The cornerness measure defined by this detector is given by the following function:

C(x, y) = \min_{P, P' \in \{\text{pixels of the circle}\}} \left[ (I_P - I_p)^2 + (I_{P'} - I_p)^2 \right]   (6.79)
C(x, y) indicates the cornerness output map of the detector, where each pixel (x, y) reports a cornerness measure and is considered a potential corner if C(x, y) ≥ T, where T is an adequate threshold.
3 Bresenham proposed an algorithm to determine the discrete points that define a circle. In the image
context, the algorithm finds the path of the set of pixels that approximate a circumference of radius r.
Fig. 6.52 Notation for the Segment Test algorithm and possible detectable local structural configurations: a Discrete circle for comparing the pixel under examination p, on which it is centered, with the pixels identified on the circle; b corner configuration, c edge, and d homogeneous zone
I_P and I_{P′} are, respectively, the intensities of the diametrically opposite pixels P and P′ identified by the arbitrary segment centered on the pixel under examination p and intersecting the ring, also centered at p, whose intensity is indicated with I_p. In Fig. 6.52, we can observe the different structural configurations distinguishable with the cornerness measure C(x, y). These configurations are:

(a) Corner. Represented in Fig. 6.52b, where a potential corner pixel is located at p, the center of the ring, if for every diametrical pair at least one of I_P and I_{P′} differs from I_p. The potential corner points have a very high cornerness value.

(b) Edge. This configuration (see Fig. 6.52c) occurs when a single segment PP′ is exactly superimposed on an edge, with the intensity values I_P, I_{P′}, and I_p resulting approximately identical. The cornerness measure achieves intermediate values for this edge configuration.

(c) Homogeneous zones. Represented in Fig. 6.52d, this occurs when the ring in question is superimposed in a dominant manner on a homogeneous area of the image. In this case, more than one segment has identical values of the three pixels I_P, I_{P′}, and I_p. A very low value of the cornerness measure follows.

A particular configuration is that of isolated pixels, for which the cornerness measure takes very high values, considering that each segment exhibits values of I_P and I_{P′} systematically different from I_p. Isolated points are normally caused by noise, partly removable with a Gaussian prefiltering of the image. This detector proves to be computationally efficient (about a factor of 3 faster than SUSAN) with sufficient repeatability. On diagonal edges, it produces false corners. A multigrid version shows good accuracy in corner localization through a local linear interpolation (up to a 4 × 4 window and 8-adjacency) of the intensity at the subpixel level.
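A minimal sketch of the cornerness measure (6.79) follows, using a small set of diametrically opposite offsets on a discrete circle of radius 3; the offsets are an assumption of this illustration, not the exact Bresenham circle of the original paper, and borders are handled by wrap-around.

```python
import numpy as np

# Half of the diametrical pairs on a radius-3 discrete circle; the opposite
# point of (dy, dx) is (-dy, -dx).
HALF_CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3)]

def trajkovic_cornerness(image):
    """C(x,y) = min over diameters of (I_P - I_p)^2 + (I_P' - I_p)^2  (6.79)."""
    img = image.astype(float)
    responses = []
    for dy, dx in HALF_CIRCLE:
        P = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)    # pixel P on the circle
        Pp = np.roll(np.roll(img, dy, axis=0), dx, axis=1)     # diametrically opposite P'
        responses.append((P - img) ** 2 + (Pp - img) ** 2)
    return np.min(responses, axis=0)   # low on edges and flat areas, high on corners
```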
6.10.3 FAST—Features from Accelerated Segment Test Detector

This detector, originally proposed by Rosten and Drummond [21,33], extracts points of interest building on the SUSAN detector. FAST, as its name suggests, is motivated by the extraction of corners in real time. The SUSAN detector calculates the percentage of pixels that are similar to a reference pixel within a certain circular area. The FAST detector, on the other hand, by analogy with the Trajkovic–Hedley detector, starts from this idea but compares only 16 pixels on the circumference (a Bresenham circle of radius 3) to decide whether the point in question is a potential corner. Each pixel of the circumference is indicated by the numbers from 1 to 16, clockwise (see Fig. 6.53). Originally, the Segment Test identified p as a corner if there exists a set of n contiguous pixels on the circumference (defining an arc) that are all brighter than the intensity I_p of the pixel p under examination plus a threshold value t, or all darker than the value I_p − t (see Fig. 6.53a). The potential corner conditions are defined as follows:

Condition 1: a set S of n contiguous pixels such that ∀x ∈ S ⇒ I_x > I_p + t
Condition 2: a set S of n contiguous pixels such that ∀x ∈ S ⇒ I_x < I_p − t

A direct implementation of this segment test is sketched at the end of this subsection, after the list of drawbacks. This first approach is conditioned by the choice of the parameter n, the number of contiguous pixels (i.e., the length of the arc), and of the value of the threshold t (for example, considering 20% of the test pixels). In the initial version, n is chosen equal to 12, allowing a high-speed test that eliminates a large number of non-corners. The quick test used to exclude non-corners examines only the 4 pixels numbered 1, 5, 9, 13 (see Fig. 6.53b). Pixels 1 and 9 are analyzed first: if both values I_1 and I_9 lie within the interval [I_p − t, I_p + t], then the candidate p is not a corner. To check whether p could be a corner, pixels 5 and 13 are also examined, to test whether at least 3 of these 4 pixels are brighter than I_p + t or darker than I_p − t. If neither condition is met, p is not considered a corner and is eliminated. Otherwise, once at least 3 pixels that are all brighter or all darker have been found, the remaining pixels on the circumference are examined for the final decision. This approach, although fast, has the following drawbacks:

1. The test cannot be generalized for n < 12, since the potential pixel p can be a corner if only two of the four pixels are both brighter than I_p + t or both darker than I_p − t; the number of potential corners becomes very high.
2. The efficiency of the detector [34] depends on the choice and ordering of the 16 test pixels. However, it is unlikely that the pixels chosen are optimal with respect to the distribution of corner appearances.
3. Multiple adjacent corners are detected.
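The following sketch is an illustrative implementation of the basic segment test, not Rosten and Drummond's optimized code; it checks whether a pixel is a FAST corner for given n and t and assumes the pixel lies at least 3 pixels away from the image border.

```python
import numpy as np

# Bresenham circle of radius 3: 16 (row, col) offsets, numbered clockwise from the top.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def is_fast_corner(img, r, c, n=12, t=20):
    """Segment test: n contiguous circle pixels all brighter than I_p + t
    or all darker than I_p - t (Conditions 1 and 2)."""
    Ip = int(img[r, c])
    ring = np.array([int(img[r + dr, c + dc]) for dr, dc in CIRCLE])
    brighter = ring > Ip + t
    darker = ring < Ip - t
    for states in (brighter, darker):
        # duplicate the ring so that arcs wrapping around position 16 -> 1 are found
        run, best = 0, 0
        for s in np.concatenate([states, states]):
            run = run + 1 if s else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```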
6.10.3.1 Algorithm Improvement Through Machine Learning Paradigms

To solve the first two problems mentioned above, a learning algorithm is introduced, which operates as follows:
Fig. 6.53 Segment Test detector. a Check whether the pixel under examination p is a potential corner by analyzing the ring pixels (bold squares). The arc indicates the 12 contiguous pixels that are brighter than the pixel p with respect to a given threshold. P indicates the vector where the 16 pixel intensities are stored. b Quick test to exclude non-corner pixels by analyzing only the 4 pixels in the cardinal positions on the ring
1. Select a set of test images (preferably from the application context).
2. For each test image, apply the FAST algorithm (Segment Test) for a given n and an appropriate threshold t. In this way, for each pixel under examination, the 16 pixels of the ring (centered on the pixel under examination) are examined and their intensity values are compared with the threshold t.
3. For each pixel p, keep the 16 pixels of the ring as a vector (see Fig. 6.53b).
4. Repeat this for all pixels in all images. The vector P contains all the learning data, that is, the set of all the pixels of the learning images.
5. Each value of the vector P, that is, one of the 16 pixels, addressed by the index x and indicated with p → x, can assume 3 states, indicated with S_{p→x}: darker than p, similar to p, or brighter than p. These states are formalized as follows:

S_{p \to x} = \begin{cases} d, & I_{p \to x} \le I_p - t & \text{(darker)} \\ s, & I_p - t < I_{p \to x} < I_p + t & \text{(similar)} \\ b, & I_p + t \le I_{p \to x} & \text{(brighter)} \end{cases}   (6.80)

where I_{p→x} indicates the intensity value of the pixel p → x, and t is the threshold.

6. Choosing an index x, identical for all the points p, the vector P is partitioned into three subsets P_d, P_s, P_b, defined as follows:

P_d = \{ p \in P : S_{p \to x} = d \}   (6.81)
P_s = \{ p \in P : S_{p \to x} = s \}   (6.82)
P_b = \{ p \in P : S_{p \to x} = b \}   (6.83)
7. Define a Boolean variable K_p which is true if p is a corner and false otherwise.
8. Use the ID3 algorithm [35] (a decision tree classifier) to query each subset P_d, P_s, P_b, considering the variable K_p as the knowledge of the true class.
9. The ID3 algorithm is applied by selecting the index x that yields the most information about whether the pixel is a potential corner or not, evaluated through the entropy of K_p. The total entropy H of an arbitrary set Q of corners is defined as follows:

H(Q) = (c + \bar{c}) \log_2 (c + \bar{c}) - c \log_2 c - \bar{c} \log_2 \bar{c}   (6.84)

where

c = |\{ i \in Q : K_i \text{ is true} \}|   (number of corners)

\bar{c} = |\{ i \in Q : K_i \text{ is false} \}|   (number of non-corners)
The choice of x leads to the information gain H_g:

H_g = H(P) - H(P_d) - H(P_s) - H(P_b)   (6.85)

10. With the selection of the x that produces the best information gain, the process is applied recursively to the three subsets. For example, if x_b is selected to partition the associated set P_b into the subsets P_{b,d}, P_{b,s}, P_{b,b}, then another x_s is selected to partition P_s into the subsets P_{s,d}, P_{s,s}, P_{s,b}, and so on, where each x is selected to produce the maximum information gain for the set to which it is applied.

11. The recursive process ends when the entropy of a subset becomes null; this implies that all the pixels p of this subset have the same value of K_p, that is, they are all corners or all non-corners.

12. This result is guaranteed since K is an exact Boolean function of the learning data. It should be noted that the FAST algorithm, based on the ID3 algorithm (see Sect. 1.14 Vol. III), correctly classifies all the corners acquired from the training images and therefore, with good approximation, correctly captures the rules of the FAST corner detector. It is also noted that, although the training data are representative of the possible corners, the trained detector cannot detect exactly the same corners as the Segment Test detector.
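A small sketch of the entropy and information-gain computation of (6.84)–(6.85) used to pick the splitting index x; the data layout (a boolean label array plus a per-pixel state array) is an assumption of this illustration.

```python
import numpy as np

def corner_entropy(is_corner):
    """H(Q) = (c + c_bar) log2(c + c_bar) - c log2 c - c_bar log2 c_bar  (6.84)."""
    c = np.count_nonzero(is_corner)
    c_bar = is_corner.size - c
    def xlog2x(v):
        return v * np.log2(v) if v > 0 else 0.0
    return xlog2x(c + c_bar) - xlog2x(c) - xlog2x(c_bar)

def information_gain(is_corner, states):
    """H_g = H(P) - H(P_d) - H(P_s) - H(P_b)  (6.85), where `states` holds the
    label 'd', 's' or 'b' of one ring position x for every training pixel."""
    total = corner_entropy(is_corner)
    return total - sum(corner_entropy(is_corner[states == s]) for s in ('d', 's', 'b'))
```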
6.10.3.2 Suppression of Non-maxima

The solution to the third drawback, the detection of multiple adjacent corners, can be obtained by applying the non-maximum suppression approach. However, in this context, since the Segment Test detector does not use a function that evaluates a cornerness measure, the NMS approach is not directly applicable to the detected corners. It is, therefore, necessary to define a cornerness function heuristically, which associates a cornerness measure with each corner in order to filter the less significant ones among adjacent corners. A cornerness function can be defined as the sum of the absolute differences between the contiguous pixels of the arc and the pixel under examination p. This cornerness function V is defined as follows:

V = \max\left( \sum_{x \in S_{Bright}} |I_{p \to x} - I_p| - t,\; \sum_{x \in S_{Dark}} |I_p - I_{p \to x}| - t \right)   (6.86)
where

S_{Bright} = \{ x \mid I_{p \to x} \ge I_p + t \} \qquad \text{and} \qquad S_{Dark} = \{ x \mid I_{p \to x} \le I_p - t \}

p is the pixel under consideration, t is the threshold used by the detector, and I_{p→x} are the values of the n contiguous pixels of the ring.
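Continuing the earlier segment-test sketch, the following fragment computes the heuristic score V of (6.86) used for non-maximum suppression (the per-element subtraction of t is one common reading of (6.86) and is an assumption of this illustration).

```python
import numpy as np

# Same 16 ring offsets as in the segment-test sketch above.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def fast_score(img, r, c, t=20):
    """Cornerness V (6.86): max of the summed absolute differences, minus t,
    over the bright and dark subsets of the ring pixels."""
    Ip = int(img[r, c])
    ring = np.array([int(img[r + dr, c + dc]) for dr, dc in CIRCLE])
    bright = ring[ring >= Ip + t]
    dark = ring[ring <= Ip - t]
    v_bright = np.sum(np.abs(bright - Ip) - t) if bright.size else 0
    v_dark = np.sum(np.abs(Ip - dark) - t) if dark.size else 0
    return max(v_bright, v_dark)
```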
6.10.3.3 FAST Detector Results

FAST is a detector proposed for real-time applications, considering its high performance in terms of computational load both in the basic version and with the learning component. Compared to slower detectors, such as the DoG (used in SIFT), FAST is up to 60 times faster. Several versions, indicated with FASTn for n = 9, ..., 12, have been tested. FAST9 is the most performing, both with respect to the other FASTn versions and to the other corner detectors. Normally, the additional computational cost for the suppression of non-maxima is about 1.25 times for FASTn and 1.5 times for the original FAST. While the DoG is very robust to noise thanks to its built-in smoothing action, FAST in contrast is much less robust. Furthermore, FAST is not invariant to scale and rotation. It is sensitive to changes in lighting, even though this can be controlled by adequately varying the threshold t, which greatly affects the detector's performance. FAST demonstrates high repeatability. Its peculiar characteristics of speed and repeatability make it well suited to mobile robotics and object tracking applications. In particular, for navigation, vision modules are used as visual odometers (based on the corners detected in the scene) for the robot's self-localization using the SLAM (Simultaneous Localization and Mapping) approach. FAST-ER (Enhanced Repeatability) is a version of the FAST detector that improves repeatability using the Simulated Annealing⁴ algorithm to optimize the search with a ternary decision tree. Although FAST-ER is about twice as slow as FAST9, it exhibits excellent repeatability even in the presence of noisy images. In [34] the authors present a detailed evaluation of the FAST and FAST-ER detectors. The experimental results are reported comparing them with other corner detectors (SUSAN, Harris, DoG, Shi–Tomasi). Given the high speed and repeatability of these detectors, many tests are performed on video image sequences. Figure 6.54 shows some results obtained with the FAST9 version of the FAST detector family.
4 An approach that finds a global minimum in situations where there are several local minima. The term simulated annealing derives from metallurgy, where it describes the process of heating and cooling metals to make them more resistant through a reconfiguration of their highly ordered atomic lattice structure. Improvement or local search techniques start from a feasible solution and aim to improve it through the iterative application of a move; they converge to a local optimum. To improve the quality of the solutions that can be identified with these techniques, so-called meta-heuristic techniques can be used.
Fig. 6.54 Results of the FAST9 algorithm obtained with open-source MATLAB software. a Points of interest detected with the suppression of non-maxima; b Points of interest detected on image (a) deformed with a shear of 25° in the direction of the x axis; c Points of interest detected on a pair of stereo images for the self-localization of a mobile robot
6.11 Regions Detectors and Descriptors

The points of interest extracted so far, with their relative descriptors, may not carry enough local information to be used, for example, in the recognition of objects in the scene, or to solve the correspondence problem in image sequences, or when, during the acquisition process (due to changes in viewpoint and lighting), the images undergo complex deformations. In these cases, information based on intensity alone no longer provides stable cues for object recognition. Therefore, all the detectors of regions or points of interest based exclusively on intensity, on which the object recognition algorithms rely, can fail to identify significant structures. In the literature, algorithms have been developed that extract global information from the scene, such as the principal components used as descriptors. Given the 3D nature of objects, such approaches that extract global information can have limitations in the presence of multiple objects in the scene, occlusions, and shadows caused also by nonuniform illumination. Alternative approaches have been developed as a compromise between extracting global and local information. These approaches extract more local information of an object, or of parts of it, starting from the segmented image or from the contours, applying robust algorithms such as watershed for the segmentation and the Canny algorithm to detect the contours. The regions of interest are detected through the analysis of local properties of extreme variability of brightness in the image (as in the IBR—Intensity extrema-Based Regions detector) or through the analysis of the principal curvature. The latter are referred to as "structure-based" detectors in the literature, which are based on the
geometric structures present in the image, such as lines and curved edges, from which regions or points of interest are then extracted. While the intensity-based detectors rely on the analysis of the local geometry of intensity patterns and detect points or regions of interest that satisfy some uniqueness criteria (SIFT, Hessian-Affine, Harris-Affine, MSER), the detectors based on local structures rely on the extraction of lines, contours, and curves to detect regions that satisfy structural and shape criteria (EBR, SISF, PCBR).
6.11.1 MSER—Maximally Stable Extremal Regions Detector

This detector, proposed by Matas et al. [36], is based on an approach that extracts stable homogeneous regions from an image; it was initially motivated by the need to make the correspondence of regions robust between pairs of images acquired from different points of view and for object recognition applications. Unlike the detectors described above, which first find the points of interest and then define the invariance properties, the MSER approach starts from the perspective of segmentation. In fact, it is based on a threshold segmentation algorithm. It extracts homogeneous connected regions whose intensity is stable over a wide range of thresholds, forming maximally stable extremal regions. By construction, these regions are stable with respect to some acquisition conditions, such as different points of view of the scene and changes in the lighting conditions. Since they are extracted through a segmentation process, these regions are not necessarily elliptical but can take complex contours. However, these contours can subsequently be approximated with ellipses by calculating the eigenvectors and eigenvalues of their matrix of second moments. In some applications the contours themselves have been considered useful as descriptors. The MSER detector can be considered inspired by a watershed segmentation algorithm. The gray-level image is represented by the function I : Ω → [0..255], where Ω = [1..M] × [1..N] is the set of pixel coordinates of the image. The thresholding process considers all the possible thresholds t ∈ [0..255] of the image I(x, y), and a threshold image I_t is obtained as follows:

I_t(x, y) = \begin{cases} 255 & \text{if } I(x, y) > t \quad \text{(white region)} \\ 0 & \text{otherwise} \quad \text{(black region)} \end{cases}   (6.87)

thus obtaining a binary image with two sets of pixels, one of white pixels W and the other of black pixels B, given by

B \equiv \{(x, y) \in \Omega : I(x, y) \le t\} \qquad W \equiv \Omega \setminus B

If the value of the threshold t varies from the maximum (white) to the minimum (black) of the intensity, the cardinality of the two sets W and B changes continuously. In the first step, with the maximum threshold t_max, all the pixels of the image are contained in B and the set W is empty. As the threshold t decreases, white pixels begin to be included in the threshold images I_t, forming ever larger connected regions in W, with possible merging of regions as the threshold approaches the minimum.
Fig. 6.55 Evolution of the thresholding process, from the maximum value of the threshold (black image) to the minimum value (completely white image), in the MSER detector. From left to right, the results of 7 different threshold levels applied to the input image are displayed, obtained as the threshold t decreases
At the end of the thresholding process, when the threshold reaches the minimum t_min, the image I_{t_min} has all the pixels in W and the set B is empty. Note that the same thresholding process can be applied by inverting the intensity of the image I. If we visualize in sequence (see Fig. 6.55) all the binarized images I_t, t = [min, ..., max], at the beginning we would see a completely black image; then white dot-like regions would appear and extend more and more, some merging, until the final image is completely white. Let us now fix our attention on the sequence of binary images, observing quasi-point regions (a few pixels, seen as seeds) that grow continuously up to a certain image of the sequence where they represent a physical entity (for example, an object, the text of a print, ...). In other words, some significant regions grow, forming connected regions also generated by the fusion of adjacent regions, and their growth becomes stable as the threshold varies. These regions are called extremal regions for the property that their pixels have intensity greater than the pixels of their contour. Each extremal region is a connected component of a threshold image I_t. With the help of Fig. 6.56, we can define an extremal region Q as a connected region of a threshold image I_t if

\forall p \in Q \ \text{and} \ \forall q \in \partial Q : I(p) > I(q)

where p and q indicate, respectively, the pixels of the connected region and of its boundary. The condition I(p) < I(q) must hold if the thresholding process is done on the negative image. The goal of the MSER algorithm is to detect, among the extremal regions, the maximally stable extremal regions, for which the variation of the area as t changes over a certain interval, normalized to the corresponding area, reaches a local minimum. The MSER regions are detected by considering nested sequences of extremal regions Q_1 ⊂ Q_2 ⊂ ... ⊂ Q_t ⊂ Q_{t+1} ⊂ ... ⊂ Q_max. An extremal region Q_{t*} is maximally stable (i.e., an MSER region) if the following function:

q(t) = \frac{|Q_{t+\Delta} \setminus Q_{t-\Delta}|}{|Q_t|}   (6.88)

assumes a local minimum at the threshold t*. In (6.88), the symbol | • | indicates the cardinality of a set, the symbol \ indicates the set difference operator, and Δ indicates the input parameter of the algorithm, i.e., the interval of the intensity levels of the threshold (Δ ∈ [0...255]).
Fig. 6.56 Extraction of the extremal regions, among the nested regions, in the MSER detector
The MSER region detector exhibits the following properties:

(a) Invariance to affine transformations of the image.
(b) Stability, assured considering that only regions whose support is almost unchanged over a range of thresholds are selected.
(c) Multiscale detection without the need for smoothing convolutions, ensuring the detection of both small and large structures present in the image. Experimentally, it has been shown that a multiscale pyramidal representation improves the repeatability and the correspondences across scale changes.
(d) Computational complexity depending on the number of extremal regions, which in the worst case is O(m × n), where m × n is the number of pixels of the image.

Figure 6.57 shows some results of the MSER detector obtained with the VLFeat software. The parameter Δ = 6 was used for the test image. Remember that the MSER detector was originally created to establish the correspondence of points between pairs of images. It is applied initially on gray-level images and on the negative image in a multiscale context. Each of the potential regions of interest of one image of the pair is compared with all the regions of the other image of the pair to determine the best similarity, also considering some epipolar geometry constraints of the image pair. The RANSAC iterative algorithm is used to filter outlier regions. As well demonstrated in [30], the MSER detector was evaluated with different test images and compared with several other interest point detectors (Harris-Affine, Hessian-Affine, ...), proving to be the best performing.
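For completeness, a short usage sketch with OpenCV's MSER implementation follows (the API calls shown are believed to be available in recent OpenCV versions but should be checked against the installed release; the ellipse fitting via the region covariance is an assumption of this example, mirroring the second-moment approximation described above).

```python
import cv2
import numpy as np

def detect_mser_ellipses(gray):
    """Detect MSER regions and approximate each one with an ellipse."""
    mser = cv2.MSER_create()                 # default parameters; the Delta of (6.88) is configurable
    regions, _ = mser.detectRegions(gray)    # each region is an array of (x, y) pixel coordinates
    ellipses = []
    for pts in regions:
        pts = pts.astype(float)
        mean = pts.mean(axis=0)
        cov = np.cov((pts - mean).T)         # second-moment matrix of the region
        w, V = np.linalg.eigh(cov)
        ellipses.append((mean, np.sqrt(w), V))   # center, semi-axis scales, orientation
    return ellipses
```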
6.11.2 IBR—Intensity Extrema-Based Regions Detector

This is another detector of affine-invariant regions, based on the local properties of extreme variability of brightness in the image [37]. The basic idea is to find anchor points to which to attach regions of the image that take an adaptive shape, so as to represent the same physical entity of the scene (assumed planar) when viewed from
Fig. 6.57 MSER detector results. The first line shows the images processed with superimposed ellipses that approximate the MSER regions found. On the second line the contours of the MSER regions are shown instead for the same images processed. Application of the MSER detector to the following images: a Original image; b Original image reduced by a factor of 2 and rotated by 45◦ ; c original image transformed with shear of 25◦ with respect to the axis of x; d original image transformed with shear of 25◦ with respect to the y-axis and reduced by a factor 2. The figures e, f, g, and h show the contours of the MSER regions found for the same images of the first line
different points of view. In other words, by changing the perspective, these invariant elementary regions change their shape in the image. Although the change in their shape depends on the point of view, the information content of these elementary regions can remain invariant, deriving from the same physical entity of the scene. The anchor points to be detected can be considered as seeds to which the elementary regions to be extracted are associated. The essential steps of this method are the following (a minimal numerical sketch of the radial function of step 3 is given at the end of this subsection):
1. Initially determines in the image the potential multiscale anchor points that have a high level of repeatability. They can be determined as corner points, for example, with one of the detectors described above (Harris, ...), or as local extrema of intensity. The latter are localized with less accuracy because the image is first smoothed to reduce the noise (without smoothing the extrema would be extremely unstable). In this case, to reduce the number of non-maximum local extrema, the NMS algorithm already used to thin edges is applied.
2. Radially analyzes the local intensity around these anchor points, thus allowing regions with an arbitrary shape to emerge (Fig. 6.58).
3. Evaluates the intensity function f_I(t), starting from each anchor point along each outgoing ray:

f_I(t) = \frac{\mathrm{abs}[I(t) - I_0]}{\max\left(\frac{\int_0^t \mathrm{abs}[I(t) - I_0]\, dt}{t},\; d\right)}    (6.89)
Fig. 6.58 IBR region detector method. The intensity is analyzed radially starting from an extremum point, and along each ray the point where the function f_I(t) expressed by (6.89) reaches its maximum value is selected. Connecting all these points generates an affine-invariant region, subsequently approximated with an elliptical shape using the second-order moments
where t is the position along the ray, I(t) is the intensity at position t, I_0 is the intensity at the extremum, and d is a regularization parameter with a small value to prevent division by zero. Normally, the maximum of the function f_I(t) is reached where the intensity abruptly increases or decreases. The point where this function reaches an extremum is invariant to affine geometric transformations and to linear photometric transformations. This situation generally occurs when the edge of a homogeneous elementary region is encountered.
4. Selects and connects all the points of step 3 where the function f_I(t) reaches an extremum, exploring the image radially, to extract an affine-invariant region as shown in Fig. 6.58.
5. Approximates, optionally, the arbitrarily shaped affine-invariant regions extracted in the previous step with an elliptical shape. These elliptical regions have the same moments, computed up to the second order, as the extracted arbitrary shapes. The resulting approximating ellipses are still affine invariant. The ellipses are not necessarily centered on the original anchor points; this does not prevent the use of the elliptical invariant regions found to solve the problem of correspondence between pairs of images acquired from different views.
As described in [30], the IBR detector was evaluated on different test images and compared with several other interest point detectors (Harris-Affine, Hessian-Affine, ...), proving to be among the best performing and, in particular, showing a good score for the number of correct matches.
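The following is a minimal numpy sketch of the radial function f_I(t) of (6.89) evaluated along one discretized ray leaving an intensity extremum; the names (radial_response, ray_profile) and the default value of d are illustrative, and the cumulative sum stands in for the integral.

```python
# Minimal sketch of f_I(t) of (6.89) along a single discretized ray.
# ray_profile[k] is the image intensity sampled at distance t = k along the ray,
# with ray_profile[0] = I0 the intensity at the extremum; d prevents division by zero.
import numpy as np

def radial_response(ray_profile, d=1e-3):
    ray_profile = np.asarray(ray_profile, dtype=float)
    I0 = ray_profile[0]
    dev = np.abs(ray_profile - I0)              # |I(t) - I0|
    t = np.arange(1, len(ray_profile))          # positions t = 1, 2, ...
    cum = np.cumsum(dev)[1:]                    # discrete version of the integral in (6.89)
    return dev[1:] / np.maximum(cum / t, d)     # f_I(t)

# On each ray, the selected point is where f_I(t) is maximal:
# t_star = 1 + np.argmax(radial_response(profile))
```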
6.11.3 Affine Salient Regions Detector

The concept of saliency recalls the biological mechanisms of vision used to recognize objects in a scene. Normally, stimuli in biological vision are intrinsically salient, but they are often subjective. In artificial vision, a saliency detector extracts from an image points
or regions of interest using different approaches based on low-level information which, as seen above, detect, in the scale-space, corners, contours of regions, etc. This detector explores the affine Gaussian scale-space using ellipses at different scales and orientations. The concept of saliency proposed in [38] is associated with the entropy of the local intensity distribution of the image calculated on an elliptical region. This low-level information, the entropy, is calculated for each pixel by considering three characteristic parameters (scale, orientation, and ratio between the major and minor axes) of the ellipse centered at that pixel. The set of the local extrema of entropy and the characteristic parameters of the corresponding ellipses are considered potential salient regions. Subsequently, the latter are ranked according to the value of the derivative of the probability density function (pdf) with respect to the scale. The first Q most representative regions are then retained. Let us now look at the mathematical formalism for the calculation of the saliency measure Y of a region. Let D be an elliptical region centered at x with scale parameter s (which specifies the major axis), orientation angle θ, and ratio between the major and minor axes λ. The measure of the complexity of a region D of the image, i.e., its entropy H, is given by

H = -\sum_{D} p(D) \log p(D)    (6.90)
where p(D) indicates the pdf of the image intensity calculated on the region D. The entropy H calculated for each point x on the elliptical region extracts local statistical information and does not contain spatial information. The set of local maxima of H over scale is calculated with respect to the parameters s, θ, λ for each pixel of the image. Subsequently, for each local maximum, the derivative of the pdf p(D; s, θ, λ) with respect to the scale parameter s is calculated, to evaluate how the intensity distribution of the region varies locally. This measure of variability of the pdf is given by

W = -\sum_{D} \frac{\partial p(D; s, \theta, \lambda)}{\partial s}    (6.91)
The final value of the saliency Y of the elliptical region centered at the point x is calculated by weighting the entropy H with W:

Y(x) = H(x) \cdot W(x)    (6.92)
Finally, a ranking procedure based on the saliency measure is applied for the choice of the most significant regions. It should be noted that a first version of this method extracted salient regions with circular symmetry (defined by the radius r), resulting in a detector invariant only to geometric similarity transformations, although faster. Like all detectors of interest points and regions, this one is also sensitive to noise; in particular, the saliency measure is mostly influenced by the effects of noise at low values of entropy. This detector was included in the comparative evaluation described in [30], carried out on different test images against several other interest point detectors (Harris-Affine, Hessian-Affine, ...). In this evaluation, it obtained a low score, even though it performed well for the recognition of object classes, confirming the results described in [38,39].
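A minimal numpy sketch of the circular-symmetric variant of the method follows: the entropy of (6.90) is computed from a local intensity histogram at several radii, and the absolute inter-scale histogram difference is used as a discrete stand-in for the derivative in (6.91); the histogram binning, the radii, and the 8-bit intensity range are illustrative assumptions.

```python
# Minimal sketch of the circular-symmetric salient region measure: entropy H (6.90)
# of the local intensity pdf, weighted by its variation over scale, cf. (6.91)-(6.92).
import numpy as np

def local_pdf(img, x, y, radius, bins=16):
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    mask = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
    h, _ = np.histogram(img[mask], bins=bins, range=(0, 256))   # assumes 8-bit intensities
    return h / max(h.sum(), 1)                                  # p(D) of (6.90)

def saliency_over_scales(img, x, y, radii=range(3, 15)):
    pdfs = [local_pdf(img, x, y, r) for r in radii]
    H = [-np.sum(p[p > 0] * np.log(p[p > 0])) for p in pdfs]                   # entropy (6.90)
    W = [np.sum(np.abs(pdfs[i + 1] - pdfs[i])) for i in range(len(pdfs) - 1)]  # pdf variation over scale
    # saliency Y = H * W evaluated at each radius, cf. (6.92); local maxima over scale are kept
    return [H[i] * W[i] for i in range(len(W))]
```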
6.11.4 EBR—Edge-Based Region Detector

This detector belongs to the class of detectors that in the literature are called structure-based detectors, which rely on the geometric structures present in the image, such as lines and curved edges, from which regions or points of interest are then extracted. In fact, with the EBR detector, the affine-invariant regions are determined by applying a contour extraction algorithm to the image. One of the first methods to extract affine-invariant regions is the one that initially extracts the Harris–Stephens corner points (described in Sect. 6.3) and the edges extracted with the Canny algorithm (described in Sect. 1.16). These elementary geometric structures (corners and edges) are sufficiently stable when they are extracted from images acquired even from different points of view, with changes of scale and illumination. Furthermore, by operating on these geometric structures, the dimensionality of the problem is considerably reduced, as it is also possible to locally explore the geometry of the edges (straight and curved). The robustness of the method is improved by extracting these geometric structures in a multiscale context. Figure 6.59a shows how it is possible to construct an affine-invariant region, starting from a corner point p = (x, y) and using the edges in its vicinity to reach the points p_1 = (x_1, y_1) and p_2 = (x_2, y_2). Ideally, the points p_1 and p_2 both depart from the corner point p in opposite directions along the edges, until the areas L_i, i = 1, 2, formed by the curved edges l_i, i = 1, 2, and the lines p p_i reach a certain threshold value. The areas L_1 and L_2 must remain equal while the points p_1 and p_2 move away from p. This condition establishes the affine invariance criterion expressed by the parameters l_1 and l_2, defined by

l_i = \int \mathrm{abs}\left(\left|\, p_i^{(1)}(s_i) \;\; p - p_i(s_i) \,\right|\right) ds_i \qquad i = 1, 2    (6.93)
where s_i is an arbitrary parameter of the curve, p_i^{(1)}(s_i) indicates the first derivative of the parametric function p_i(s_i) with respect to s_i, and |...| indicates the determinant. For each position of l (for simplicity, we write l when l_1 = l_2), the two points p_1(l) and p_2(l), together with the corner point p, delimit a region Ω(l) associated with the corner p, whose extension is a function of l and which is geometrically represented by the parallelogram defined by the vectors p_1(l) − p
Fig. 6.59 Detection of regions with the EBR method. a Construction of an affine-invariant region starting from the edges near the corner p; b Geometric interpretation of the function f_1(Ω); and c Geometric interpretation of the function f_2(Ω)
and p_2(l) − p. In this way, we have a one-dimensional family (l being the only parameter) of regions with the shape of a parallelogram. Let us imagine that the motion of the points p_i is stopped at the position where the photometric measurements relating to the texture of the parallelogram reach an extreme value. From this 1D family of regions, one or more parallelograms are selected for which the photometric texture measurements reach extreme values. With the selection of multiple parallelograms, we introduce the concept of scale, with regions of different sizes associated with the same corner. Since it is not guaranteed that a single function reaches an extremum as the parameter l varies, we are faced with the situation of testing multiple functions. Considering the extrema of several functions gives a better guarantee that a greater number of corners will produce some regions. It follows, with the choice of appropriate functions, that the procedure is invariant to changes in lighting and to geometric transformations. The appropriate functions to measure the photometric quantities of the parallelogram texture with the required invariance characteristics are

f_1(\Omega) = \mathrm{abs}\left(\frac{|\, p_1 - p_g \;\; p_2 - p_g \,|}{|\, p - p_1 \;\; p - p_2 \,|}\right) \frac{M^1_{00}}{\sqrt{M^2_{00} M^0_{00} - (M^1_{00})^2}}    (6.94)

f_2(\Omega) = \mathrm{abs}\left(\frac{|\, p - p_g \;\; q - p_g \,|}{|\, p - p_1 \;\; p - p_2 \,|}\right) \frac{M^1_{00}}{\sqrt{M^2_{00} M^0_{00} - (M^1_{00})^2}}    (6.95)
where

M^n_{pq} = \int_{\Omega(l)} I^n(x, y)\, x^p y^q \, dx\, dy    (6.96)

and

p_g = \left(\frac{M^1_{10}}{M^1_{00}},\; \frac{M^1_{01}}{M^1_{00}}\right)    (6.97)
with M^n_{pq} indicating the moment of order n and degree (p + q) calculated on the parallelogram Ω(l), p_g the center of mass of the region weighted with the intensity I(x, y), and q the corner point of the parallelogram on the opposite side with respect to the corner p, as shown in Fig. 6.59. We could also use the computationally simpler function f_3(Ω) = M^1_{00}/M^0_{00}, which represents the average intensity of the region Ω(l), but it is not invariant with respect to affine transformations even if it reaches its extremum at an invariant location. The functions f_1(Ω) and f_2(Ω) are more appropriate, reaching stable and invariant minima [37]. In fact, these functions each have two terms: the first expresses the ratio between two areas, one of which depends on the center of mass weighted with the intensity (which carries the texture information), and the second is a factor that regulates the dependence of the first term on the intensity variation of the image. In other words, the second term contributes to guaranteeing invariance to a shift by a constant offset (translation) of the intensity levels. The geometric interpretation of the first term of f_1 and f_2 is shown in Fig. 6.59b, c: it corresponds to twice the ratio between the shaded area and the total area of the parallelogram. In the search for the local minima of the two functions, more
balanced regions are favored, that is, regions for which the center of mass lies on, or as close as possible to, one of the diagonals of the parallelogram. In the case of straight edges, this method based on the functions f_1 and f_2 cannot be applied, since the parameter l would be zero along the entire straight edge. Since intersections of straight edges are frequent, these cases cannot be overlooked. A solution to the problem is to combine the photometric measurements expressed by (6.94) and (6.95): the locations where both functions reach a minimum value are used to fix the parameters s_1 and s_2 along the straight edges [40]. In this case, an extension of the method is adopted, looking for local extrema in the 2D domain (s_1, s_2), with the two arbitrary parameters s_1 and s_2 for the two straight edges, instead of the 1D domain represented by the parameter l considered for curved edges.
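A minimal numpy sketch of the photometric ingredients of (6.94)–(6.97) follows: the moments M^n_pq of (6.96) computed over a binary mask of the parallelogram Ω(l), the center of mass p_g of (6.97), and the intensity-dependent factor shared by f_1 and f_2; the mask is assumed to be given and all names are illustrative.

```python
# Minimal sketch of the moments (6.96), the center of mass (6.97), and the intensity
# factor M^1_00 / sqrt(M^2_00 M^0_00 - (M^1_00)^2) common to (6.94)-(6.95).
# 'mask' is a boolean image selecting the pixels of the parallelogram Omega(l).
import numpy as np

def moment(img, mask, n, p, q):
    ys, xs = np.nonzero(mask)
    I = img[ys, xs].astype(float)
    return np.sum((I ** n) * (xs.astype(float) ** p) * (ys.astype(float) ** q))

def photometric_factor(img, mask):
    m0 = moment(img, mask, 0, 0, 0)             # area of the region
    m1 = moment(img, mask, 1, 0, 0)
    m2 = moment(img, mask, 2, 0, 0)
    return m1 / np.sqrt(m2 * m0 - m1 ** 2)

def center_of_mass(img, mask):                  # p_g of (6.97)
    m1 = moment(img, mask, 1, 0, 0)
    return np.array([moment(img, mask, 1, 1, 0) / m1,
                     moment(img, mask, 1, 0, 1) / m1])
```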
6.11.5 PCBR—Principal Curvature Based Region Detector

The PCBR detector is a region detector based on the principal curvature and belongs to the class of detectors that in the literature are called structure-based detectors, which rely on the geometric structures present in the image, such as lines and curved edges, from which regions or points of interest are then extracted. An alternative to detectors based on local intensity information are those based on structures, such as PCBR, which can improve the accuracy of object identification. The PCBR detector identifies stable watershed regions based on the principal curvature in a multiscale context. Local structures can be lines and curves extracted from remote sensing images (satellite or airplane) or from biomedical images, where straight or curved structures can be associated, respectively, with roads or with the vascular system. The PCBR detector [41] has the following steps:
1. Curvature calculation. This is accomplished using the first step of the Steger algorithm [42], which detects curvilinear structures in the image. Unlike Steger, the PCBR approach generates an image, called the principal curvature image, from the minimum and maximum eigenvalues of the Hessian matrix calculated for each pixel. The principal curvature image expresses the measure of the principal curvature of the structures associated with the local intensity surface. In this way, single responses are obtained for lines and curves, producing a clearer representation of the information content of the image than that produced by the gradient magnitude (see Fig. 6.60). The local characteristic shape of the surface at a point of the image is described by the Hessian matrix (6.27) (see Sect. 6.5).
2. Calculation of the multiscale images of the principal curvature. The structures of interest to be considered in the image are the rectilinear ones (or curves very close to straight lines) and the edges (markedly curvilinear structures). These structures are characterized by a very high curvature in one direction and a very low one in the orthogonal direction. With a 3D representation of the intensity of
Fig. 6.60 Detection of regions with the PCBR detector. Comparison between the response obtained from the image with a the magnitude of the gradient and that obtained with b the calculation of the principal curvature
the image, the curvilinear structures would correspond to the ridges and valleys of a 3D surface. The local characteristics of the shape of the surface at a given point x = (x, y) of the image can be described by the Hessian matrix which, in the multiscale context and with reference to (6.62), we can rewrite as follows:

H(x, \sigma_D) = \begin{pmatrix} L_{xx}(x, \sigma_D) & L_{xy}(x, \sigma_D) \\ L_{xy}(x, \sigma_D) & L_{yy}(x, \sigma_D) \end{pmatrix}    (6.98)

where L_{xx}(x, σ_D), L_{yy}(x, σ_D), and L_{xy}(x, σ_D) are the second derivatives of the image I(x, y) convolved with a Gaussian kernel with standard deviation σ_D, given by (6.61). In the description of the Harris, Harris–Laplace, and Harris-Affine detectors, the matrix of second-order moments was used to determine where, in the image domain, the image geometry varies locally in multiple directions. With the SIFT detector, instead, the scale-space based on DoG (Difference of Gaussian) images and the Hessian matrix were used to extract the points of interest. The approach used with the PCBR detector is complementary to these detectors. Instead of finding points of interest, the PCBR detector extracts regions of interest by applying the watershed segmentation algorithm to the local geometric structures (ridges, valleys, slopes) extracted from the principal curvature image. All this in the context of images acquired from different points of view with changes of scale and lighting, that is, in the same operating conditions used for the extraction of points of interest. The PCBR detector attempts to recover the very elongated geometric structures that in the Harris detector receive very low cornerness measurements, due to the low values of the first and second derivatives in a particular direction. The principal curvature image is calculated with one of the following expressions (a minimal sketch of this computation is given at the end of this subsection):

P(x) = \max(\lambda_1(x), 0)    (6.99)

or

P(x) = \min(\lambda_2(x), 0)    (6.100)
where λ_1(x) and λ_2(x) are the maximum and minimum eigenvalues of the Hessian matrix H calculated at the point x of the image. Equation (6.99) gives high responses for dark lines on a light background (or on the dark side in the case of an edge), while (6.100) is used to detect light lines on a dark background.
3. Scale-space calculation. The scale-space is constructed with the principal curvature images, in analogy to that realized with the SIFT detector, which provides sets of octave images. In this case, the original image is doubled, obtaining the initial image I_{11}, to which a Gaussian smoothing filter is applied, thus producing the multiscale images I_{1j} with scale parameter σ = k^{j−1}, where k = 3^{1/3} and j = 2, ..., 6. This set I_{1j}, j = 1, ..., 6, consisting of 6 images, forms the first octave. The image I_{14} is downsampled by a factor of 2 to obtain the image I_{21}, which represents the first image of the second octave. The Gaussian smoothing filter is also applied to this image to obtain the rest of the images of the second octave, and the process continues until a total of n = log_2(min(w, h)) − 3 octaves is created, where w indicates the width and h the height of the doubled image. Then the principal curvature images P_{ij}, i = 1, ..., n; j = 1, ..., 6, are calculated for each smoothed image with (6.99), computing the maximum eigenvalue of the Hessian matrix for each pixel. To optimize the computation time, each smoothed image and the associated Hessian image is calculated from the previous smoothed image using an incremental Gaussian scale parameter given by

\sigma = 1.6\, k^{j-1} \sqrt{k^2 - 1}    (6.101)

Once the scale-space with the principal curvature images P_{ij}, i = 1, ..., n; j = 1, ..., 6, has been constructed, the maximum curvature is calculated by considering 3 consecutive principal curvature images of the scale-space (see Fig. 6.24). Recalling that each octave consists of 6 images, we obtain the following set of 4 images in each of the n octaves:

MP_{12}  MP_{13}  MP_{14}  MP_{15}
MP_{22}  MP_{23}  MP_{24}  MP_{25}
 ...      ...      ...      ...
MP_{n2}  MP_{n3}  MP_{n4}  MP_{n5}    (6.102)
where MP_{ij} = max(P_{ij−1}, P_{ij}, P_{ij+1}). Figure 6.61b shows one of the maximum curvature images MP, created by maximizing the principal curvature at each pixel over three consecutive principal curvature images of the scale-space.
4. Segmentation of the maximum principal curvature images. From the maximum principal curvature images obtained in the previous step, the regions with good stability are detected through the watershed segmentation algorithm (see Sect. 5.7 in Chap. 5). The watershed algorithm can be applied to binary images, gray-level images, or the gradient magnitude of an image; in this context, it is applied to the principal curvature images. To mitigate the noise, a grayscale morphological closing operation is applied to the curvature images, followed by a hysteresis threshold. The closing operation is defined as f • b = (f ⊕ b) ⊖ b, where f is the image MP obtained from (6.102), b is a circular structuring element of size 5 × 5, and ⊕ and ⊖ are, respectively, the grayscale dilation and erosion operators.
Fig. 6.61 Detection of regions with the PCBR detector. a Original image; b principal curvature; c Binarized and filtered image with grayscale closing morphological operator; d Extraction of regions with watershed segmentation; and e Regions detected approximated with ellipses
In particular, the closing operator removes small holes along the principal curvature ridges, thus eliminating many local minima, caused mostly by noise, that would generate spurious regions with the watershed segmentation algorithm. A further noise reduction procedure is applied to the curvature images to eliminate additional spurious regions not filtered out, due to the limitations of the morphological closing operator. This is achieved through a thresholding operation applied to the principal curvature image, thus creating a more accurate binarized principal curvature image. However, rather than applying a single threshold or a plain hysteresis range, a more robust thresholding procedure (with hysteresis interval) is applied, supported by the eigenvectors, which guide the connectivity analysis of the local geometric structures to better filter out spurious structures. This is possible because the eigenvalues of the Hessian matrix are directly related to the signal amplitude (in this case the contrast level of lines and edges), and the principal curvature images have very small values at low-contrast structures (lines and edges). Small low-contrast segments of these structures can cause interruptions in the thresholded principal curvature images, causing erroneous mergings by the segmentation algorithm. This can be prevented through an analysis of the local geometry with the support of the eigenvectors, which provide the direction of the curvilinear structures, and which is therefore more effective than the analysis of the eigenvalues alone. The thresholds that define the hysteresis range are calculated as a function of the (suitably normalized) eigenvectors, by comparing the major (or minor) directions of the pixels in the 8-neighborhood. This eigenvector-supported thresholding procedure produces better accuracy in constructing continuous structures, filtering out the variations induced by noise and yielding more stable regions (see Fig. 6.61).
5. Elliptical approximation of the watershed regions. The final regions of interest extracted by the PCBR detector with the segmentation algorithm are then approximated with an elliptical shape through the principal components analysis (PCA) transformation, maintaining the same second-order moments as the watershed regions (see Fig. 6.61).
6. Selection of stable regions. The maximum principal curvature images obtained from (6.102) already provide an approach to detect stable regions. A more robust approach is the one adopted by the MSER detector, maintaining those regions that
Fig. 6.62 Detection of regions with the PCBR method. a Regions extracted with the watershed segmentation applied directly to the principal curvature image; b Watershed segmentation applied to the principal curvature image after filtering with the morphological closing operator; c and d results of the PCBR detector applied to the first two images of the INRIA dataset
can be determined in at least three consecutive scales. In a way similar to the thresholding procedure used by the MSER detector, the selected regions are those that are stable through local scale changes. Figure 6.62 shows the results of the PCBR detector of regions of interest. The performance of the PCBR detector was evaluated in [41] by comparing it with other detectors for different types of images and application contexts.
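A minimal sketch of the principal curvature image of (6.99), built from Gaussian-derivative estimates of the Hessian with scipy, is shown below; it covers only step 1 of the detector, and the scale value σ_D = 1.6 is an illustrative assumption.

```python
# Minimal sketch of the principal curvature image P(x) = max(lambda_1(x), 0) of (6.99).
# The Hessian entries are Gaussian derivatives of the image at scale sigma_d.
import numpy as np
from scipy.ndimage import gaussian_filter

def principal_curvature_image(img, sigma_d=1.6):
    f = img.astype(float)
    Lxx = gaussian_filter(f, sigma_d, order=(0, 2))   # d^2/dx^2 (order given as (dy, dx))
    Lyy = gaussian_filter(f, sigma_d, order=(2, 0))   # d^2/dy^2
    Lxy = gaussian_filter(f, sigma_d, order=(1, 1))   # d^2/dxdy
    # maximum eigenvalue of the 2x2 Hessian [[Lxx, Lxy], [Lxy, Lyy]] at each pixel
    lam1 = 0.5 * (Lxx + Lyy + np.sqrt((Lxx - Lyy) ** 2 + 4.0 * Lxy ** 2))
    return np.maximum(lam1, 0.0)    # responds to dark lines on a light background
```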
6.11.6 SISF—Scale-Invariant Shape Features Detector

The SISF detector [43] is a detector of regions of interest based on the determination of circles at various scales and in different positions in the image, evaluating convex salient structures from the edges (extracted with the Canny algorithm) through a measure that is maximized when a circle is well supported by the surrounding edges. This type of detector is also among those that in the literature are called structure-based detectors, which rely on the geometric structures present in the image, such as lines and curved edges, from which regions or points of interest are then extracted. The basic motivation of this detector is to extract regions based on shape information, which guarantees greater robustness and repeatability than intensity-based detectors. The first step of this approach is the extraction of the contours of the image using the Canny algorithm (as for the EBR detector) for each level of the multiscale representation. The scale-space constructed is made up of images characterized by the scale parameter σ, each with its set of contour points p_i and the associated gradient information g_i normalized to unit length. The second step provides for the definition of circles with different radii (related to the scale factor σ), centered at the positions c of each pixel of the scale-space images, and the search for the contour points p_i that lie near each circle. For each point (c, σ), a saliency measure is defined as a weighted sum of the local contributions of the contour points, with weights that account for the proximity to the circle and the alignment with its local tangent (see Fig. 6.63). This
Fig. 6.63 Detection of regions with the SISF method. Measurements used for the extraction of the regions
proximity measure, based on the Gaussian distance from the circle, is defined as follows (a minimal sketch of the weights (6.103)–(6.105) is given at the end of this subsection):

w_i^d(c, \sigma) \equiv \exp\left(-\frac{(\|p_i - c\| - \sigma)^2}{2(s\sigma)^2}\right)    (6.103)

where s controls the locality of the detection. For small values of s, only points very close to the circle are considered. Normally s = 0.2 is used. The tangential alignment is calculated with the scalar product between the local gradient of the contour point and the local radial vector of unit length:

w_i^o(c, \sigma) \equiv g_i \cdot \frac{p_i - c}{\|p_i - c\|} = \|g_i\| \cos \angle(g_i, p_i - c)    (6.104)

The final weight w_i(c, σ) for the point (c, σ) is given by the product of the proximity weight w_i^d(c, σ) and the alignment weight w_i^o(c, σ):

w_i(c, \sigma) \equiv w_i^d(c, \sigma)\, w_i^o(c, \sigma)    (6.105)
A saliency measure C(c, σ) for a circular region is given by the product of two contributions: the measure E(c, σ) of the energy of the tangent edges, and the measure H(c, σ) of the entropy of the contour orientation. The first contribution measures how much the localized edges are contrasted and well aligned with the circle:

E(c, \sigma) \equiv \sum_{i=1}^{N} w_i(c, \sigma)^2    (6.106)
where N indicates the number of contour points. The second contribution measures how much the circle has support from a wide distribution of points around its boundary (and not just from some points on one side of it):

H(c, \sigma) \equiv -\sum_{k=1}^{M} h(k, c, \sigma) \log h(k, c, \sigma)    (6.107)
where M indicates the number of intervals in which the orientation of the gradient is discretized.
Fig. 6.64 Results of the region detector with the SISF method for object recognition
The authors used M = 32 intervals in the experiments, each of which contributes as follows:

h(k, c, \sigma) \equiv \frac{1}{\sum_i w_i(c, \sigma)} \sum_{i=1}^{N} w_i(c, \sigma)\, K\!\left(k - \frac{M}{2\pi} o_i\right)    (6.108)
where o_i indicates the orientation angle (in radians) of the gradient vector g_i of the contour point and K(x) is the Gaussian smoothing kernel, K(x) ≡ exp(−x²/2). The final saliency measure for the circle is

C(c, \sigma) \equiv H(c, \sigma)\, E(c, \sigma)    (6.109)
This measure captures well the concept of convex saliency: E assumes large values when there are edge points close to the circle and well aligned with its tangent, while H is large when the edge points with high weight are well distributed in orientation around the circle. The maximum value is obtained when the contours of the image coincide exactly with a complete circle centered at c with radius σ. The saliency measure C(c, σ) obtained with (6.109) to characterize a region is motivated by heuristic intuition rather than by a formal theory. The results obtained with this detector are shown in Fig. 6.64.
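A minimal numpy sketch of the per-point weights (6.103)–(6.105) for a single candidate circle (c, σ) follows; the contour points and their unit gradient vectors are assumed to come from a Canny-type edge extractor, the value s = 0.2 follows the suggestion in the text, and the edge energy of (6.106) is simply the sum of the squared weights.

```python
# Minimal sketch of the weights (6.103)-(6.105) for one candidate circle (c, sigma).
# points: (N, 2) array of contour points p_i; gradients: (N, 2) unit gradient vectors g_i.
import numpy as np

def circle_weights(points, gradients, c, sigma, s=0.2):
    diff = points - c                                                  # p_i - c
    dist = np.linalg.norm(diff, axis=1)
    w_d = np.exp(-((dist - sigma) ** 2) / (2.0 * (s * sigma) ** 2))    # proximity weight (6.103)
    radial = diff / dist[:, None]                                      # unit radial vectors
    w_o = np.sum(gradients * radial, axis=1)                           # tangential alignment (6.104)
    return w_d * w_o                                                   # combined weight (6.105)

# Edge energy of (6.106): E = np.sum(circle_weights(points, gradients, c, sigma) ** 2)
```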
6.12 Summary and Conclusions

As described in the previous paragraphs, a large number of detectors have been developed that differ in the type of basic information used (methodological approach) or in the structures they extract (corners, blobs, edges, regions). Based on these
differentiations and on the characteristics required by the applications (in terms of speed, accuracy, repeatability, distinctiveness, invariance, ...), it is then possible to make an appropriate choice of the detector. The choice of whether or not to use a detector capable of extracting invariant structures must be made carefully, considering that invariance normally requires a greater computational load and brings a lower level of distinctiveness and less localization accuracy. For example, in calibration applications of an image acquisition system (TV cameras, cameras) it is useful to choose a type of point of interest detector (corner or blob) that is localized with good accuracy, compared to object recognition applications, where instead the detector must be more robust in capturing the variability of the appearance of more complex structures of interest (blobs and regions). In applications where the dynamics of the scene require good reactivity of the system, the computational load must be taken into account in the choice of the detector. More generally, the choice of a detector can be made according to some intrinsic properties: for example, the repeatability property, which we recall to be a detector's ability to find the same physical structures in multiple images, and which may conflict with localization accuracy. In [20,30,44] some comparative evaluations of detectors are given, considering repeatability as a discriminating element in the context of affine covariant transformations that take into account changes of viewpoint, scale, and lighting, and artifacts generated by the encoding of compressed images (JPEG). In Table 6.1, the most used detectors described in this chapter are summarized; for each detector, the extracted structures, the main invariance characteristics, and the basic methodology are indicated. Among the corner detectors, Harris's detector is still widely used for its efficiency, rotation invariance, localization accuracy, and high repeatability, although it exhibits a sensitivity to contrast. A variant of this detector, developed by Shi and Tomasi [45], is based on the analysis of the eigenvalues to improve its functionality, considering the minimum of the eigenvalues as a more significant cornerness measure. The SUSAN detector [31], based on the use of a circular mask, and the FAST detector [46], which uses the Bresenham circle, test the pixel under examination against the pixels identified by the mask or on the circle to decide whether it is a corner or not. SUSAN and FAST have the advantage of not calculating derivatives, which on the one hand makes them very efficient (FAST in particular), but the absence of image smoothing makes them sensitive to noise. Although the Harris detector is still widely used, the most recent version of FAST is well regarded, particularly for real-time applications and for its high repeatability. The Hessian detector, like those based on the LoG and DoG, is less considered when high execution speed is required, since it requires the calculation of the second derivatives, and the structures it detects (corners and blobs) are less accurate in terms of localization. All these detectors are to be considered if only rotation invariance is required, keeping in mind that for FAST and SUSAN this invariance is limited, although they have high efficiency and repeatability.
The Fast-Hessian detector component of SURF, using box kernel functions and the properties of the integral image, is particularly fast.
Table 6.1 Summary of the detectors of points and regions of interest described in this chapter: for each detector, the extracted structures, the main invariance properties, and the basic methodology.

Harris (corner; rotation invariant): cornerness measure based on the second-order moments matrix.
Hessian (corner, blob; rotation invariant): cornerness measure based on the Hessian matrix.
SUSAN (corner; limited rotation invariance): compares the pixel under examination with those of a circular mask.
FAST (corner; limited rotation invariance): compares the pixel under examination with those of the Bresenham circle.
DoG (blob; rotation and scale invariant): difference of Gaussian images.
LoG (blob; rotation and scale invariant): Laplacian of Gaussian.
Harris–Laplace (corner; rotation and scale invariant): Harris cornerness measure with the Laplacian of Gaussian as multiscale function.
SURF (blob; scale invariant): cornerness measure based on the Hessian matrix, with the scale-space realized with integral images and scalable rectangular filters.
Harris-Affine (corner/region; affine invariant): points of interest detected with Harris–Laplace and affine elliptical region estimated with the second-order moments matrix.
MSER (region; affine invariant): extracts blobs/regions by means of watershed segmentation and an intensity function characterizing the regions and their contours.
PCBR (region): extracts blobs/regions from local structural information defined by contours and principal curvature.
Hessian-Affine (blob/region; affine invariant): points of interest detected with Hessian–Laplace and affine elliptical region estimated with the second-order moments matrix.
SISF (region; scale invariant): extracts blobs/regions from salient local information defined by contours and their measure of convexity based on gradient and entropy.
Salient Region (region; affine invariant): regions characterized by salient local information based on the entropy of the probability distribution function of the image intensity values.
IBR (region; affine invariant): extracts affine-invariant regions based on the local properties of extreme variability of brightness in the image.
When scale invariance is also required, the most widely used detector is Harris–Laplace, as it exhibits good repeatability and inherits good localization accuracy from the Harris detector. In general, scale-invariant detectors exhibit a less accurate localization due to the search for corners in a 3D scale-space context. The Hessian–Laplace detector, scale invariant, turns out to be more robust than the simple Hessian detector, benefiting from the multiscale analysis in the scale-space to determine the blobs, even though the localization is less accurate. This latter aspect is not critical in object recognition applications, where the Hessian–Laplace detector is widely used due to its high repeatability percentage and the high number of extracted blobs. Multiscale DoG and LoG detectors perform well in applications where the correspondence problem must be solved and in information retrieval applications (e.g., image retrieval) from archives. This is due to the good compromise between accurate localization and accuracy of the estimated scale. When one is interested in extracting structures invariant to scale and to affine transformations, the most used detectors are inherited from those of Harris and Hessian. For the Harris-Affine and Hessian-Affine invariant detectors, the same considerations made previously for those based on Harris and Hessian are valid. Both overcome some
limitations (robustness to noise and inaccurate localization) exhibited by the detectors invariant only to scale, i.e., Harris–Laplace and Hessian–Laplace. In the context of affine-invariant detectors, the distinction between corner, blob, and region is very blurred. The choice of this type of detector is very much influenced by the application context, which generally involves a scenario with objects that change their appearance due to simultaneous changes of scale, lighting, observation point, and movement. In these cases, we want to recognize an object among the many visible in the scene, or the same object with a different pose or while it is in motion. An initial strategy may be to test the simpler affine-invariant detectors (Harris-Affine or Hessian-Affine), in particular when we have isolated objects; subsequently, to extract more complex invariant structures, detectors of salient regions can be chosen. The latter are computationally more expensive, having to detect affine-invariant regions and additionally compute, for each region, global information such as histograms and entropy. The first detectors developed are based on the extraction of edges, from which a limited number of corners are detected, but with good accuracy in terms of localization and repeatability. Other region detectors rely on segmentation to extract uniform regions. Of these, MSER is among the best performing, considering that its segmentation algorithm is of the watershed type, as is the structure-based PCBR detector, which initially extracts the principal curvature image. Once the structures of interest have been extracted (corners, blobs, and regions) and the most distinctive ones have been selected, for different applications (stereo vision, tracking, object recognition, motion detection) it is useful to have further local information about such structures; this is provided by the descriptor component, for example, the one provided by the SIFT, SURF, PCA-SIFT, and GLOH methods. SIFT is the approach (invariant to scale and rotation) most used both for the description and for the detection based on DoG images, applied in particular for object recognition. A good alternative is given by SURF, structured like SIFT, but based on approximate kernel functions and integral images, which makes it faster. SURF is invariant to scale and, although not invariant to rotation, it obtains acceptable results for rotations within 15°. GLOH is another descriptor, considered an extension of SIFT. Several descriptors (not presented in this chapter) have been developed based on the shape (through the contour or the region), on the color, or on the motion of the structure. SIFT and SURF describe the local texture sufficiently well through the descriptor vector.
References

1. C.G. Harris, M. Stephens, A combined corner and edge detector, in 4th Alvey Vision Conference, ed. by C.J. Taylor (Manchester, 1988), pp. 147–151
2. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004)
3. H. Moravec, Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical report, Carnegie-Mellon University, Robotics Institute (1980)
4. C. Tomasi, T. Kanade, Detection and tracking of point features. Pattern Recognit. 37, 165–168 (2005)
5. B. Triggs, Detecting keypoints with stable position, orientation and scale under illumination changes, in Proceedings of European Conference on Computer Vision (2004), pp. 100–113
6. A. Noble, Finding corners. Image Vis. Comput. J. 6(2), 121–128 (1988)
7. P.R. Beaudet, Rotational invariant image operators, in Proceedings of the 4th International Conference on Pattern Recognition (1978), pp. 579–583
8. L. Florack, Image Structure, 1st edn. (Kluwer Academic Publishers, Berlin, 1997). ISBN 978-90-481-4937-7
9. J. Koenderink, The structure of images. Biol. Cybern. 50, 363–370 (1984)
10. J. Sporring et al., Gaussian Scale-Space Theory, 1st edn. (Kluwer Academic Publishers, Berlin, 1997). ISBN 978-90-481-4852-3
11. T. Lindeberg, Scale-space theory: a basic tool for analysing structures at different scales. J. Appl. Stat. 21(2), 224–270 (1994)
12. A.P. Witkin, Scale-space filtering, in Proceedings of 8th International Joint Conference on Artificial Intelligence, Karlsruhe, Germany (1983), pp. 1019–1022
13. T. Lindeberg, Feature detection with automatic scale selection. Int. J. Comput. Vis. 30(2), 77–116 (1998)
14. K. Mikolajczyk, Detection of local features invariant to affine transformations. Ph.D. thesis, Institut National Polytechnique de Grenoble, France (2002)
15. T. Lindeberg, L. Bretzner, Real-time scale selection in hybrid multi-scale representations. Springer Lect. Notes Comput. Sci. 2695, 148–163 (2003)
16. D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160(1), 106–154 (1962)
17. J. Koenderink, A. van Doorn, Representation of local geometry in the visual system. Biol. Cybern. 57, 383–396 (1987)
18. T. Lindeberg, Invariance of visual operations at the level of receptive fields. PLOS ONE 8(7), 1–33 (2013)
19. J.S. Beis, D.G. Lowe, Shape indexing using approximate nearest-neighbour search in high-dimensional spaces, in CVPR Proceedings Conference on Computer Vision and Pattern Recognition (1997), pp. 1000–1006
20. K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)
21. E. Rosten, T. Drummond, Fusing points and lines for high performance tracking, in Proceedings of the International Conference on Computer Vision (2005), pp. 1508–1511
22. H. Bay, B. Fasel, L. Van Gool, Interactive museum guide: fast and robust recognition of museum objects, in Proceedings of the First International Workshop on Mobile Vision (2006)
23. H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in Proceedings of European Conference on Computer Vision (2006), pp. 404–417
24. H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, Speeded-up robust features (surf). Int. J. Comput. Vis. Image Understand. 110(3), 346–359 (2008)
25. P.A. Viola, M.J. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of Conference on Computer Vision and Pattern Recognition (2001), pp. 511–518
26. K. Mikolajczyk, C. Schmid, Indexing based on scale invariant interest points, in Proceedings of the 8th International Conference on Computer Vision (2001), pp. 525–531
27. K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors. Int. J. Comput. Vis. 60(1), 63–86 (2004)
28. T. Lindeberg, J. Garding, Shape-adapted smoothing in estimation of 3d depth cues from affine distortions of local 2d structure. Image Vis. Comput. 15, 415–434 (1997)
29. J. Garding, T. Lindeberg, Direct computation of shape cues using scale-adapted spatial derivative operators. Int. J. Comput. Vis. 17(2), 163–191 (1996)
30. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, A comparison of affine region detectors. Int. J. Comput. Vis. 65(1/2), 43–72 (2005)
31. S.M. Smith, J.M. Brady, SUSAN—a new approach to low level image processing. Int. J. Comput. Vis. 23(1), 45–78 (1997)
32. M. Trajkovic, M. Hedley, Fast corner detection. Image Vis. Comput. 16, 75–87 (1998)
33. E. Rosten, T. Drummond, Machine learning for high speed corner detection, in Proceedings of European Conference on Computer Vision (2006), pp. 430–443
34. E. Rosten, R. Porter, T. Drummond, Faster and better: a machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell. 32(1), 105–119 (2010)
35. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
36. J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in Proceedings of British Machine Vision Conference (2002), pp. 384–396
37. T. Tuytelaars, L. Van Gool, Matching widely separated views based on affine invariant regions. Int. J. Comput. Vis. 59(1), 61–85 (2004)
38. T. Kadir, A. Zisserman, M. Brady, An affine invariant salient region detector, in Proceedings of the 8th European Conference on Computer Vision (2004), pp. 345–457
39. V. Gopalakrishnan, Y. Hu, D. Rajan, Salient region detection by modeling distributions of color and orientation. IEEE Trans. Multimed. 11(5), 892–905 (2009)
40. T. Tuytelaars, L. Van Gool, Content-based image retrieval based on local affinely invariant regions, in Proceedings of International Conference on Visual Information Systems (1999), pp. 493–500
41. H. Deng, W. Zhang, E. Mortensen, T. Dietterich, L. Shapiro, Principal curvature-based region detector for object recognition, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2007), pp. 1–8
42. C. Steger, An unbiased detector of curvilinear structures. IEEE Trans. Pattern Anal. Mach. Intell. 20(2), 113–125 (1998)
43. F. Jurie, C. Schmid, Scale-invariant shape features for recognition of object categories, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2004), pp. 90–96
44. T. Tuytelaars, K. Mikolajczyk, Local invariant feature detectors: a survey. Found. Trends Comput. Graph. Vis. 3(3), 177–280 (2007)
45. J. Shi, C. Tomasi, Good features to track, in Proceedings of the Conference on Computer Vision and Pattern Recognition (1994), pp. 593–600
46. E. Rosten, R. Porter, T. Drummond, Faster and better: a machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell. 32(1), 105–119 (2009)
Index
Symbols linear transformation 2D discrete, 73 based on Eigenvectors and Eigenvalues, 95 orthogonal, 70 orthonormal, 71 unitary, 70 2D sinusoidal transforms DCT-Discrete Cosine, 78 direct DFT, 77 DST-Discrete Sine, 82 Inverse DFT (IDFT), 77 4-connectivity, 291 8-connectivity, 291 8-neighbors, 29 A Adaptive Median Filter Algorithm, 226 additive noise model ergotic stochastic process, 213 Erlang (gamma), 215 exponential, 216 normal stationary stochastic process, 212 Rayleigh, 214 uniform, 217 white, 212 aliasing phenomenon, 183 analysis process, 75 anisotropic diffusion, 266 anisotropic response, 338 anisotropic scaling, 164 atmospheric turbulence blurring, 238 augmented vector, 150 autocorrelation function, 246
autocorrelation matrix, 344, 348, 382 B base vectors, 71, 97, 156 basis images, 75 DCT, 79 DFT, 77 Haar, 93 Hadamard, 86 PCA/KLT, 104 Slant, 91 BBF-Best Bin First, 367 bilateral filter, 267 binary image, 14, 43 blind deconvolution, 235, 264 blind restoration, 210 blurring, 211 Bresenham circle, 397 C Canny algorithm, 44 CLS-Constrained Least Squares filter, 259 clustering, 320 color space HSI, 288 HSV, 288 L*a*b*, 330 L*u*v*, 330 RGB, 288, 322 comb function, see Shah pulse train function compass edge detector, 22 complete inverse filter, 241, 254 connected components labeling algorithm, 295
426 contour, 272 closed path, 291 simple path, 291 contour extraction algorithms, 290 border following, 290 control points, 177, 180 convolution mask, 5 convolution operator, 10, 70 convolution theorem, 49, 185, 239 corner fast detectors FAST, 399 SUSAN, 395 Trajkovic-Hedley segment test, 397 corner response, see cornerness measure cornerness measure DET, 349 harmonic mean, 348 Harris–Stephens, 343 KLT, 348 Moravec, 336 cornerness map, 336 correlation matrix, 113, 348 covariance matrix, 98, 99, 103, 104 cross-correlation function, 248 cross-ratio, 169 cut-off frequency, 49, 64 D deconvolution, 211 dehazing, 268 DFT-Discrete Fourier Transform, 69 DHT-Discrete Hartley Transform, 82 diffusion equation, 266 diffusion tensor, 266 DLT-Direct Linear Transformation, 170 DoG-Difference of Gaussians, 37, 356 DPCM-Differential Pulse Code Modulation, 107 DWT applications edge extraction, 143 image compression (JPEG 2000), 141 image noise attenuation, 142 DWT-Discrete Wavelet Transform, 134 E EBR-Edge Based Region, 410 edge line discontinuity, 2 ramp, 2 roof, 2 step discontinuity, 1
Index edge enhancement filter, see also sharpening operator, 10 edge linking algorithm, 292 edge orientation map, 11 eigenspace, 95, 103 eigenvalues and eigenvectors, see linear transformation Epanechnikov kernel, 325 F face recognition, 114 FAST detector machine learning paradigms, 400 performance evaluation, 402 FAST Enhanced Repeatibility, 402 FAST-Features from Accelerated Segment Test, 399 feature descriptors, 335 feature detectors, 334 feature matchers, 335 FFT-Fast Fourier Transform, 83 FHT-Fast Hartley Transform, 83 filter band-pass, 31, 54 Butterworth band-pass, 55 Butterworth band-stop, 54 Butterworth high-pass, 50 DoG, 37 DoG band-pass, 56 DroG, 24 Gaussian band-pass, 56 Gaussian band-stop, 54 Gaussian high-pass, 52 high-pass, 31, 48, 137 ideal band-pass, 55 ideal band-stop, 52 ideal high-pass, 49 Laplacian, 57 LoG, 31 low-pass, 5, 10, 31, 137 frequency-domain periodic noise reduction, 228 Butterworth notch stop filter, 231 Gaussian notch stop filter, 231 ideal notch stop filter, 230 notch pass filter, 232 optimum notch filter, 232 FT-Fourier Transform, 69
Index G Gabor function, see STFT-Short-Time Transform Fourier gamma distribution, 215 Gaussian filter, 17, 353 geometric affine transformations, 160 compound, 164 generalized, 161 of scaling, 164 of shearing, 164 of similarity, 161 geometric operator, 149 direct projection, 151 inverse projection, 153 magnification or reduction, 154 rotation, 155 skew, 155 specular, 156 translation, 154 transposed, 156 geometric and photometric image registration, 181 geometric image registration, 177 GLOH-Gradient Location and Orientation Histograms, 368 GM-Geometric Mean filter, 260 gradient filter, 5 convolution masks, 8 directional derivative, 7 finite differences approximated, 8 isotropic, 7 symmetric differences, 10 gradient vector direction, 6 directional, 7 magnitude, 6 gradient properties, 7 H Haar wavelet transform, 376 hard threshold, 143 Harris affine detector, 386 elliptical shape, 388 iterative adaptation algorithm, 391 iterative shape adaptation, 390 matching measure, 393 repeatability measure, 393 Harris corner detector procedure, 344 Harris interest points partial photometric invariance, 347 rotation invariant, 347
427 Hermitian matrix, 70 Hessian detector multiscale, 385 Hessian affine detector, 393 homogeneous geometric transformations rotation, 158 roto-translation, 158 translation, 158 homogeneous coordinates, 149, 156, 175 homographic transformation applications camera calibration, 173 image morphing, 173 image mosaicing, 173 image registration, 173 image warping, 173 panoramic mosaic, 173 perspective distortions removal, 172 homography transformation, 167 homomorphic filter, 258 Hotelling transform, see PCA-Principal Component Analysis Hough transform circles detection, 303 generalized, 305 lines detection, 299 randomized, 307 hysteresis, 43, 414 I IBR-Intensity extrema-Based Regions, 407 IID-Independent and Identically Distributed, 212 ill-posed problem, 182 image degradation function estimation, 235 by experimentation, 236 by observation, 235 by physical-mathematical model, 236, 238 image resampling, 182 image restoration, 211 impulse noise model bipolar, 217 unipolar, 217 impulse response, see PSF-Point Spread Function inner product, 7, 71, 74, 76, 135 integral image, 370 interest points descriptor GLOH, 368 SIFT, 362 SURF, 376 interest points detector
428 fast Hessian, 371 Harris–Stephens, 340 Harris–Stephens with variants, 348 Harris-Laplace, 381 Hessian, 349 Hessian–Laplace, 384 Moravec, 335 SIFT, 357 SURF, 370 interest points and region detectors in comparison, 418 interpolation operator, 149 B-spline, 197 bicubic, 194 biquadratic, 192 cubic B-spline, 200 first order or bilinear, 190 ideal, 184 least squares, 200 non-polynomial, 201 operators comparison, 202 quadratic B-spline, 199 zero order, 189 interpolation impulse response, 183 inverse filter application, 242–244 inverse filtering, 239 ISF-Scale Invariant Shape Features, 416 isometric transformations, 159 isotropic diffusion, 266 J JPEG-Joint Photographic Experts Group, 80 K K-means, 321 kernel box, 372, 373 kernel box function, 421 key points, see PoI-Points of Interest L Lanczos sinc window, 201 Laplace operator, 28, 259 Laplacian filter, see Laplace operator Laplacian operator, 26, 278 LED local operator Canny, 39 directional gradient, 22 DoG filter, 37 Frei & Chen, 16 Gaussian derivative, 24 gradient based, 11
Index gradient-threshold, 13 LoG filter, 30 Prewitt, 16 Roberts, 12 second order differentiation, 26 second-directional derivative, 38 Sobel, 14 LED operators comparison, 16 LED-Local Edge Detector, 1 line detection, 47 linear transform, 69 linear transformations, 151 LoG-Laplacian of Gaussian, 31, 353 look-up table, 40, 153, 297 M Markov-1 process, 103 Markovian matrix, 105 matrix diagonalizable, 97 Hessian, 349 identity, 71 kernel, 73 orthogonal, 70 rotation, 109, 342 similarity, 97 spectrum, 95 symmetric, 73, 97 Vandermonde, 195 mean-shift for filtering, 330 for ftracking, 331 mean-shift algorithm, 321 MLE-Maximum Likelihood Estimation, 264 Moravec algorithm, 337 mother wavelet function, 125, 126 motion blurring, 236 MSE-Mean Squared Error, 103, 110 MSER region detector affine invariant, 406 multiscale, 406 MSER-Maximally Stable Extremal Regions, 404 N NMS-Non-Maximum Suppression, 41, 337, 397, 402 noise removal filter, 219 adaptive, 225 adaptive median, 226 alpha-trimmed mean, 222
Index contraharmonic mean, 221 geometric mean, 219 harmonic mean, 220 max and min, 221 median, 221 midpoint, 221 order statistics, 221 non-linear transformations, 151, 154, 178 nonlinear diffusion filter, 265 nonlinear iterative deconvolution filter, 262 Nyquist-Shannon theorem, 131, 184 O optical image deformation linear, 179 nonlinear (barrel), 179 nonlinear (pincushion), 179 optimal filter estimation, 244 Wiener, 244 Wiener filter with uncorrelated noise & original image, 249 outer product, 78, 87, 121 oversampling, 183 P parametric Wiener filter, see GM-Geometric Mean filter Parseval’s theorem, 258 Parzen windows, 323 PCA-Principal Component Analysis, 98 data compression, 103 dimensionality reduction, 109 eigenface, 114 principal axes computation, 109 significant components calculation, 111 PCA/KLT-Karhunen–Love Transform, see PCA-Principal Component Analysis PCBR-Principal Curvature Based Region, 412 periodic noise, 217 perspective transformation, 174 piecewise interpolating function, 197 pinhole camera, 174 PoI-Points of Interest, 333 point detection, 47 power spectrum magnitude, 247 phase, 247 principal axes, 101 projective plane, 150 projective space, 151
429 projective transformation, see homography transformation PSE-Power Spectrum Equalization filter, 257 pseudo-inverse filter, 241, 255 PSNR-Peak Signal-to-Noise Ratio, 203, 262 Q QMF-Quadrature Mirror Filter, 134 quadratic form, 97, 341 R random ergodic variable, 245 rank, 121 RANSAC-RANdom SAmple Consensus, 308, 407 refinement equation, 134 region homogeneity, 272, 273 spatially connected, 273 region detectors Affine Salient, 408 EBR, 410 IBR, 407 MSER, 404 PCBR, 412 SISF, 416 region filling algorithm, 297 regions detectors and descriptors, 403 resampled function, 184 ringing artifacts, 51, 201 RMSE-Root Mean Square Error, 203 S sampling function, 184 sampling theory, 183 scale invariant interest points, 350 scale-space function, 353 scale-space representation, 353 scale-space theory, 353 scaling function, 131, 134, 140 segmentation methods based on contour, 289 based on local statistical, 287 based on multi-band threshold, 287 by adaptive threshold, 284 by background normalization, 285 by dynamic threshold, 275 by gradient & Laplacian, 277 by histogram analysis, 276 by iterative automatic threshold, 278 by iterative threshold seclection, 279
430 by local threshold, 275 by Otsu algorithm, 280 by sub-images decomposition, 284 by watershed transform, 315 region based, 309 region-growing, 309 region-splitting, 312 split-and-merge, 313 using clustering algorithms, 319 using K-means, 321 using mean-shift, 328 segmentation based on objects/background ratio, 275 segmentation by global threshold background, 273 foreground, 273 segmentation process, 272 Shah pulse train function, 185 sharpening operator high frequency emphasis filter, 64 high-boost filter, 62 homomorphic filter, 64, 218 linear filter, 58 unsharp masking, 60 SIFT descriptor biological motivation, 366 computational complexity, 367 contrast invariant, 366 descriptors generation, 364 dominant orientation detection, 363 SIFT detector DoG pyramid, 358 extreme points filtering, 361 interest point localization, 358 local extremes detection, 358 scale space, 358 SIFT-Scale Invariant Feature Transform, 357 simulated annealing, 402 sinc interpolation function, 187 smoothing filters, 5, 9 smoothing spline, 197 SNR-Signal-to-noise ratio, 252 soft threshold, 143 speckle noise, 218 spectral power, 260 spline, 197 SSD-Sum of Squared Differences, 335 SSD-Sum of Squared Distances, 319 STFT-Short-Time Transform Fourier, 124 structure tensor, 341 subband coding, 131
Index SURF descriptor descriptor vector calculation, 377 dominant direction detection, 376 SURF detector-descriptor precision, 380 recall, 380 SURF detector fast Hessian, 371 integral image, 370 scale space, 374 SURF-Speed-Up Robust Features, 369 SUSAN-Smallest Univalue Segment Assimilating Nucleus, 395 SVD-Singular Value Decomposition, 171, 180 SVD-Singular Value Decomposition transform, 120 synthesis process, 75 T transfer function, 5, 9 transform based on rectangular functions Haar, 93 Hadamard, 84 Slant, 91 Slant–Hadamard, 92 Walsh, 89 transform coefficients, 70 transformation separability, 166 U unitary matrix, 70 unitary transformations properties, 76 W warping transformations, 151 watershed transform, 315 based on flooding simulation, 315 using markers, 317 wavelet biorthogonal, 140 Daubechies, 138 mother, 127 orthonormal, 134 Haar, 139 wavelet transforms, 123 band-pass filter, 128 biorthogonal, 139 continuous 1D-CWT, 125 continuous 2D-CWT, 127 discrete 1D-DWT, 129 discrete 2D-DWT2, 134
Index fast FWT, 131 inverse continuous 1D-ICWT, 127 inverse continuous 2D-ICWT, 127 inverse discrete 1D-IDWT, 131 inverse discrete 2D-IDWT2, 138
431 Wiener filter application: 1D example, 254 Wiener filter application: 2D example, 255 Z zero crossing, 26, 188